Window Function¶

The Window Function performs calculation across a set of related rows called a Partition, Group or a Window Frame.

Table of Contents

When to use ¶

Use window functions when you want to calculate Sum, Average, Rank, Percentile, and so on over a partition of data. Each group is considered a window and the rule is applied to every window. For instance, you can use the Window Function to find:

Cumulating sum or average values across a set of rows

Rank of rows in each group

Difference between the previous and the current row

Relationship among values within a group, amongst others.

Quick Start ¶

Look at the following data:

In the above example, the data is grouped in 4 groups. The rows in each group share the same entry in the column Group namely Mocktails, Starter, Main Course, and Ice-Cream.

Suppose we would like to do the following:

Determine the total revenue for each group
Regroup the data into 4 major revenue groups

To calculate the total revenue for each of the four groups in Group, we will use the SUM (Aggregate) function on the Revenue column, partitioned by Group. Like this:

The data appears as shown below:

You can can also calculate a running sum of Revenue over the same group.

This is done by choosing the Type as Accumulate.

Running total over a group configuration

Note that grouping is NOT mandatory. You can calculate the running total of the entire data as a single window as well:

This is done by not adding grouping.

Running total without group configuration

Note

Unlike the Group & Aggregate function, Window function does not return a single output row against each group. Instead, each row retains its separate identity and aggregated values are repeated across each of the rows.

Regroup the data into 4 major revenue groups.

The data appears as shown below:

This is done by the NTile function. Note that here no group is chosen, and the only option used is sort by Revenue (Asc) column. This option decides the column based on which the data is sorted.

The configuration is shown below:

Regrouping data based on price configuration

Supported Options ¶

Following options are provided by Window Functions:

Calculate: To specify the function you want to use.

Of Column: To specify the column which the function should use. This is only visible when the function selected in Calculate section requires a column.

Add Group: To specify partitioning or grouping

Type: To specify the type of operation over grouping that the function chosen in Calculate section will operate on. Supports two options, Accumulate over partition or Aggregate over partition

N: To specify the number of buckets in the NTile function and the value of N in Nth_Value function. This option shows up only when NTILE or NTH_VALUE functions are selected.

Add Sort: To specify the sort for the window. Check usage for sort here. Also see Sorting in tasks.

Apply result into: With this you can apply the result into a new column or an existing one. See result documentation.

Limit ranking: To limit the results to (or upto) a specified rank number. Say you set the Limit ranking to 3. You’ll get rows with ranks - 1, 2 and 3 only. All other rows will be ignored. See Limit Ranking illustration here.

Types of Functions ¶

The Window functions supported by Mammoth fall into three broad categories:

Aggregation Functions
Ranking Functions
Relative Functions

Aggregation functions ¶

Aggregation functions include functions like - Sum, Average, Maximum, Minimum, Standard deviation, Variance, and Count. All the calculations using the following functions can be divided by a group. In case a group is not specified, then the calculations are done considering the entire data as a single window.

Refer Quickstart for practical usage of the functions. Also get familiar with Accumulate over partition and Aggregate over partition before diving into the function definitions and usage.

Sum ¶

This function calculates the sum for a given numeric column.

The data appears as shown below:

The configuration is as shown below:

Check out how you can calculate running total over grouped & non-grouped data.

Average ¶

This function calculates the average of a given numeric column.

The data appears as shown below:

The configuration to calculate the average (aggregated) over grouping is shown below:

Calculating average (aggregate) configuration

You can also calculate running average by setting the Type as Accumulate.

Calculating average (accumulate) configuration

Configuration for Average (accumulate) is as follows:

Maximum ¶

This function determines the maximum value in a given numeric or date column.

The data appears as shown below:

The configuration is shown below:

Calculating Max value (aggregate) configuration

Calculate running maximum value by setting Type as Accumulate.

The configuration is shown below:

Calculating Max value (accumulate) configuration

Minimum ¶

Similar to Maximum, calculate the minimum value in a given numeric or date column.

The data appears as shown below:

The configuration is shown below:

Calculate running minimum by setting Type as Accumulate.

The configuration is shown below:

Standard Deviation ¶

Standard deviation represents how far a data deviates from its average/mean, or expected value. A lower value means that most of the numbers are close to the average. A higher value means that the numbers are more spread out.

The data appears as shown below:

The configuration is shown below:

Calculating Standard Deviation configuration

Calculate the running Standard Deviation by setting Type as Accumulate.

The configuration is shown below:

Variance ¶

Variance is used for measuring the spread of data, similar to Standard Deviation. It grows very large when there is uneven spread in the distribution of data.

The data appears as shown below:

The configuration is shown below:

You can calculate running Variance by choosing Accumulate as Type.

The configuration is shown below:

Count ¶

Count represents the number of occurrences of a value in any column.

The data apepars as shown below:

The configuration is as below:

You can calculate running count by choosing Accumulate as setting Type.

The configuration is as below:

Accumulate over partition

When you create an operation that accumulates over grouping, it would create values in rows that successively apply the same operation over each row and accumulate that value. In the below table the Accumulated column shows marks Accumulated over the Student group as sum

Student	Subject	Marks	Accumulated
Alice	Math	87	87
Alice	Science	92	179
Alice	English	87	266
Alice	Economics	89	355
Morris	Math	93	93
Morris	Science	89	182
Morris	English	82	264
Morris	Economics	97	361

Aggregate over partition

When you create an operation that aggregates over grouping, it would create values in rows that provide the result of operation over all values across all rows for a column. In the below table the Aggregated column shows marks Aggregated over the Student group as sum

Student	Subject	Marks	Aggregated
Alice	Math	87	355
Alice	Science	92	355
Alice	English	87	355
Alice	Economics	89	355
Morris	Math	93	361
Morris	Science	89	361
Morris	English	82	361
Morris	Economics	97	361

Ranking functions ¶

The following ranking functions are supported By Mammoth. Ranking functions always require a column on which the data is sorted.

Row Number ¶

Row number evaluates the number of the current row within its group.

The data appears as below:

The configuration to achieve this is shown below:

Rank ¶

The Rank function ranks the rows based on values in a column, but with gaps. Here, gaps mean that if there are multiple rows with same values for the selected column, the subsequent number that was repeated would be skipped. For example, In a series [A,B,B,C,D], the rank of 2nd B would be 2 while rank of C would be 4.

A good use case would be to calculate the best selling product in each group as demonstrated in the example. The Add group option decides the grouping, while the sort option decides the column based on which ranks are assigned, and the sort order of the data.

The data appears as shown below:

The configuration to achieve this is shown below:

You can choose to limit results with the Limit ranking option. Like this:

The data appears as follows:

Dense Rank ¶

Similar to rank, but without gaps. For example in a series [A,B,B,C,D], the rank of 2nd B would be 2 while rank of C would be 3.

The data appears as shown below:

The configuration to achieve this is shown below:

Percent Rank ¶

This represents the relative rank of the current row. Mathematically represented as: (rank - 1) / (total rows - 1). The value of percent rank ranges from 0-1, where the first row always has ‘0’ as the value. The best and least grossing products in the given data is computed with this function.

The data appears as shown below:

The configuration is as shown below:

Cumulative Percent Rank ¶

Relative rank of the current row: (number of rows preceding or peer with current row) / (total rows). Check the example below, from the cumulative percent rank values, it is evident that over 55% products generate a revenue more than 600.

The data appears as shown below:

The configuration is as shown below:

Ntile ¶

The Ntile function tries to segregate the data into buckets as equally as possible. The number of buckets created is defined by the value of N. The size of each bucket is calculated by dividing the total number of rows by N, and is chosen as the nearest integer <= the quotient.

Let’s understand this with the help of a dataset. Suppose we have this dataset and we want to segregate this in 4 revenue groups. We would specify N as 4.

Since here the total number of rows is 13, the system will create 4 buckets each of size 13/4= 3.22, i.e. approx 3 rows in each bucket. The remaining rows will be adjusted equally in each bucket, starting from bucket 1.

The data appears as shown below:

The configuration is shown below:

Note

When N>=the total number of rows, the system creates as many buckets as possible. Say you have a dataset with 20 rows and you try to create 100 buckets. The system will try to create the largest possible number of buckets, here 20. The size of each bucket will be 1 in this case.
To calculate percentile, you must have a dataset with more than 100 rows in that group. Setting N to 100 will not show you percentile otherwise.

Relative functions ¶

First Value ¶

It displays the first value of a given numeric or date column in an ordered dataset. In the given example, we are determining the least grossing product in every group.

The data appears as shown below:

Determine the least grossing product by using sort. This option determines the value by which first value is determined, and the sort order.

The configuration is shown below:

You can also calculate the running first value by setting the Type as Accumulate.

Last Value ¶

Simlar to First Value, this function is used to determine the last value in a given numeric or date column in a group.

The data appears as shown below:

The configuration is shown below:

You can also calculate the running last value by setting the Type as Accumulate.

Nth Value ¶

The Nth Value allows you to get a value from the Nth row in an ordered set of rows. In the example, we have set the value of ‘N’ to 2, applied partitioning by Group Name and sorted Revenue column in descending order to determine the second best seller.

The data appears as shown below:

The configuration is shown below:

You can also calculate the running Nth value by setting the Type as Accumulate.

Lead ¶

Find the value in a given column in the next row. This will be empty for the last row in a given window frame. The Lead function can be very useful for comparing the value of the current row with the value of the following row.

The data appears as shown below:

The configuration is shown below:

Lag ¶

Find the value in a given column in the previous row. This will be empty for the first row in a given window frame.

The data appears as shown below:

The configuration is shown below: