Window Function

The Window Function performs calculation across a set of related table rows. The set of related rows is called a Partition or a Window Frame.

When to use

Use window functions when you want to calculate something based on the relation of rows with respect to each other. Such scenarios may include:

  1. Cumulating values across a set of rows or finding minimum, averages etc.
  2. Ranking rows, finding difference between previous and current row
  3. Find relationship among values within a Partition.

Quick Start

Look at the following data:

Window function example

In the above example, there are 3 partitions of related rows. The rows in each partition share the same entry in the column Group Name namely Laptop, Smartphone and Tablet

For example, we would like to determine the following:

  1. The total revenue for every group in Group Name
  2. Regroup the data into 3 major price groups - Price Group

To calculate the total revenue for every group in Group Name, the sum of the column price needs to be evaluated in each partition. The data appears as shown below:

Sum over a partition

The data has been partitioned by the column Group Name and calculated total of column Revenue within each column. This is achieved by choosing Apply Partitioning option and choosing the right column to Partition By

Note how values in Group Sum is displayed as an aggregate value. This is done by setting the Calculate As option to Aggregate over partition

Sum over a partition configuration

When aggregating data, unlike the group task, the use of a window function does not cause rows to be grouped into a single output row with aggregated values in a new column. The rows retain their separate identities and aggregated values are repeated across each of the rows.

You can can also calculate a running sum of the column Revenue over this partition.

Running total over a partition

This is done by setting the Calculate As option to Accumulate over partition

Running total over a partition configuration

Partitioning is not mandatory. You can calculate the running total over the entire data like this:

Running total without partition

This is done by not choosing to partition over a column, by unchecking the Apply Partitioning option.

Running total without partition configuration

Regroup the data into 3 major price groups - Price Group

The data appears as shown below:

Regrouping data based on price

This is done by the NTile function. Note that there is no partition chosen, and the only option used is sort by Revenue column.

Notice that the sort option in the configuration says Revenue(asc). This option decides the column based on which the data is grouped and sorted.

The configuration is shown below:

Regrouping data based on price configuration

Supported Options

Following options are provided by Window Functions:

  1. Calculate: To specify the function you want to use.
  2. Of Column: To specify the column which the function should be use. This is only visible when the function selected in Calculate section requires a column.
  3. Apply Partitioning: To specify if the data should be partitioned.
  4. Partition by: To specify which column to partition. This is only visible when Apply Partitioning is chosen.
  5. Calculation Type: To specify the type of operation over partition that the function chosen in Calculate section will operate on. Supports two options, Accumulate over partition or Aggregate over partition
  6. N: The value for N. This shows up only when the function selected is NTILE or NTH_VALUE.
  7. Sort: To specify the sort for the window. Check usage for sort here. Also see Sorting in tasks.
  8. Apply result into: With this you can apply the result into a new column of name. See result documentation

Types of Functions

The Window functions supported by Mammoth fall into following categories.

  1. Aggregation Functions
  2. Ranking Functions
  3. Relative Functions

Aggregation functions

Refer Quickstart for practical usage of the functions. Also check Accumulate over partition and Aggregate over partition before diving into the function definitions and usage.

All the calculations using the following functions can be divided by a partition, and if a partition is not specified, then the calculations are done considering the entire data as the window.

Sum

Calculate the sum for a given numeric column.

The data appears as shown below:

Calculating sum

The configuration is as shown below:

Calculating sum configuration

Average

Calculate the average for a given numeric column.

The data appears as shown below:

Calculating average

The configuration to calculate the average over every group is shown below:

Calculating average configuration

Maximum

Determine the maximum value in a given numeric column.

The data appears as shown below:

Calculating Max value

The configuration is shown below:

Calculating Max value configuration

Minimum

Similar to Maximum, calculate the minimum value in a given numeric column.

The data appears as shown below:

Calculating Min value

The configuration is shown below:

Calculating Min value configuration

Standard Deviation

By definition, Standard deviation is a number used to tell how measurements for a group are spread out from the average/mean, or expected value. A lower value means that most of the numbers are close to the average. A higher value means that the numbers are more spread out.

The data appears as shown below:

Calculating Standard Deviation

The configuration is shown below:

Calculating Standard Deviation configuration

Variance

Variance is used for measuring the spread of data, similar to Standard Deviation. It grows very large when there is uneven spread in the distribution of data.

The data appears as shown below:

Calculating Variance

The configuration is shown below:

Calculating Variance Configuration

Count

Calculate the number of occurences of a value in any column. Demonstrating this using the same example as above to check the number of times each month occurs, which can later be filtered using explore cards.

The data apepars as shown below:

Calculating Count

The configuration is as below:

Calculating Count configuration

Accumulate over partition

When you create an operation that accumulates over partition, it would create values in rows that successively apply the same operation over each row and accumulate that value. In the below table the Accumulated column shows marks Accumulated over the Student Partition as sum

Student Subject Marks Accumulated
Alice Math 87 87
Alice Science 92 179
Alice English 87 266
Alice Economics 89 355
Morris Math 93 93
Morris Science 89 182
Morris English 82 264
Morris Economics 97 361

Aggregate over partition

When you create an operation that aggregates over partition, it would create values in rows that provide the result of operation over all values across all rows for a column. In the below table the Aggregated column shows marks Aggregated over the Student Partition as sum

Student Subject Marks Aggregated
Alice Math 87 355
Alice Science 92 355
Alice English 87 355
Alice Economics 89 355
Morris Math 93 361
Morris Science 89 361
Morris English 82 361
Morris Economics 97 361

Ranking functions

The following ranking functions are supported By Mammoth. Ranking functions always require a column on which the data is sorted.

Row Number

Evaluate the number of the current row withing its partition, starting from 1. In the example, we assign row numbers for every item on Product Name with partitioned by Group Name.

The data appears as below:

Row number

The configuration to achieve this is shown below:

Row number Configuration

Rank

Rank of the current row with gaps. Gaps mean that if there are multiple rows with the same values, the subsequent number that was repeated would be skipped. For example, In a series [A,B,B,C,D], the rank of 2nd B would be 2 while rank of C would be 4.

A good use case would be to calculate the best selling product in each group as demonstrated in the example. The partition by option decides the window, while the sort option decides the column based on which ranks are assigned, and the sort order of the data.

The data appears as shown below:

Rank

The configuration to achieve this is shown below:

Rank Configuration

Dense Rank

Similar to rank, but without gaps. For example in a series [A,B,B,C,D], the rank of 2nd B would be 2 while rank of C would be 3

The data appears as shown below:

Dense Rank

The configuration to achieve this is shown below:

Dense Rank Configuration

Percent Rank

This represents the relative rank of the current row. Mathematically represented as: (rank - 1) / (total rows - 1). The value of percent rank ranges from 0-1, the first row always having ‘0’ as the value, increasing from there. The best and least grossing products in the given data is computed with this function.

The data appears as shown below:

Percent Rank data

The configuration is as shown below:

Percent Rank Configuration

Cumulative Percent Rank

Relative rank of the current row: (number of rows preceding or peer with current row) / (total rows). Check the example below, from the cumulative percent rank values, it is evident that over 55% products generate a revenue more than 600.

The data appears as shown below:

Cumulative Percent Rank

The configuration is as shown below:

Cumulative Percent Rank Configuration

Ntile

Integer ranging from 1 to the argument value, dividing the partition as equally as possible. This requires an additional option N to be set. Please note that setting N to 100 will not show you percentile unless there are more than 100 rows in that partition.

NTile function divides a result set into a number of buckets and assign a bucket number to each row. Check the data below divided almost evenly among the rankings 1 through 4.

The data appears as shown below:

Ntile

The configuration is shown below:

Ntile configuration

Relative functions

First Value

Find the first value in a given column in a partition. In the given example, we are determining the least grossing product in every group.

The data appears as shown below:

First Value

Determine the least grossing product by using sort. This option determines the value by which first value is determined, and the sort order.

The configuration is shown below:

First Value configuration

Last Value

Simlar to First Value, this function is used to determine the last value in a given column in a partition.

The data appears as shown below:

Last Value

The configuration is shown below:

Last Value configuration

Nth Value

The Nth Value allows you to get a value from the Nth row in an ordered set of rows. In the example, we have set the value of ‘N’ to 2, applied partitioning by Group Name and sorted Revenue column in descending order to determine the second best seller.

The data appears as shown below:

Nth Value Value

The configuration is shown below:

Nth Value Configuration

Lead

Find the value in a given column in the next row. This will be empty for the last row in a given window frame. The Lead function can be very useful for comparing the value of the current row with the value of the following row.

The data appears as shown below:

Lead Value Value

The configuration is shown below:

Lead Value Configuration

Lag

Find the value in a given column in the previous row. This will be empty for the first row in a given window frame.

The data appears as shown below:

Lag Value Value

The configuration is shown below:

Lag Value Configuration

See also

Number Formatting

The numeric formatter documentation

Sort in tasks

Explanation on how sorting works in this task.

Numeric functions

Description of the functions