Fill Missing Values with Grouping

Filling missing values with grouping is a unique function which segregates the data and fills the null or empty values in a Dataset.

Quick Start

Let’s take a sample data set “Hospital” which shows the information about certain patients. You can download this sample data here .

Table 15 Hospital → View 1
Row Number Patient entry ID Date Name of Patient Blood Group Any prior history? Address
1 0001 06/01/2018 Luke A+   Minneapolis
2 0002 22/03/2018 Nguyen B+ Diabetic Ohio
3 0003 22/03/2018 Nguyen      
4 0004 14/09/2017 John AB+ High blood pressure San Diego
5 0005 06/01/2018 Luke     Minneapolis
6 0006 26/11/2017 Anne O- Obesity Kansas city
7 0007 06/01/2018 Luke A+   Minneapolis
8 0008 06/01/2018 Luke   Alzheimer  

In this Dataset, there are three columns with missing values - ‘Blood Group’, ‘ Prior medical history’ and ‘Address’. It is evident that the blood group of a person never changes, therefore we fill the missing values in the column ‘Blood Group’.

Let’s take the column ‘Blood Group’ and fill the missing values within the group (Name of patient) from above. Follow the steps below:

  1. Open Data Preparation menu and click on Column Functions and then select Fill Missing Values .

    menu

    Fig. 31 Fill Missing Values option in Data Preparation menu.

  2. Select Blood Group column from Fill empty cells in drop-down.

    select column

    Fig. 32 Select a column to fill the empty values.

  3. Select Patient Entry ID column and ascending/descending order in order by drop-down.

    Order by

    Fig. 33 Ordering the Dataset in ascending/descending order on the basis of column selected.

  4. Select above/below from the drop-down ‘with values from’ to fill the missing value in the column with the value above or below it.

    Above or Below

    Fig. 34 Above or below option in with values from drop-down.

  5. Click on Add Group and select Name of the patient column for the grouping of the data.

    Grouping

    Fig. 35 Pick a column from Group by to group the data

  6. Click on APPLY .

How it Works

Grouping by column divides the table into groups on the basis of data in the selected column. Filling from above or below is done on the groups thus formed.

Let’s take the sample Dataset of a hospital and break it down step by step.

  1. Add a grouping rule and select the column which you want to use to segregate the data. For example, let’s use the Name of the patient column. This rule divides the table based on the similar values in the same column. In this table, four groups are formed. We can visualize this as:
Grouping example

Fig. 36 Division of Dataset into Groups

Values from Above / Below

When you select ‘above’ or ‘below’ from the ‘with values from’ drop-down, it fills the empty cell of the column with value in the cell just above/below it. If we fill a group from above (using default row, as ‘order by’) in the table “Hospital”, it will fill the data from above in the four groups formed separately. Similarly, on selecting ‘below’, it will fill the data from cells below and then show it together.

We can visualize fill missing values from Above as follows:

Fill from above Example

Fig. 37 Fill missing values from Above

The resultant table will be:

Result on filling from above

Fig. 38 Final outcome after filling values from Above

Similarly, fill missing values from below can be visualized as:

Fill from below Example

Fig. 39 Fill missing values from Below

The resultant table will be:

Result on filling from below

Fig. 40 Final outcome after filling values from Below

Ordering

To fill from above or below, you would need to order the data first. ‘Order by’ option allows you to sort your data according to the specified column in ascending or descending order. By default, ordering of data is done by the row number. For example, select Patient Entry ID column and ‘ascending’ in the drop-down menu labelled ‘order by’ to sort the data in ascending order.

Alternatively, you can order them on the home screen itself. At the bottom of the screen, there is a ‘Sort by’ option. If you sort the table on the home screen, it becomes the default option in the fill missing values menu.

Sorting

Fig. 41 Sort by option at the home screen

In the above example, if you select the ‘Date’ column for ordering the data, it leads to an ambiguity as the data in that column is not unique, there is duplicate data in several cells. Since row number is a unique, duplicate value in the chosen order column get further resolved by row number.

NOTE

The ‘Add Group’ option however is optional. When grouping is not applied, whole data is treated as a single group. Thus the missing values is filled by the value above or below based on the ordering over complete data.

See also

Window Function

The Window Function documentation