Fill Missing Values with Grouping¶
Filling missing values with grouping is a unique function which segregates the data and fills the null or empty values in a Dataset.
Let’s take a sample data set “Hospital” which shows the information about certain patients. You can download this sample data
|Row Number||Patient entry ID||Date||Name of Patient||Blood Group||Any prior history?||Address|
|4||0004||14/09/2017||John||AB+||High blood pressure||San Diego|
In this Dataset, there are three columns with missing values - ‘Blood Group’, ‘ Prior medical history’ and ‘Address’. It is evident that the blood group of a person never changes, therefore we fill the missing values in the column ‘Blood Group’.
Let’s take the column ‘Blood Group’ and fill the missing values within the group (Name of patient) from above. Follow the steps below:
Open Data Preparation menu and click on Column Functions and then select Fill Missing Values .
Select Blood Group column from Fill empty cells in drop-down.
Select Patient Entry ID column and ascending/descending order in order by drop-down.
Select above/below from the drop-down ‘with values from’ to fill the missing value in the column with the value above or below it.
Click on Add Group and select Name of the patient column for the grouping of the data.
Click on APPLY .
How it Works¶
Grouping by column divides the table into groups on the basis of data in the selected column. Filling from above or below is done on the groups thus formed.
Let’s take the sample Dataset of a hospital and break it down step by step.
- Add a grouping rule and select the column which you want to use to segregate the data. For example, let’s use the Name of the patient column. This rule divides the table based on the similar values in the same column. In this table, four groups are formed. We can visualize this as:
Values from Above / Below¶
When you select ‘above’ or ‘below’ from the ‘with values from’ drop-down, it fills the empty cell of the column with value in the cell just above/below it. If we fill a group from above (using default row, as ‘order by’) in the table “Hospital”, it will fill the data from above in the four groups formed separately. Similarly, on selecting ‘below’, it will fill the data from cells below and then show it together.
We can visualize fill missing values from Above as follows:
The resultant table will be:
Similarly, fill missing values from below can be visualized as:
The resultant table will be:
To fill from above or below, you would need to order the data first. ‘Order by’ option allows you to sort your data according to the specified column in ascending or descending order. By default, ordering of data is done by the row number. For example, select Patient Entry ID column and ‘ascending’ in the drop-down menu labelled ‘order by’ to sort the data in ascending order.
Alternatively, you can order them on the home screen itself. At the bottom of the screen, there is a ‘Sort by’ option. If you sort the table on the home screen, it becomes the default option in the fill missing values menu.
In the above example, if you select the ‘Date’ column for ordering the data, it leads to an ambiguity as the data in that column is not unique, there is duplicate data in several cells. Since row number is a unique, duplicate value in the chosen order column get further resolved by row number.
The ‘Add Group’ option however is optional. When grouping is not applied, whole data is treated as a single group. Thus the missing values is filled by the value above or below based on the ordering over complete data.
The Window Function documentation