Data Cleaning in Excel: Best Techniques and Tips

In a utopian world, data is always clean and properly formatted. But in reality, data is riddled with formatting issues, duplicate rows, missing values, spelling mistakes, improper capitalizations, leading or trailing spaces, redundant prefixes/suffixes, lack of structure, and so much more.

Working with clean data is a prerequisite to getting meaningful insights from your data. That said, data analysts like me can tell you how tedious and frustrating data cleaning often is. If you haven’t used the right tools for data cleaning in Excel, you would know how complex and manual the process is. 

For years, I was struggling to clean data in Excel, repeating the process over and over again every single time. The complexity of the features and the manual effort involved haven’t eased beyond a point. 

In their desperation to automate the process, many analysts have resorted to makeshift tools, coding, or add-ons. Some of these tools have worked well for their specific cases, but they can’t be generalized, and you’ll have to work on the same data cleaning task over and over again. Instead of these one-off tools you need something with visualization, automation, and scalability. 

If you’re a professional still looking to skip past the repetitive process of data cleaning to more meaningful work, I have something for you.

Stick around to learn how to clean your data in two ways: 1) Manually in Excel (best suitable for one-off tasks) and 2) Automatically in Mammoth (best suitable for continuous data management).

Here are the 8 most common data cleaning in Excel tasks:

  1. Remove duplicates
  2. Standardize formats
  3. Even out casing and remove extra spaces
  4. Split delimited data
  5. Find and replace
  6. Extract prefixes and suffixes
  7. Check for spelling and typos
  8. Fill missing values

Before you go ahead with any cleaning, you should understand the origins of your data — how it was recorded, where & how it is stored, how it is imported, etc.

Often, different bits of data are stored as different datasets, and in different databases. When you bring together this data, you might see duplicated rows. 

Once you have an understanding of the origins of the data and the duplication, this is how you can remove it: 

In Excel

To remove duplicates in Excel,

  • Select a column or range
  • Go to the “Remove duplicates” tool under the Data option and you’ll see a modal like this one:
Removing duplicates screens in Excel
  • Select the columns you want to remove duplicates from and check the headers box if your dataset has headers
  • Click OK.

While this looks straightforward, it gets repetitive with every data pull. Here’s how to do it once and get automated data cleaning for Excel files.

In Mammoth

To remove duplicates in Mammoth,

Remove Duplicates function in Mammoth
  • Ignore columns if you want to, otherwise apply the rule on the whole dataset
Removing duplicate rows in Excel
  • Click Apply.

The task gets recorded in the pipeline and runs every time there is new data inflow in the dataset, not requiring additional effort from your end.

A data transformation task pipeline in Mammoth showing the actions user is about to take.

At Mammoth, our clients often need help with duplicate data and disparate values between different datasets. For example, Starbucks spent 100+ hours cleaning and unifying country-specific values monthly. Now, with the help of Mammoth’s automated data cleaning tools, they consolidate 1B+ rows of multi-country sales data into insights without manual work.

One of my most common challenges with data cleaning is fixing format and structure. While most databases contain structured data, there are times when formats are wrongly detected, or there are mismatches in data formats from two complementary datasets.

Detecting and assigning the correct data type and format is crucial to performing the right calculations. Incorrect data types can even restrict you from performing datatype-specific functions. 

For example, if your numeric data is formatted as text, you’ll not be able to perform arithmetic functions on them. Or if a date column is formatted as dd-mm-yy in one dataset, and as a time stamp dd-mm-yy hh:mi:ss in another, there will be a mismatch.

In Excel

In Excel, you can use Power Query to detect and change the data type and formats.

Here’s how,

  • Go to Power Query
  • Select a data source. It will automatically detect data types based on the column values
  • Go to the Query > Edit
  • Select a column, and then select Transform > Detect Data Type
  • Choose a data type/format from the drop-down.

Alternatively, you can use the Text to Column function in Excel to change the data types & formats. 

This is how,

  • Select a cell or column
  • Rght-click on the cell or column and select Format Cells from the menu. Alternatively, press ctrl+1 in Windows or Command+1 in Mac to open the Format widget
  • Change the data type or the format of the column
  • Click OK.
Format Cells wizard in Excel

Changing data types in Excel is a lot of work especially when you want to change data types of multiple columns at a time.

In Mammoth

You can change data types of multiple columns at once,

  • Select a single column or select multiple columns from the column browser
  • Choose the Convert data type function
  • Select datatypes for the columns
  • And apply all at once.
Standardizing data in Mammoth

This task also gets added to the pipeline and runs automatically on data updates.

Are you struggling with Excel data cleaning and repeating the same tasks over and over again? Sign up for Mammoth’s free version to see how easy it is to get rid of repetitive data cleaning tasks.

Casing and extra spaces in text columns are again basic formatting issues analysts regularly come across. Having consistent formatting across your datasets prevents erroneous calculations or blank cells.

In Excel

Fixing casing in Excel is quite a manual process,

Here’s how you can do it,

  • Insert a new column (Say col B) next to the text column
  • In the new column, write the formula –

    =UPPER(cell address) – Converts to UPPERCASE 
    =LOWER(cell address) – Converts to lowercase
    =PROPER(cell address) – Converts to Proper Case
  • Drag to apply to all the cells in the column
  • Copy the values in col B and Paste the values (without the formula) in the text column.
  • Now, delete column B.

Removing extra spaces can be a task of its own in Excel. The TRIM function in Excel removes all leading, trailing, and extra spaces barring one space between words. 

Here’s how you can use it,

  • Insert a new column (say col B) next to the text column you want to trim
  • Write the formula – TRIM(cell address) in the new column
  • Drag to all the cell
  • Copy the values from col B and paste them into the text column. All extra spaces will be removed.

In Mammoth

To fix casing or extra spaces in Mammoth, you don’t need to use separate functions. Just use the Text formatting function and fix everything in one go, like this –

  • Go to the Transform menu
  • Find the Text formatting under Text functions
  • Change the casing of text functions and normalize spacing
  • Click Apply.
Changing casing in Mammoth

In Mammoth, you apply the rule once and it gets automated forever. On new data pulls, the data goes through the same data-cleaning steps automatically. You are not required to clean new data again and again.

Many times a few columns contain a lot of information delimited by a comma or space. You might need to split them into multiple columns for data processing. This process is also called Data Parsing.

In Excel

Parsing data in Excel goes like this,

  • Select the cell or column that contains the text you want to split.
  • Select Data > Text to Columns.
  • In the Convert Text to Columns Wizard, select Delimited > Next.
Splitting delimited data in Excel
  • Select the Delimiters for your data. For example, Comma, Space, Semicolon, or Tab. You can see a preview of your data in the Data preview window.
  • Select Next.
  • Select the Destination in your worksheet which is where you want the split data to appear.
  • Select Finish.

NOTE: Be careful when defining the destination columns, otherwise the information in the existing columns may be overwritten.

In Mammoth

To fix casing or extra spaces in Mammoth, you just have to,

Navigating to the Split column function in Mammoth
  • Select the column you want to split into multiple columns
  • Specify the delimiter
  • Select the number of columns you want to split the column into
  • Enter names for the new columns
  • Click Apply.
Splitting columns in Mammoth

Instead of going through multiple steps in Excel, you can apply the function from a single widget in Mammoth. Further, you can split the column into as many columns as you like. Not to mention, the rule automatically applies to all new data in the future.  

Are you cleaning Excel data continuously? Sign up for Mammoth’s free version today and learn how to save time with automation.

Datasets may store values in multiple variations. For instance, Cupcake could be stored as cupcake, CUPCAKE, CUPcake, cUPcake, and so on. It’s important to standardize these values before processing them. The data cleaning cheat sheet might come in handy.

In Excel

There is no bulk way to replace multiple values in Excel, however, you can use the Find and Replace function to replace values one by one.

Here’s how,

  • Select the cell or column
  • Go to Home > Find and Replace
  • Enter a value in the Find what option
  • Enter the value you want to replace it with
  • Click on “Replace All” if you want to apply it to the whole sheet.
The find and replace functionality in Excel

In Mammoth

Mammoth offers two functions to standardize values – Find and Replace and Bulk Replace

The Find and Replace function can be applied to multiple columns, or even the whole dataset at a time. 

Whereas the Bulk Replace function works on one column at a time but is a much more intuitive function than the Find and Replace function. Standardizing values works like a charm with the Bulk Replace function. 

This is how you can replace the values with Find and Replace in Mammoth,

  • Go to the Transform menu
  • Find the Find and Replace function under Label, Filter and Replace
  • Select a column
  • Enter the value you want to replace
  • Enter a new value you want to replace with
  • Click Apply
Find and replace in Mammoth

And, with the Bulk Replace function,

  • Go to the Transform menu
  • Find the Bulk Replace function under Label, Filter, and Replace
  • Select a column
  • Put all similar values in a group
  • Enter a new value for each group
  • Click Apply.
Bulk Replace function in Mammoth

And just like that you have replaced multiple values in a column with a fraction of effort. This will also apply to all new data the dataset may get.

Inspired Learning Group (ILG), an organization operating 23 schools across the UK, is one of the many Mammoth customers using the ‘Find and Replace’ feature in an automated way. ILG’s leadership team receives individual reports from different schools, and with Mammoth’s automated process, they can effortlessly consolidate all Excel files and visualize the data in automated Power BI dashboards – all without the help of the IT team.

Certain values such as OrderID, ItemID, or Model number often nest a lot of information like demography details, manufacturing details, processing details, and so on. 

While this could be valuable information for analysis, it’s difficult to use until we extract it as a separate value.

In Excel

In Excel, we use the LEFT and RIGHT functions to extract prefixes or suffixes.

This is how,

  • Insert a new column next to the column you want to extract text from
  • Select the cell or column
  • Write the formula – =LEFT(text, [number of chars])
  • Drag the rule to apply to all the cells.
LEFT function in Excel

In Mammoth

You can simply use the “Extract Text” to extract text,

  • Go to the Transform menu
  • Find the Extract Text under Text functions
  • Change the casing of text functions and normalize spacing
  • Click Apply.
Extracting text in Mammoth

Misspelled terms or typos can occur anywhere. It’s good practice to check your data for spelling mistakes.

In Excel

The spell check function in Excel is great for spell checks,

  • Select a cell or column
  • Go to Review > Spelling
  • Excel would suggest the correct spelling or similar words if there are wrong spellings. You can choose the value you want and it will get replaced everywhere in the column.
  • If all spellings are correct, it will show a checkmark.

In Mammoth

Mammoth currently doesn’t offer spell check as a built-in feature.

Missing values stem from poor data records or data entry errors. You can choose to either remove the missing values or impute and fill them. 

Having complete data helps in data processing & modeling and you should strive to have zero blank cells in your data.

However, if there are missing values in a column that you are not going to need at all for your analysis, you can simply ignore those.

In Excel

To fill in missing values in Excel,

  • Highlight the range starting before and after the missing values
  • Click Home > Editing > Fill > Series.

This might not work if the data is not in a sequential order. 

In Mammoth

To fill in missing values in Mammoth,

  • Go to the Transform menu
  • Find the Fill missing values under Column functions
  • Select if you want to fill with the value above or with the value after
  • Click Apply.

Conclusion

If you are still going through the same rigorous data cleaning routine every time, this is the time to change. Adapt to intuitive and automated solutions like Mammoth Analytics for smooth data management and processing.

To automate your data cleaning and transformation work, sign up for Mammoth today!

Related Posts

Mammoth Analytics achieves SOC 2, HIPAA, and GDPR certifications

Mammoth Analytics is pleased to announce the successful completion and independent audits relating to SOC 2 (Type 2), HIPAA, and GDPR certifications. Going beyond industry standards of compliance is a strong statement that at Mammoth, data security and privacy impact everything we do. The many months of rigorous testing and training have paid off.

Read More

Announcing our partnership with NielsenIQ

We’re really pleased to have joined the NielsenIQ Connect Partner Network, the largest open ecosystem of tech-driven solution providers for retailers and manufacturers in the fast-moving consumer goods (FMCG/CPG) industry. This new relationship will allow FMCG/CPG companies to harness the power of Mammoth to align disparate datasets to their NielsenIQ data.

Read More

Hiring additional data engineers is a problem, not a solution

While the tendency to throw in more data scientists and engineers at the problem may make sense if companies have the budget for it, that approach will potentially worsen the problem. Why? Because the more the engineers, the more layers of inefficiency between you and your data. Instead, a greater effort should be redirected toward empowering knowledge workers / data owners.

Read More