Concepts

Introduction

Mammoth is a data management platform that allows you to take your data from its raw state to insights, with all the steps in between. More specifically, Mammoth allows you to:

  1. Upload data or fetch data from various data sources
  2. Warehouse this data
  3. Transform the data through an iterative process of discovery, preparation, blending, and insights
  4. Save the process steps to a task pipeline so that any future data changes go through the same steps automatically
  5. Use the prepared data to create dashboards or generate reports, either in Mammoth or in a dashboard system outside of Mammoth

You use Mammoth to implement a data flow pipeline in which your data traverses successive transformations to produce output. The process is feedback driven, with data discovery happening along the way. Here is a sample of a simple data transformation process in Mammoth.

Fig. 1 A typical data flow in Mammoth

Terminology

Dataset

A dataset contains your original sourced data, organized as a grid of columns and rows. Each column has a specific data type (numeric, text, or date). The data can come from various sources. You can update a dataset by replacing its data or appending new data. A change to the data in a dataset updates all the views built on that dataset.

A batch of data is created every time data is appended, whether through a file upload or a scheduled pull from a database or API. You can view the data batches in the preview panel for a dataset, and you can delete specific batches.
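
To make the batch model concrete, here is a minimal Python sketch (pandas stands in for Mammoth's storage, purely for illustration; Mammoth does not expose its internals this way) of a dataset as an ordered list of appended batches:

    import pandas as pd

    # Illustrative model: a dataset as an ordered list of batches.
    # Each file upload or scheduled pull appends one batch.
    batches = [
        pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]}),  # first upload
        pd.DataFrame({"order_id": [3], "amount": [15.0]}),           # scheduled pull
    ]

    # The dataset's grid of rows is the concatenation of all batches.
    dataset = pd.concat(batches, ignore_index=True)

    # Deleting a specific batch removes its rows; views built on the
    # dataset would then see the updated data.
    del batches[0]
    dataset = pd.concat(batches, ignore_index=True)
    print(dataset)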

Task

A task is a transformation that you perform on your data. A task is created in a View. Each task adds a layer of change on top of the previous state of the data, starting from the original dataset; deleting a task returns the data to its previous state. Following is a sample list of tasks you can perform on your data (a sketch of this layering appears below the table).

Table operations       Column operations      Numeric/text/date operations
Filter                 Add Column             Math
Search and Replace     Remove Column          Get small or large
Remove Duplicates      Insert custom values   Window functions
Show Top/Bottom rows   Combine                Extract text
Lookup                 Split                  Transform text
Join                   Duplicate              Extract from JSON
Group                  Convert                Increment/decrement date
Reshape                                       Extract date part
Pivot                                         Date difference
Save as dataset

For a detailed description of all available tasks, see the Tasks documentation.
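
The layering behavior can be pictured with a short Python sketch (purely illustrative; pandas stands in for Mammoth's engine, and the two tasks are hypothetical). Each task is a function applied to the output of the previous one, so removing a task and re-running the pipeline restores the earlier state:

    import pandas as pd

    raw = pd.DataFrame({"city": ["NY", "NY", "LA"], "sales": [100, 120, 90]})

    # Each task is a function from one data state to the next.
    tasks = [
        lambda df: df[df["sales"] > 95],                   # a Filter task
        lambda df: df.assign(sales_k=df["sales"] / 1000),  # an Add Column task
    ]

    def run_pipeline(df, tasks):
        # Apply tasks in order; each adds a layer on the previous state.
        for task in tasks:
            df = task(df)
        return df

    print(run_pipeline(raw, tasks))

    # Deleting the last task and re-running returns the data
    # to its previous state.
    tasks.pop()
    print(run_pipeline(raw, tasks))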

Pipeline

Fig. 2 A sample task pipeline

A pipeline is a series of tasks applied to the original dataset, created to transform your original data into a final result. Once a pipeline of tasks is in place, any new data automatically goes through the same pipeline to produce the final result. Pipelines can be saved as Templates for later use; you can apply a template in a new View as long as the new dataset has the same metadata as the original one.
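
A minimal sketch of the replay idea, assuming a hypothetical metadata check on column names and types (this is not Mammoth's actual template mechanism):

    import pandas as pd

    def pipeline(df):
        # A saved series of tasks: filter, then aggregate.
        df = df[df["amount"] > 0]
        return df.groupby("region", as_index=False)["amount"].sum()

    def same_metadata(a, b):
        # Hypothetical check: a template applies only if column
        # names and types match the original dataset.
        return list(a.dtypes.items()) == list(b.dtypes.items())

    original = pd.DataFrame({"region": ["EU", "US"], "amount": [5.0, 7.0]})
    new_data = pd.DataFrame({"region": ["EU", "EU"], "amount": [3.0, -1.0]})

    # New data automatically goes through the same task pipeline.
    if same_metadata(original, new_data):
        print(pipeline(new_data))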

View

You transform your data in a View. A view is a container for a task pipeline, explorations, metrics, and actions. The data you see in a view is the dataset as transformed by the task pipeline; if there are no tasks, it is identical to the dataset.

You can create multiple views on your dataset, each with a different task pipeline, explorations, metrics, and actions. You can also merge additional data into a view through joins, and you can save the data at any step of a task pipeline to another dataset.
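
Conceptually, several views over one dataset are independent pipelines over the same source. A rough Python sketch (the column names are made up for illustration):

    import pandas as pd

    dataset = pd.DataFrame({
        "country": ["US", "US", "DE"],
        "amount": [10.0, 12.0, 9.0],
        "internal_note": ["a", "b", "c"],
    })

    # View 1: a pipeline that hides a column irrelevant to this analysis.
    view1 = dataset.drop(columns=["internal_note"])

    # View 2: a different pipeline on the same dataset.
    view2 = dataset.groupby("country", as_index=False)["amount"].mean()

    # The data at any step of a pipeline can be saved as a new dataset.
    saved_dataset = view1.copy()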

Explore Cards

Explore cards apply to a column and give you a quick summary of its data. When you create an explore card, you see the counts of distinct values in a text column, or the distribution of values in a numeric or date column. You can change the displayed values to show other statistics, such as sum, average, or standard deviation, with respect to the values in another column.

Explore cards are filterable, and multiple explore cards can be used in combination. You can also create custom explore cards by providing custom conditions across multiple columns; an advanced condition module helps you build the right conditions.

Explore card filters are short-lived: you can reset them and create fresh ones, and when you close a view and open it again, the filters are lost. However, you can make a filter permanent by adding its condition to the task pipeline; this updates the view by appending the task to the pipeline.
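
The difference between a short-lived filter and a permanent one can be sketched in Python (illustrative only; the condition and column names are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"region": ["EU", "US", "EU"], "sales": [120, 80, 95]})

    # A short-lived explore-card filter: a custom condition across two columns.
    condition = (df["region"] == "EU") & (df["sales"] > 100)
    preview = df[condition]  # lost once the view is closed

    # Making the filter permanent: append the same condition as a pipeline
    # task, so it is re-applied whenever the view's data is computed.
    pipeline_tasks = [lambda d: d[(d["region"] == "EU") & (d["sales"] > 100)]]
    result = df
    for task in pipeline_tasks:
        result = task(result)
    print(result)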

Metrics

A metric is an aggregated summary of the data in a column; the sum, count, average, or standard deviation of a column's values are all examples. A metric can be conditional on some constraint in the data. You create metrics in a view to gain insight into critical measures in your data.
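
As a sketch (pandas used for illustration; "region" and "sales" are hypothetical columns), a conditional metric is simply an aggregate computed under a constraint:

    import pandas as pd

    df = pd.DataFrame({"region": ["EU", "US", "EU"],
                       "sales": [120.0, 80.0, 95.0]})

    # Unconditional metrics: aggregated summaries of one column.
    total_sales = df["sales"].sum()
    avg_sales = df["sales"].mean()

    # A conditional metric: the same aggregate under a constraint on the data.
    eu_sales = df.loc[df["region"] == "EU", "sales"].sum()
    print(total_sales, avg_sales, eu_sales)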

Journey of data

Source

Mammoth allows you to fetch your data from different kinds of sources. You can upload files, connect to one of the supported databases, pull from a cloud API (such as Salesforce or Google Analytics), or have webhooks push data to Mammoth. You can also pull any public dataset available on the web as a comma-separated values (CSV) file or a zipped CSV via its URL.
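
For the last case, pulling a public CSV by URL looks like this in plain Python with pandas (the addresses are placeholders; this illustrates the idea, not Mammoth's own loader):

    import pandas as pd

    # Pull a public CSV directly from a hyperlink.
    df = pd.read_csv("https://example.com/data.csv")

    # A zipped CSV works the same way; pandas infers the compression
    # from the extension (the zip must contain a single CSV file).
    df_zipped = pd.read_csv("https://example.com/data.csv.zip")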

Organize

Your source data in Mammoth resides as a collection of datasets and their derived Views. Over time, the number of datasets can grow large; you can organize them into folders to keep them logically arranged.

Fig. 3 A view is derived from a dataset. Multiple views can be created

Discover

When you open a dataset, Mammoth automatically creates the first View to help you start exploring your data. You can create an explore card to do quick exploration and look for insights, anomalies, or patterns. Explore cards are a very powerful tool; for more information, see the Explore Cards section above.
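
The kind of summary an explore card gives can be approximated in pandas (illustrative; the columns are hypothetical): distinct-value counts for a text column, a distribution summary for a numeric column, and a statistic computed with respect to another column:

    import pandas as pd

    df = pd.DataFrame({"category": ["a", "b", "a"], "price": [3.5, 4.0, 2.5]})

    # Text column: counts of distinct values.
    print(df["category"].value_counts())

    # Numeric column: a quick distribution summary.
    print(df["price"].describe())

    # Another statistic with respect to values in a second column,
    # e.g. average price per category.
    print(df.groupby("category")["price"].mean())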

Transform

You build perspectives on your data through Views. Views give you the flexibility to transform your data through simple or complex pipelines so you can analyze it. Following are some points to note about Views:

  1. You can create multiple views on the same dataset.
  2. A view can blend data from one or more datasets and Views.
  3. Multiple views can append their output to the same dataset.

Following are a few examples that illustrate the power of Views:

  1. Assume your raw data contains a large number of columns, each giving a different kind of statistic about the subject of analysis. Working with so many columns can be difficult. To simplify, you can create multiple views on the same dataset and, in each view, hide the columns that are not relevant; each view can then have a pipeline suited to its analysis.
  2. In another case, your data may have many columns, but only a few of them should be exposed for downstream analysis. You can hide the other columns and save the view as a new dataset.
  3. In another example, your data lives in two datasets and you want to join them on a common key present in both. You can join data from another view with the data in your current view to achieve this (see the sketch after this list).
  4. Your data may come from multiple sources and be structured differently, while your final data must follow a standardized structure. You can create pipelines on each dataset through its views to produce a standardized output, and then append the outputs into a target dataset.
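
Examples 3 and 4 can be sketched in pandas (all column names are made up; this only illustrates the shape of the operations, not Mammoth's own join and append tasks):

    import pandas as pd

    # Example 3: join two datasets on a common key.
    orders = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 20.0]})
    customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Max"]})
    joined = orders.merge(customers, on="customer_id", how="left")

    # Example 4: standardize differently structured sources, then
    # append both outputs into one target dataset.
    source_a = pd.DataFrame({"Revenue": [5.0], "Country": ["US"]})
    source_b = pd.DataFrame({"rev": [7.0], "ctry": ["DE"]})
    std_a = source_a.rename(columns={"Revenue": "amount", "Country": "country"})
    std_b = source_b.rename(columns={"rev": "amount", "ctry": "country"})
    target = pd.concat([std_a, std_b], ignore_index=True)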

Export/Save

After the tasks in a pipeline are complete, you can save your data to another dataset or publish it for further analysis to an external system, such as a database or a visualization platform like Power BI, Elasticsearch/Kibana, or Google Data Studio. Many external visualization tools connect to one of the supported databases.
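
One common route is to publish the prepared table to a database that a BI tool then reads. A minimal sketch using SQLAlchemy with SQLite as a stand-in (the connection URL and table name are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    final = pd.DataFrame({"region": ["EU", "US"], "sales": [215.0, 80.0]})

    # Publish the prepared data to a database table; a visualization tool
    # such as Power BI or Kibana would then connect to that database.
    engine = create_engine("sqlite:///analytics.db")
    final.to_sql("sales_summary", engine, if_exists="replace", index=False)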