Data Preparation using Dataiku


While dealing with data, machine learning and artificial intelligence, efficient tools should never be overlooked. Dataiku is one such platform: it is used to perform artificial intelligence tasks, systemize data, create and deliver data products, and provide advanced analytics using the latest available technologies. Data used by an organization needs to be prepared before it is processed or analyzed, and Dataiku helps to connect, cleanse and prepare data before sending it on to machine learning and advanced analytics. As data flows in from multiple sources, Dataiku can connect to all of them, providing connectors to 25 data sources such as Amazon S3, Azure Blob Storage, SQL and NoSQL databases, HDFS, Snowflake and Google Cloud Storage, both on-premises and in the cloud.

As an easy-to-use platform, it helps with data preparation, data wrangling and data cleansing. It comes with a simple visual interface that allows you to join, group or aggregate datasets, and to clean data by removing redundant records, normalizing and enriching it. Every step is captured by Dataiku and can be replayed later. Data transformation is also made simple through more than 90 built-in visual transformers, covering operations such as binning, filtering, splitting, data type conversions, currency conversions and concatenation. Any additional transformer required for data manipulation can be created by writing a formula. Automatic data type suggestions also save the user a lot of time. Geospatial transformation functions are another built-in feature of Dataiku; these functions extract latitude and longitude from geopoint data, and vice versa.
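To make these operations concrete, here is a minimal sketch of the same kinds of transformations in plain pandas, not Dataiku's own API. The sample data is hypothetical, and the geopoint strings assume the common `POINT(longitude latitude)` text format:

```python
import pandas as pd

# Hypothetical sample data; geopoints written as "POINT(lon lat)"
df = pd.DataFrame({
    "customer": ["a", "a", "b", "c"],
    "amount": [120.0, 120.0, 80.0, 250.0],
    "location": ["POINT(-73.98 40.75)", "POINT(-73.98 40.75)",
                 "POINT(2.35 48.85)", "POINT(139.69 35.68)"],
})

df = df.drop_duplicates()  # remove redundant records

# Min-max normalization of the amount column
df["amount_norm"] = (df["amount"] - df["amount"].min()) / (
    df["amount"].max() - df["amount"].min())

# Binning amounts into labeled ranges
df["amount_bin"] = pd.cut(df["amount"], bins=[0, 100, 200, 300],
                          labels=["low", "mid", "high"])

# Extract longitude/latitude from the geopoint strings
coords = df["location"].str.extract(r"POINT\((?P<lon>\S+) (?P<lat>\S+)\)")
df[["lon", "lat"]] = coords.astype(float)

print(df[["customer", "amount_bin", "lon", "lat"]])
```

In Dataiku itself these steps would be configured visually as processors rather than written as code.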


Data preparation in Dataiku is carried out through recipes. A recipe is a set of repeatable actions performed on one or more input datasets that results in one or more output datasets. Dataiku offers three types of recipes for data preparation.

  1. Visual Recipe: This recipe covers the most common data transformations performed on an input dataset, including filtering, concatenation, splitting, joining and sorting. Actions can be recorded, which removes the effort of repeating the same steps again. It includes a few specialized sub-recipes: the Filter recipe samples raw data, the Stack recipe performs a union of the training and test data samples, and the Prepare recipe handles the bulk of data cleansing and enrichment. With the help of built-in visual processors, data wrangling is made code-free.
  2. Code Recipe: Any transformations that go beyond what the visual recipes offer are done here. A Jupyter notebook or another IDE can be integrated, and customizations can be implemented in a supported language such as Python, R or SQL.
  3. Plugin Recipe: In addition to the built-in visual recipes, new visual recipes can be created as plugins. Additional functionality can be wrapped in these reusable components, and a plugin store is available for installing existing plugins.
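As an illustration of the Filter, Stack and Prepare steps described above, here is a rough pandas equivalent. This is not Dataiku's actual recipe API; the datasets and column names are hypothetical:

```python
import pandas as pd

# Hypothetical train/test samples standing in for two input datasets
train = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, None, 30.0]})
test = pd.DataFrame({"id": [4, 5], "value": [40.0, 50.0]})

# Filter step: sample the raw data (here, keep only complete rows)
train_sample = train.dropna(subset=["value"])

# Stack step: union of the training and test samples
stacked = pd.concat([train_sample, test], ignore_index=True)

# Prepare step: cleansing/enrichment (here, scale values to [0, 1])
stacked["value_scaled"] = stacked["value"] / stacked["value"].max()

print(stacked)
```

In Dataiku, each of these steps would be a recipe node in the flow, with its output written to a dataset that the next recipe consumes.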

Any recipe is run by creating a job. A flexible computational environment lets users take control over the execution of the job: both the task composition within a job and the execution engine can be chosen by the user. The Jobs menu displays the status of current and past jobs along with their log files, which helps in optimizing the flow.


#RandomTrees #Dataiku