CRISP-DM, one AI/ML Lifecycle: An Introduction

This is meant to be a simple introduction to the CRISP-DM framework, which is just one of many artificial intelligence and machine learning lifecycles. There are numerous sources for deeper understanding.

The CRISP-DM framework, the CRoss-Industry Standard Process for Data Mining, was created in 1996. The process consists of six major phases:

Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

The sequence of the phases is not strict, and moving back and forth between phases is common. In the standard CRISP-DM diagram, the outer circle symbolizes the cyclic nature of data mining itself. A data mining process continues after the initial problem is solved, often leading to more focused business questions or refinement of the data.

CRISP-DM Phase 1 – Business Understanding

Business understanding consists of four main steps:

Understanding business requirements
Analyzing supporting information
Converting to a data mining problem
Preparing a preliminary plan

The business question needs to be both specific and measurable. It can then be turned into a machine learning question. For example, the business question “What customers should we target for a new product?” can be turned into the machine learning question “Would this customer buy the product or not?” It is important to weigh the cost of creating a data mining solution against the business value of answering the question. As with all business projects, proper planning is essential, covering risks, goals, dependencies, tools and techniques, and project duration.
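As a rough illustration of weighing solution cost against business value, the sketch below frames the example question as a yes/no label and estimates a net value for the project. The class name, field names, and all figures are hypothetical and are not part of CRISP-DM itself.

```python
from dataclasses import dataclass


@dataclass
class ProjectEstimate:
    """Hypothetical inputs for a back-of-the-envelope value check."""
    targeted_customers: int        # customers the model would select
    expected_purchase_rate: float  # fraction of targeted customers expected to buy
    profit_per_sale: float
    cost_per_contact: float
    project_cost: float            # cost of building and running the solution

    def net_value(self) -> float:
        """Expected profit minus outreach cost and project cost."""
        gross = self.targeted_customers * self.expected_purchase_rate * self.profit_per_sale
        outreach = self.targeted_customers * self.cost_per_contact
        return gross - outreach - self.project_cost


def to_label(purchased: bool) -> int:
    """Turn "Would this customer buy the product or not?" into a 0/1 target."""
    return 1 if purchased else 0


# All numbers below are made up for illustration.
estimate = ProjectEstimate(
    targeted_customers=10_000,
    expected_purchase_rate=0.05,
    profit_per_sale=40.0,
    cost_per_contact=0.50,
    project_cost=12_000.0,
)
print("estimated net value:", estimate.net_value())
```

A negative result from a check like this would suggest revisiting the business question before any data work begins.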

CRISP-DM Phase 2 – Data Understanding

Data understanding has three primary steps:

Data collection
Data properties
Data quality

The data collection step entails listing data sources and what data to extract from those sources, analyzing the data for additional requirements, and determining whether any additional data source is needed. The data properties step includes understanding the metadata, the size of the data set, key features, and relationships between data elements, including correlations between them. The data quality step involves determining whether any data elements are missing and, if so, whether they can be removed or substituted.
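The data properties and data quality steps can be sketched with a few lines of plain Python. The records, column names, and helper functions below are hypothetical; a real project would more likely use a data analysis library, but the checks are the same: size, columns, missing values, and correlation.

```python
import math

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000.0, "bought": 1},
    {"age": 45, "income": None, "bought": 0},
    {"age": 29, "income": 48000.0, "bought": 0},
    {"age": 52, "income": 91000.0, "bought": 1},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


def column_pairs(rows, a, b):
    """Values of columns a and b, taken only from rows where both are present."""
    pairs = [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]
    xs, ys = zip(*pairs)
    return list(xs), list(ys)


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print("rows:", len(records))
print("columns:", sorted(records[0]))
print("missing:", missing_counts(records))
ages, incomes = column_pairs(records, "age", "income")
print("corr(age, income):", round(pearson(ages, incomes), 3))
```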

CRISP-DM Phase 3 – Data Preparation

Data preparation includes selecting the final data set and preparing the data. Selecting the final data set means keeping in mind constraints such as total size, which columns to include and exclude, record selection, and element data types. Preparing the data may involve cleaning, transforming, merging data sets, normalizing, or formatting the data. If the data set is small, the number of records becomes a consideration: rather than dropping records with missing elements, those elements can be filled in with default values or by statistical methods. It may be useful to revisit the data understanding phase after this phase is completed.
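Two of the preparation techniques mentioned above, filling in missing elements statistically and normalizing, might look like the following minimal sketch. Mean imputation and min-max scaling are chosen here only as simple examples; the function names and sample values are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical income column with one missing element.
incomes = [52000.0, None, 48000.0, 91000.0]
filled = impute_mean(incomes)
scaled = min_max_normalize(filled)
print(filled)
print(scaled)
```

Imputing instead of dropping keeps all four records, which matters most when the data set is small.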

CRISP-DM Phase 4 – Modeling

Modeling is arguably the most fun phase. It consists of three main steps:

Model selection and creation
Creating a model testing plan
Parameter testing and tuning

The modeling phase is closely tied to the data preparation phase because model selection influences data preparation and vice versa. Further testing may reveal that the data doesn’t fit the chosen modeling algorithm well, in which case Phase 3 must be revisited. The first step is to choose a modeling algorithm and the tools needed to implement it. For testing, the data is generally split into a training set and a test set. The split can vary depending on the data set and algorithm; a common split is 70% training and 30% test. An evaluation criterion should be chosen at this time. The actual training can involve tuning hyperparameters to adjust the accuracy or speed of training.
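The 70/30 split and an evaluation criterion can be sketched as follows. This is a minimal illustration using only the standard library; the function names, the fixed seed, and the choice of accuracy as the criterion are assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical data set of 100 record ids.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), "training records,", len(test), "test records")

# Accuracy on a tiny hypothetical set of labels and predictions.
print("accuracy:", accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Shuffling before splitting avoids accidentally putting, say, only the oldest records into the training set.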

CRISP-DM Phase 5 – Evaluation

Evaluation is where you assess how the model performs against the business goals defined in the business understanding phase and decide whether the model should be deployed. This depends on the evaluation criteria you outlined in the modeling phase. It is important to keep in mind business considerations like the cost of false positives or false negatives, execution speed, and overall cost. Review the steps taken throughout the process to verify that all criteria are met before making the final deployment decision.
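One way to bring the cost of false positives and false negatives into the evaluation is to count each kind of error and attach a business cost to it. The sketch below assumes binary labels (1 = would buy); the labels, predictions, and cost figures are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


def total_error_cost(counts, cost_fp, cost_fn):
    """Business cost of the model's mistakes."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn


# Hypothetical test-set labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
# Assumed costs: a wasted contact (false positive) is cheap,
# a missed sale (false negative) is expensive.
cost = total_error_cost(counts, cost_fp=0.50, cost_fn=40.0)
print(counts)
print("error cost:", cost)
```

Two models with the same accuracy can have very different error costs, which is why the business framing matters when deciding whether to deploy.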

CRISP-DM Phase 6 – Deployment

There are four steps in deployment:

Planning deployment
Maintenance and monitoring
Final report
Project review

First, you need to determine where the model will be deployed. For example, on AWS there are many options, including Amazon EC2, Amazon Elastic Container Service, and AWS Lambda. Then you need to decide how the model will be deployed and managed. On AWS, options include AWS CodeDeploy, AWS CloudFormation, AWS OpsWorks, and AWS Elastic Beanstalk. As with all well-architected systems, monitoring system health is important. Examples on AWS include Amazon CloudWatch, AWS CloudTrail, and AWS Elastic Beanstalk. A final report is delivered to stakeholders, highlighting the processes used, whether the project goals were met, and any findings, and explaining the model used and the reasoning behind choosing it. The project review assesses what went wrong and what went right, and determines whether any parts of the process can be reused.
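To make the deployment target concrete, here is a minimal sketch of what serving a model from AWS Lambda could look like. It assumes the function sits behind an API Gateway proxy integration, so the request arrives as a JSON string in event["body"]. The predict function is a hypothetical stand-in (a fixed threshold, not a trained model), and the "income" and "will_buy" field names are invented for the example.

```python
import json


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed income threshold."""
    return 1 if features.get("income", 0) >= 50000 else 0


def lambda_handler(event, context):
    """Entry point invoked by AWS Lambda.

    Assumes an API Gateway proxy-style event whose "body" is a JSON string.
    """
    features = json.loads(event.get("body") or "{}")
    prediction = predict(features)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"will_buy": prediction}),
    }


# Local usage example with a hand-built event; context is unused here.
response = lambda_handler({"body": json.dumps({"income": 60000})}, None)
print(response["body"])
```

Because the handler is an ordinary function, it can be exercised locally like this before any deployment tooling is involved.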

Michael McCarthy

Michael is a veteran software engineer and cloud computing aficionado. After starting his career as a Java software engineer, he evolved into a consultant, focusing first on enterprise content management and later on AWS. He is currently an AWS Cloud Practitioner and AWS Solutions Architect Associate, although he has held many more certifications in the past.