ML workflow with Airflow, MLflow and SageMaker
In MLOps, there are many tools that support data, workflow, model and other kinds of management. We can start with a simple ML workflow using the following platforms:
- Airflow for run orchestration.
- MLflow for experiment tracking and organization.
- SageMaker for training jobs, hyperparameter tuning, model serving and production monitoring.
For the Airflow and MLflow setups, we can deploy them on any infrastructure (K8s, ECS, etc.) with their metadata stored in RDS.
We will use Airflow purely as a scheduler, so we don’t need a complex worker architecture; all computation jobs are handled by SageMaker and other AWS services, as in the DAG sketch below.
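As an illustration, a minimal DAG only submits work to SageMaker through the Amazon provider's operator; the job name, image URI, role ARN and S3 paths here are placeholders, not the project's actual values.

```python
# A minimal sketch: Airflow schedules a SageMaker training job and waits for it,
# so no heavy computation runs on the Airflow workers themselves.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.sagemaker import SageMakerTrainingOperator

TRAINING_CONFIG = {
    "TrainingJobName": "retail-forecast-{{ ds_nodash }}",  # templated per run
    "AlgorithmSpecification": {
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/training:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::<account>:role/sagemaker-execution-role",
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://<bucket>/retail/train/",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://<bucket>/retail/models/"},
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

with DAG(
    dag_id="retail_ml_workflow",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    train_model = SageMakerTrainingOperator(
        task_id="train_model",
        config=TRAINING_CONFIG,
        wait_for_completion=True,
    )
```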
MLflow provides four main features covering the ML lifecycle: a central model registry, model deployment, project code management and experiment tracking.
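For experiment tracking and the model registry, the plain MLflow client API is enough. Below is a minimal sketch, assuming the tracking server deployed above is reachable at a placeholder URI and using a toy scikit-learn model just to have something to log.

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LinearRegression

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder URI
mlflow.set_experiment("retail-sales-forecast")          # placeholder experiment name

# Toy model so the example is self-contained.
X = np.arange(10).reshape(-1, 1)
y = 2 * X.ravel() + 1
model = LinearRegression().fit(X, y)

with mlflow.start_run():
    mlflow.log_param("fit_intercept", True)
    mlflow.log_metric("r2", model.score(X, y))
    # Log the model artifact and register it in the MLflow Model Registry.
    mlflow.sklearn.log_model(model, "model", registered_model_name="retail-forecast")
```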
We start by determining what is observed, what should be predicted, and how the performance and error metrics need to be optimized. The business problem is then framed as a machine learning problem, followed by these steps:
- Data acquisition: ingesting data from sources including data collection, data integration and data quality checking.
- Data pre-processing: handling missing data, outliers, long tails, etc.
- Feature engineering: running experiments with different features, adding, removing and changing features.
- Data transformation: standardizing data and converting it into formats compatible with the training algorithms.
- Model training: the training parameters, metrics, etc. are tracked in MLflow. We can also run SageMaker hyperparameter optimization with many training jobs, then search the metrics and parameters in MLflow to compare the candidates with minimal effort and find the best version of a model (see the training section below).
- Model evaluation: analyzing model performance based on predicted results on test data.
- If business goals are met, the model is registered as a SageMaker model for inference. We can also register it in the MLflow Model Registry.
- Getting predictions in either of the following ways (see the sketch after this list):
- Using SageMaker Batch Transform to get predictions for an entire dataset.
- Setting up a persistent endpoint to get one prediction at a time using SageMaker Inference Endpoints.
- Monitoring and debugging the workflow, and re-training with data augmentation.
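Both prediction options are available through the SageMaker Python SDK. A minimal sketch, where the image URI, model artifact location, role and data paths are placeholders:

```python
# A sketch of the two prediction modes: Batch Transform for an entire dataset,
# or a persistent real-time endpoint for one prediction at a time.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/inference:latest",
    model_data="s3://<bucket>/retail/models/model.tar.gz",
    role="arn:aws:iam::<account>:role/sagemaker-execution-role",
    sagemaker_session=session,
)

# Option 1: batch predictions for a whole dataset with Batch Transform.
transformer = model.transformer(instance_count=1, instance_type="ml.m5.xlarge")
transformer.transform(data="s3://<bucket>/retail/test/", content_type="text/csv")

# Option 2: a persistent endpoint serving one prediction at a time.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```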
For data processing, feature engineering and model evaluation, we can use several AWS services:
- EMR: provides a Hadoop-ecosystem cluster with Spark, Flink, etc. pre-installed. We should use a transient cluster to process the data and terminate it when the work is done.
- Glue jobs: provide serverless Apache Spark and Python environments. Glue has supported Spark 3.1 since August 2021.
- SageMaker Processing jobs: run in containers, with many prebuilt images for data science; Spark 3 is also supported. A minimal sketch follows this list.
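For example, the pre-processing and feature-engineering steps can run as a SageMaker Processing job on a prebuilt scikit-learn image. This is only a sketch; the role, S3 paths and the preprocess.py script are placeholders.

```python
# A sketch of a SageMaker Processing job using the prebuilt scikit-learn image.
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version="0.23-1",
    role="arn:aws:iam::<account>:role/sagemaker-execution-role",  # placeholder
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

processor.run(
    code="preprocess.py",  # placeholder script holding the actual processing logic
    inputs=[
        ProcessingInput(
            source="s3://<bucket>/retail/raw/",
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://<bucket>/retail/processed/",
        )
    ],
)
```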
Data access
- All data stored in S3 can be queried via Athena, with metadata from the Glue Data Catalog.
- We can also batch-ingest the data into the SageMaker Feature Store, writing directly to the offline store (see the sketch below).
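A minimal sketch of both access paths; the database, table, column names and feature group name are placeholders for the retail dataset used later.

```python
# Query S3 data via Athena (with Glue Data Catalog metadata), then batch-ingest
# the resulting dataframe into a SageMaker Feature Store feature group.
import awswrangler as wr
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

# Athena query against a Glue catalog database/table (placeholders).
df = wr.athena.read_sql_query(
    "SELECT store, dept, weekly_sales FROM sales LIMIT 1000",
    database="retail",
)

# Batch ingestion into an existing feature group (placeholder name), assuming the
# dataframe columns match its feature definitions (record identifier, event time,
# etc.); with an offline-only feature group the records land in the offline store.
feature_group = FeatureGroup(name="retail-features", sagemaker_session=sagemaker.Session())
feature_group.ingest(data_frame=df, max_workers=4, wait=True)
```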
Sample workflow
- Dataset: Kaggle Retail Data Analytics
- For full stages, please refer to this GitHub repo
Training and hyperparameter tuning jobs
- SageMaker training jobs are cost-effective when run on EC2 Spot Instances (managed spot training).
- To log the training parameters and metrics in MLflow, we should use SageMaker script mode with a training script like the sample sketched after this list.
- We can then specify the SageMaker estimator and hyperparameter ranges for the tuning jobs (also sketched below).
- Finally, we can compare the metrics and parameters of the different model versions in MLflow.
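Here is a minimal sketch of such a script-mode entry point. The column names, experiment name and the MLFLOW_TRACKING_URI environment variable are placeholders, and mlflow is assumed to be installed in the training container (for example via a requirements.txt next to the entry point).

```python
# train.py - SageMaker script-mode entry point that logs params/metrics to MLflow.
import argparse
import os

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--max-depth", type=int, default=6)
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    args = parser.parse_args()

    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    mlflow.set_experiment("retail-sales-forecast")  # placeholder experiment name

    df = pd.read_csv(os.path.join(args.train, "train.csv"))
    X, y = df.drop(columns=["weekly_sales"]), df["weekly_sales"]  # placeholder columns

    with mlflow.start_run():
        mlflow.log_params({"n_estimators": args.n_estimators, "max_depth": args.max_depth})
        model = RandomForestRegressor(
            n_estimators=args.n_estimators, max_depth=args.max_depth
        ).fit(X, y)
        rmse = mean_squared_error(y, model.predict(X), squared=False)
        mlflow.log_metric("rmse", rmse)
        mlflow.sklearn.log_model(model, "model")
        # Printed to stdout so the SageMaker tuner can parse it via a metric regex.
        print(f"rmse={rmse}")
```

And a sketch of the corresponding estimator and tuning job on spot instances; the role, bucket, tracking URI and hyperparameter ranges are again placeholders.

```python
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

estimator = SKLearn(
    entry_point="train.py",
    framework_version="0.23-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role="arn:aws:iam::<account>:role/sagemaker-execution-role",  # placeholder
    use_spot_instances=True,  # managed spot training for cost savings
    max_run=3600,
    max_wait=7200,
    environment={"MLFLOW_TRACKING_URI": "http://mlflow.internal:5000"},  # placeholder
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="rmse",
    objective_type="Minimize",
    metric_definitions=[{"Name": "rmse", "Regex": "rmse=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "n-estimators": IntegerParameter(50, 300),
        "max-depth": IntegerParameter(3, 10),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://<bucket>/retail/train/"})  # placeholder S3 prefix
```

All the tuning runs and their metrics then show up in MLflow, where the model versions can be compared side by side to pick the best one.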
Conclusion
This basic ML workflow can be a good starting point; we can then expand it with data versioning, a CI/CD process, an online feature store, and more.
For more details, please take a look at the GitHub repo below. All constructive comments are welcome!