Metadata Management with SageMaker Experiments (3/4)
By Ikki Ikazaki, MLOps/Data Engineer at Analysis Group
This is the third part in a multi-part series on how KINTO Technologies Corporation (KTC) developed a system and culture of Machine Learning Operations (MLOps). Please take a look at Part 1 (How We Define MLOps in KTC) and Part 2 (Training and Prediction as Batch Pattern). The fourth and final post will discuss "Benkyo-kai", a series of internal study sessions held to build common knowledge about SageMaker and MLOps with other departments.
Part 1 already covered why metadata management is needed, so this chapter focuses on the situation at KTC. Since only four years have passed since the establishment of KINTO Corporation, it was urgent to build relationships between the business and data science teams by presenting clear output quickly. Sometimes this meant sacrificing code quality or documentation, which turned the data science process into essentially a black box. This approach is not bad in the initial phases of a project, but the situation has changed: the number of data science projects and team members is increasing. We therefore started to look for a better way to manage our work.
The key concepts of metadata management are a "sharable format" and "easy tracking". Since there are several data scientists in our team, we need a common format so that the team can manage its work in a standardized way. The aim of a common format is that every data scientist can reproduce the model building process and its results. This means tracking the development environment, the data used, code scripts, the ML algorithm, hyperparameters, and so on. There is a trade-off between a common format and easy tracking: the more detail you record so that others can easily understand your work, the more effort recording takes. Ideally, information that shows up repeatedly, such as the container image name, execution time, and data storage path, is recorded automatically. With SageMaker Experiments, once you write the definition of SageMaker Pipelines, the information about the development environment is integrated automatically, so data scientists can focus on the metrics they really want to track. Tracking takes only a few lines of code with the SageMaker SDK, so the implementation effort is minimal.
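To make the split between "recorded automatically" and "recorded by the data scientist" concrete, here is a minimal plain-Python sketch. It is not the SageMaker Experiments API: the `RunRecord` class, its field names, and the S3 path are hypothetical and exist only to illustrate the idea of a shared record format in which the repetitive environment details are filled in without any effort from the author.

```python
import json
import platform
import sys
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class RunRecord:
    """One shareable record of a model-building run (hypothetical format)."""

    # Supplied by the data scientist: the values they actually want to track.
    hyperparameters: dict
    metrics: dict
    data_path: str
    # Captured automatically: repetitive details about the environment.
    python_version: str = field(default_factory=lambda: sys.version.split()[0])
    os_name: str = field(default_factory=platform.system)
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_shareable_json(record: RunRecord) -> str:
    """Serialize with sorted keys so every record has the same layout."""
    return json.dumps(asdict(record), sort_keys=True)


record = RunRecord(
    hyperparameters={"max_depth": 5, "eta": 0.2},  # illustrative values
    metrics={"precision": 0.91},                   # illustrative value
    data_path="s3://example-bucket/train.csv",     # placeholder path
)
print(to_shareable_json(record))
```

The author of the run passes only three arguments; everything else is attached by default factories, which is roughly the division of labour the managed service provides.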
Introducing a managed service for metadata management is a good start, but just using it is not enough. SageMaker Experiments is composed of two kinds of building blocks, Experiment and Trial: an Experiment is a group of Trials and usually represents a specific use case, while a Trial stands for one iteration of ML processing. A proper naming convention is therefore required to manage those components so that data scientists can refer to each other's projects or iterations. Our team uses the naming convention for SageMaker Experiments shown in the table below.
We consider that metadata management is required mainly on two occasions: Experiment of pipeline and Experiment of EDA (Exploratory Data Analysis). Experiment of pipeline tracks the information about every step of a pipeline used in production, so it makes it possible to reproduce a run when the pipeline fails. By following the naming convention above, you can find the environmental information in 1.1's Trial (SageMaker Pipelines creates this trial automatically) and the analytical information in 1.2's Trial (you need to create this trial yourself in the pipeline).
Data scientists also often want to track their experimental code and model outcomes outside of the production pipeline. Experiment of EDA lets them track this information flexibly. The naming convention for this type is also more flexible than for Experiment of pipeline, but it may be updated in the future once we find patterns in the experiments.
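The relationship between Experiments, Trials, and a naming convention can be sketched in a few lines of plain Python. This is an illustration only: the classes below are stand-ins rather than SageMaker SDK objects, and the `<kind>-<use-case>` pattern, the use-case name, and the trial suffix are hypothetical, not the convention from our team's table.

```python
from dataclasses import dataclass, field


@dataclass
class Trial:
    """One iteration of ML processing, with whatever was tracked for it."""

    name: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Experiment:
    """A group of Trials that belongs to one use case."""

    name: str
    trials: dict = field(default_factory=dict)

    def add_trial(self, suffix: str, **metadata) -> Trial:
        # Prefix the trial with the experiment name so that any trial
        # can be traced back to its use case from the name alone.
        trial = Trial(name=f"{self.name}-{suffix}", metadata=metadata)
        if trial.name in self.trials:
            raise ValueError(f"trial already exists: {trial.name}")
        self.trials[trial.name] = trial
        return trial


def experiment_name(kind: str, use_case: str) -> str:
    """Build a name as '<kind>-<use-case>' (hypothetical convention)."""
    if kind not in {"pipeline", "eda"}:
        raise ValueError(f"unknown experiment kind: {kind}")
    return f"{kind}-{use_case.lower().replace(' ', '-')}"


# "Demand Forecast" and the suffix are placeholder values.
exp = Experiment(experiment_name("pipeline", "Demand Forecast"))
trial = exp.add_trial("run-001", precision=0.91)
print(exp.name, trial.name)
```

Rejecting unknown kinds and duplicate trial names in code is one way to keep a convention from eroding over time; a looser check may suit EDA experiments better.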
Using a managed service has the benefit of "standing on the shoulders of giants". SageMaker Experiments provides both a way to manage metadata and a way to analyze it. By looking up 1.2's Trial in an Experiment of pipeline, you can inspect the metrics of the model building pipeline. You can even load an Experiment into a pandas DataFrame and analyze its trials as a time series. For example, if you track the precision at every pipeline execution, you can draw a line chart to see how the metric has changed over time. This is useful for detecting so-called "model drift" early. The sooner you detect an anomaly, the sooner the ML continuous training process can begin.
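As a rough illustration of what such a time-series check might look like, the sketch below flags executions whose precision drops noticeably below the recent average. It uses a plain list instead of a pandas DataFrame to stay dependency-free, the `detect_drift` function and its `window` and `tolerance` parameters are assumptions for this example, and the precision values are made up. A simple threshold rule like this is only one of many possible drift checks, not necessarily the one we run.

```python
from statistics import mean


def detect_drift(precisions, window=3, tolerance=0.05):
    """Return the indexes of executions that look like drift.

    An execution is flagged when its precision is more than `tolerance`
    below the mean precision of the preceding `window` executions.
    """
    flagged = []
    for i in range(window, len(precisions)):
        baseline = mean(precisions[i - window:i])
        if baseline - precisions[i] > tolerance:
            flagged.append(i)
    return flagged


# Precision per pipeline execution, oldest first (made-up values).
history = [0.91, 0.90, 0.92, 0.91, 0.83, 0.82]
print(detect_drift(history))  # [4, 5]
```

Here the fifth and sixth executions are flagged, which is the point at which a retraining run could be triggered.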
What do you think about this article? Next time, the last post of this four-part series will introduce "Benkyo-kai" and how we share technical knowledge in our company(now available). Please follow the KTC Tech Blog for future posts to stay up to date.