In this article we discuss techniques for implementing and automating continuous integration (CI), continuous delivery (CD), and continuous training (CT) for machine learning (ML) systems. We also look at why machine learning models crash and burn in production, and at how building an MLOps pipeline addresses the problem.
Data science and ML are becoming core capabilities for solving complex real-world problems, transforming industries, and delivering value in all domains. Currently, the ingredients for applying effective ML are available to you:
- Large datasets
- Inexpensive on-demand compute resources
- Specialized accelerators for ML on various cloud platforms
- Rapid advances in different ML research fields (such as computer vision, natural language understanding, and recommendation systems).
Therefore, many businesses are investing in their data science teams and ML capabilities to develop predictive models that can deliver business value to their users.
But Why Do Machine Learning Models Crash And Burn In Production?
One magical aspect of software is that it just keeps working. If you code a calculator app, it will still correctly add and multiply numbers a month, a year, or 10 years later. The fact that the marginal cost of software approaches zero has been a bedrock of the software industry’s business model since the 1980s.
This is no longer the case when you are deploying machine learning (ML) models. Making this faulty assumption is the most common mistake of companies taking their first artificial intelligence (AI) products to market. The moment you put a model in production, it starts degrading.
Why Do ML Models Fail?
Your model’s accuracy will be at its best until you start using it. It then deteriorates as the world it was trained to predict changes. This phenomenon is called concept drift, and while it’s been heavily studied in academia for the past two decades, it’s still often ignored in industry best practices.
An intuitive example of where this happens is cybersecurity. Malware evolves quickly, and it’s hard to build ML models that reflect future, unseen behavior. Researchers from the University of London and the University of Louisiana have proposed frameworks that retrain malware detection models continuously in production. The Sophos Group has shown how well-performing models for detecting malicious URLs degrade sharply — at unexpected times — within a few weeks.
The key is that, in contrast to a calculator, your ML system does interact with the real world. If you’re using ML to predict demand and pricing for your grocery store, you’d better consider this week’s weather, the upcoming national holiday and what your competitor across the street is doing. If you’re designing clothes or recommending music, you’d better follow opinion-makers, celebrities and current events. If you’re using AI for auto-trading, bidding for online ads or video gaming, you must constantly adapt to what everyone else is doing.
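The degradation described in this section can be watched for with a very simple check. The following is a minimal sketch, not part of the project: it assumes that the true label of each prediction eventually becomes known, and the class name, window size and tolerance are all illustrative choices.

```python
from collections import deque


class AccuracyMonitor:
    """Track recent prediction outcomes and flag a drop below the baseline."""

    def __init__(self, baseline_accuracy, window_size=500, tolerance=0.05):
        # baseline_accuracy: accuracy measured on the holdout set at training time
        self.baseline_accuracy = baseline_accuracy
        self.tolerance = tolerance
        # Only the most recent window_size outcomes are kept
        self.outcomes = deque(maxlen=window_size)

    def record(self, prediction, actual):
        """Store whether one production prediction turned out to be correct."""
        self.outcomes.append(prediction == actual)

    def recent_accuracy(self):
        """Return the accuracy over the current window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True when recent accuracy has fallen more than tolerance below baseline."""
        recent = self.recent_accuracy()
        if recent is None:
            return False
        return recent < self.baseline_accuracy - self.tolerance
```

A check like this only detects drift after the fact. Its value is that it turns "the model is degrading" into a signal that can trigger a retraining pipeline.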
MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops). Practicing MLOps means that you advocate for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment and infrastructure management.
Data scientists can implement and train an ML model with good predictive performance on an offline holdout dataset, given relevant training data for their use case. However, the real challenge isn’t building an ML model; it is building an integrated ML system and continuously operating it in production.
MLOps : CI/CD pipeline automation
For a rapid and reliable update of the pipelines in production, you need a robust automated CI/CD system. This automated CI/CD system lets your data scientists rapidly explore new ideas around feature engineering, model architecture, and hyperparameters. They can implement these ideas and automatically build, test, and deploy the new pipeline components to the target environment.
The following diagram shows the implementation of the ML pipeline using CI/CD, which has the characteristics of the automated ML pipelines setup plus the automated CI/CD routines.
Let’s take up a task to implement MLOps
Prerequisites to understand the work done in this project :
- Linux OS
- Very basic knowledge of Shell/Unix commands
- Deep learning basics
Steps to be followed :
To implement this project, we follow these steps :
1. Create a container image that has Python 3 and Keras or NumPy installed, using a Dockerfile.
2. When we launch this image, it should automatically start training the model in the container.
3. Create a job chain of job1, job2, job3, job4 and job5 using the Build Pipeline plugin in Jenkins.
4. Job1 : Pull the GitHub repo automatically when a developer pushes to GitHub.
5. Job2 : By looking at the code or program file, Jenkins should automatically start the container image that has the matching machine learning software and interpreter installed, deploy the code and start training (e.g. if the code uses a CNN, then Jenkins should start the container that already has all the software required for CNN processing).
6. Job3 : Train your model and report its accuracy or metrics.
7. Job4 : If the accuracy is less than 80%, tweak the machine learning model architecture.
8. Job5 : Retrain the model or notify that the best model has been created.
9. Create one extra job, job6, for monitoring : if the container where the app is running fails for any reason, this job should automatically start the container again from where the last trained model left off.
Steps / Approach to be followed :
First, we need to create a directory on the base OS (in my case Red Hat Enterprise Linux 8) in which to create our Dockerfile.
# mkdir /ws
# cd /ws
# vim Dockerfile
After saving it we need to build the image using this command :
# docker build -t <tag_name>:<version_tag> .
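Example : # docker build -t mlimg:v1 .
Take note of the dot ( . ) at the end : it sets the build context to the current directory, so Docker looks there for the Dockerfile and for any files the Dockerfile copies into the image.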
After building the Docker image, we create another directory for Jenkins to download the files into :
# mkdir /ml_task
Now we move on to creating the Jenkins jobs.
Job1 (Github pull):
Pull the GitHub repo automatically when a developer pushes to GitHub.
This job downloads our code from GitHub. We set a webhook in GitHub to trigger it, i.e. every time a developer pushes new code to GitHub, this job is triggered and the new code is downloaded.
This setting automatically pulls and downloads the Github code whenever there is a newer version of the code available.
Job2 : By looking at the code or program file, Jenkins should automatically start the container image that has the matching machine learning software and interpreter installed, deploy the code and start training (e.g. if the code uses a CNN, then Jenkins should start the container that already has all the software required for CNN processing).
This job runs after job1 succeeds, since we enabled the trigger on a stable build of job1. Using the image built above, a container is launched if the code is CNN code ( and if one is not already running ).
Note that it can be any code you wish. I happen to use a CNN here to show how well this setup tunes the hyperparameters for me until the desired accuracy is achieved.
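The decision made in this job can be sketched in a few lines. This is only an illustration of the idea, not the actual job configuration: the keyword table, the fallback image name python-basic:v1 and the container name ml_container are hypothetical. Only mlimg:v1 and /ml_task come from the steps above.

```python
# Hypothetical mapping from a keyword found in the source code to a Docker image.
# mlimg:v1 is the image built earlier in this article.
IMAGE_BY_KEYWORD = {
    "Conv2D": "mlimg:v1",
}
DEFAULT_IMAGE = "python-basic:v1"  # hypothetical fallback image


def pick_image(source_code):
    """Return the image whose keyword appears in the code, else the fallback."""
    for keyword, image in IMAGE_BY_KEYWORD.items():
        if keyword in source_code:
            return image
    return DEFAULT_IMAGE


def docker_run_command(source_code, workspace="/ml_task", name="ml_container"):
    """Build (but do not execute) the docker run command for the given code."""
    image = pick_image(source_code)
    return [
        "docker", "run", "-dit",
        "--name", name,
        "-v", f"{workspace}:{workspace}",  # share the downloaded code with the container
        image,
    ]
```

The function only returns the command as a list, so it can be inspected or passed to subprocess.run by the caller. Detecting the framework by searching for a keyword is crude, but it matches the "look at the code" requirement of the task.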
Job3 ( Train and Predict Accuracy ):
This job is triggered automatically after the successful completion of job2. For training, I use a program as basic as the LeNet model and train it on the MNIST dataset ( as we know how well the LeNet model performs on MNIST ).
But before training and predicting, I intentionally made a few changes to the code ( i.e. tweaked the hyperparameters here and there ), as I want Jenkins to retrain automatically in the next step until the desired accuracy is achieved.
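A training script for this job could look roughly like the following. It is a minimal sketch under several assumptions: TensorFlow's bundled Keras is installed in the container, the network is a cut-down LeNet-style model with a single Conv2D and MaxPooling2D pair, and the accuracy is stored as a percentage. The filter counts, layer sizes and function names are illustrative, not the author's actual code.

```python
def build_model(num_classes=10):
    """Build a small LeNet-style CNN for 28x28 grayscale images."""
    # Imported lazily so that save_accuracy below works without TensorFlow.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


def train_and_evaluate(epochs=1):
    """Train on MNIST and return the test accuracy as a fraction in [0, 1]."""
    from tensorflow import keras

    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    # Add a channel axis and scale pixel values to [0, 1]
    x_train = x_train[..., None].astype("float32") / 255.0
    x_test = x_test[..., None].astype("float32") / 255.0

    model = build_model()
    model.fit(x_train, y_train, epochs=epochs, verbose=2)
    _, accuracy = model.evaluate(x_test, y_test, verbose=0)
    return accuracy


def save_accuracy(accuracy, path="accuracy.txt"):
    """Write the accuracy as a percentage so the next job can read it."""
    with open(path, "w") as handle:
        handle.write(f"{accuracy * 100:.2f}\n")
```

Calling save_accuracy(train_and_evaluate(epochs=1)) at the end of the script trains for one epoch and leaves the result in accuracy.txt for the next job.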
Job4 ( Retrain ) :
This job is triggered automatically after the successful completion of Job3. In this job we read the accuracy from the accuracy.txt file into a variable using a shell command. Then, using conditional and loop statements, we have the job run the retrain.py and new.py files.
About the files :
- accuracy.txt : This file stores the accuracy obtained from the base model that we ran in the previous step, so that this job can read it.
- retrain.py : This is an intermediate file that reads the input file ( the main file ) into a variable, goes through the code line by line, and builds an output file with the added code inserted. The added code in my case is the layers ( Conv2D and MaxPooling ); you can also add Hidden/Dense layers using this approach.
Finally all these lines are written to a new output file ( new.py in my case ).
You can overwrite the same input file if you want.
- new.py : This file is just the input file plus the code added by retrain.py.
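The line-by-line rewriting that retrain.py performs can be sketched as follows. This is an assumption-laden illustration rather than the author's file: it assumes the model is built with model.add(...) calls, that the new layers go directly before the Flatten layer, and the layer parameters are made up.

```python
# Hypothetical lines to add; the filter count and kernel size are assumptions.
EXTRA_LAYERS = [
    "model.add(Conv2D(32, (3, 3), activation='relu'))",
    "model.add(MaxPooling2D(pool_size=(2, 2)))",
]


def add_layers(lines, marker="model.add(Flatten"):
    """Return a copy of lines with EXTRA_LAYERS inserted before the marker line."""
    output = []
    inserted = False
    for line in lines:
        stripped = line.lstrip()
        if not inserted and stripped.startswith(marker):
            # Reuse the indentation of the marker line for the new lines
            indent = line[: len(line) - len(stripped)]
            output.extend(f"{indent}{extra}\n" for extra in EXTRA_LAYERS)
            inserted = True
        output.append(line)
    return output


def rewrite(input_path, output_path):
    """Read the input script and write the extended version to output_path."""
    with open(input_path) as source:
        lines = source.readlines()
    with open(output_path, "w") as target:
        target.writelines(add_layers(lines))
```

Writing to a separate output file keeps the original script untouched. Note that each extra unpadded convolution and pooling pair shrinks the feature map, so only a few such pairs fit on a 28x28 MNIST image.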
Here, as you can see, if the desired accuracy is not reached then another layer is added. You can add as many layers as you want. After that I retrain with more epochs, since more epochs usually improve accuracy, up to the point where the model starts to overfit.
- You can tweak any hyperparameter this way, such as the number of neurons in the Dense layer, the activation function, etc.
- The syntax used here is very sensitive. I had a real struggle while creating this job ( it required almost 52 builds ).
But I would like to draw your attention to a particular Unix command, which adds to the existing epochs here :
sudo sed -i '/^epochs=*/a epochs=epochs+1' /ml_task/new.py
This command has various components that need to be understood :
- sed -i : sed is a UNIX stream editor; the -i option makes it edit the file in place instead of printing the result
- /^epochs=*/ : the text between the two slashes is a regular expression that selects the lines to act on. ^ anchors the match to the start of the line, and =* means zero or more = characters, so any line that begins with epochs matches
- a epochs=epochs+1 : the a ( append ) command adds the line epochs=epochs+1 after every matching line
- /ml_task/new.py : new.py is the file in which the changes are to be made
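To make the effect of this sed command concrete, here is a small Python equivalent of the same edit. It is a sketch for illustration only; the function name is made up, and it works on a string instead of editing a file in place.

```python
import re

# Same regular expression as the sed address: a line that starts with "epochs"
# followed by zero or more "=" characters.
PATTERN = re.compile(r"^epochs=*")


def append_after_matches(text, new_line="epochs=epochs+1"):
    """Return text with new_line added after every line that matches PATTERN."""
    result = []
    for line in text.splitlines():
        result.append(line)
        if PATTERN.search(line):
            result.append(new_line)
    return "\n".join(result) + "\n"
```

For the input "epochs=1" followed by a model.fit(...) line, the result has epochs=epochs+1 inserted between the two. One thing to be aware of: the appended line itself starts with epochs, so running the edit a second time on the same file matches it as well and appends after both lines.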
Finally, the next job is triggered only if the desired accuracy is achieved; otherwise this build fails.
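The pass or fail decision can be expressed as a small script whose exit code Jenkins uses to mark the build. This is a minimal sketch, assuming accuracy.txt contains a single number in percent and that the target is the 80% from the task description; the function names are illustrative.

```python
THRESHOLD = 80.0  # percent, taken from the task description


def read_accuracy(path="accuracy.txt"):
    """Read the accuracy (a single number in percent) written by the training job."""
    with open(path) as handle:
        return float(handle.read().strip())


def meets_threshold(accuracy, threshold=THRESHOLD):
    """True when the accuracy is at or above the threshold."""
    return accuracy >= threshold


def main(path="accuracy.txt"):
    """Return 0 when the target is met and 1 otherwise, for use as an exit code."""
    accuracy = read_accuracy(path)
    if meets_threshold(accuracy):
        print(f"accuracy {accuracy:.2f}% reached the threshold")
        return 0
    print(f"accuracy {accuracy:.2f}% is below {THRESHOLD}%")
    return 1
```

Ending the script with sys.exit(main()) hands the result to the shell. A Jenkins "Execute shell" step treats a non-zero exit code as a failed build, so the downstream job only runs when the threshold is met.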
Job5 ( Notification ) :
This job runs on successful completion of the previous job. It sends a mail to the developer. This can be done with the Email Notification option within Manage Jenkins, or with a plugin such as the Job Direct Mail Plugin available in the Plugins section.
But to keep it simple I have used a very simple Python program using the smtplib and ssl libraries, which does exactly what the Email Notification feature of Jenkins can do very easily.
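A notification script built on smtplib and ssl might look like this. It is a sketch, not the author's program: the subject line, message text and function names are illustrative, and the SMTP host, port and credentials have to be supplied by the caller.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_message(sender, recipient, accuracy):
    """Create the notification mail; accuracy is given in percent."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "Model training finished"
    message.set_content(f"The model reached an accuracy of {accuracy:.2f}%.")
    return message


def send_message(message, host, port, username, password):
    """Send the mail over an SSL-encrypted SMTP connection."""
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(host, port, context=context) as server:
        server.login(username, password)
        server.send_message(message)
```

Building the message is kept separate from sending it, so the content can be checked without a mail server. In a Jenkins job the password is better read from a credential or an environment variable than written into the script.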
Job6 ( Monitor ) :
This is perhaps the most important job, as it keeps the whole setup running for us. Of course, the same thing can easily be achieved with Kubernetes, which makes container monitoring easy and the system more powerful.
But rather than jumping to something like that for now, I have tried to monitor the container in the simplest of ways : if it goes down, we want it to start again, which is exactly what I’ve done here.
Hence, this job reruns the responsible container if it stops due to any reason.
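The check this job performs can be sketched with two Docker CLI calls, docker inspect and docker start. The sketch below is an illustration, not the job's actual shell step; the container name is whatever was chosen when the container was launched, and the run parameter exists only so the logic can be exercised without Docker.

```python
import subprocess


def is_running(name, run=subprocess.run):
    """Ask Docker whether the named container is currently running."""
    result = run(
        ["docker", "inspect", "-f", "{{.State.Running}}", name],
        capture_output=True, text=True,
    )
    # A non-zero return code means the container does not exist
    return result.returncode == 0 and result.stdout.strip() == "true"


def ensure_running(name, run=subprocess.run):
    """Start the container if it is not running; return True if a start was issued."""
    if is_running(name, run=run):
        return False
    run(["docker", "start", name], capture_output=True, text=True)
    return True
```

docker start brings back the existing container with its filesystem intact, so files written before the failure are still there. If the container was removed entirely, docker start fails and the container has to be created again with docker run.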
Create a local git repository and set up hooks for auto push and auto trigger.
Now we see the Jenkins jobs building automatically one after the other.
Here we see job3 running. Let me give you a few insights into how the code executes.
Here you can see I had one Conv2D and one MaxPooling2D layer before the Flatten layer, ran only one epoch, and got an accuracy of around 97%. It shows how powerful the LeNet model is on the MNIST dataset. Keep in mind that I tuned it intentionally.
Let’s see what happened after Job4 was triggered.
We see another set of layers added to the existing code. One interesting point is that I maintain a separate file for this. The source code is still intact and untouched, as the intention is only to test.
When the desired accuracy still wasn’t achieved after training, we see the model being retrained with more epochs.
A final glimpse of the new.py file after the epochs were updated.