MLOps – Machine Learning Operations

Kiran Kumar Nallagonda


The continuous process of operationalizing machine learning models to get business value requires observability, monitoring, feedback mechanism to retrain the models whenever necessary.

Gartner predicted in 2020 that, 80 percent of AI projects would remain alchemy i.e. run by wizards whose talents will not scale in the organization and that only 20 percent of analytical insights will deliver business outcomes by 2022.  Rackspace corroborates that claim in a survey completed in January of 2021 saying that 80 percent of companies are still exploring or struggling to deploy ML models.

 The general challenges are that most of the models are difficult to use, hard to understand, have least explainability and are computationally intensive. With these challenges, it is very hard to extract the business value. The goal of MLOps is to extract business value from the data by efficiently operationalizing ML models at scale.  A data scientist may find a model which functions as per business requirements, but deploying the model into production with observability, monitoring and feedback loop complete with automated pipelines, at low expense, high reliability and at scale require entirely different set of skills. This can be achieved in parallel collaboration with DevOps teams.

An ML engineer builds ML pipelines that can reproduce the results of the models discovered by the data scientist automatically, inexpensively, reliably and at scale.

MLOps Principles

Here are a few principles to keep them in check for better MLOps:

         a) Tracking or Software Configuration

         ML models are software artifacts that need to be deployed. Tracking provenance is critical for deploying any good software and typically handled through version control systems. But, building ML models depends on complex details such as data, model architectures, hyper parameters and external software.  Keeping track of these details is vital, but can be simplified greatly with the right tools, patterns, and practices. For example, this complexity could be simplified by adopting dockerization and/or kubernetization of all components and overlaying usual DevOps version controls.

         b) Automation and DevOps

         Automation is key to modern DevOps, but it’s more difficult for ML models. In a traditional software application, a continuous integration and continuous delivery (CI/CD) pipeline would pick up some versioned source code for deployment. For an ML application, the pipeline should not only automate training models, but also automate model retraining along with archival of training data and other artifacts.

         c) Monitoring/Observability

         Monitoring software requires good logging and alerting, but there are special considerations to be made for ML applications. All predictions generated by ML models should be logged in such a way that enables traceability back to the model training job. ML applications should also be monitored for invalid predictions or data drift, which may require models to be retrained.

         d) Reliability

         ML models can be harder to test and computationally more expensive than traditional software. It is important to make sure your ML applications function as expected and are resilient to failures. Getting reliability right for ML requires some special considerations around security and testing.

         e) Cost Optimization

         MLOps are more deeply involved with cost intensive infrastructure resources and personnel. Continuous cost monitoring and making necessary adjustments from time to time to optimize the cost as well as to drive more business value is extremely important. For some of the models, training could be the cost intensive part of the work, when compared to entire life cycle of the model and its operations. But this cost equation could change entirely when model gets deployed and scaled to numerous instances. For example, initially Alexa’s speech to text, NLP, NLG related model training was cost intensive in terms of collecting, processing the data and training the models using expensive computational resources. After the models are deployed on the cloud and scaled to planet level, most of the cost shifted to inference layer part of the MLOps.

         These kinds of cost dynamics can be tackled by estimating and monitoring the costs, adopting right technologies, architectures and processes.

         In the above example, inference layer cost is off-loaded to the device itself partially, instead of utilizing the cloud resources in every instance.

         Even the training cost will have different equation when federated learning kind of architectures are adopted. Apart from these dynamics, standardizing on the right tools for tracking (and training) models will noticeably reduce the time and effort necessary to transfer models between the data science and data engineering teams.

Model Registry

A model registry acts as a location for data scientists to store models as they are trained, simplifying the bookkeeping process during research and development. Models retrained as part of the production deployment should also be stored in the same registry to enable comparison to the original versions. 

A good model registry should allow tracking of models by name/project and assigning a version number. When a model is registered, it should also include metadata from the training job. At the very least, the metadata should include:

  • Location of the model artifact(s) for deployment.  
  • Revision numbers for custom code used to train the model, such as the git version hash for the relevant project repository.
  • Information on how to reproduce the training environment, such as a Dockerfile, Conda environment YAML file, or PIP requirements file.
  • References to the training data, such as a file path, database table name, or query used to select the data.

Without the original training data, it will be impossible to reproduce the model itself or explore variations down the road. Try to reference a static version of the data, such as a snapshot or immutable file. In the case of very large datasets, it can be impractical to make a copy of the data. Advanced storage technologies (e.g. Amazon S3 versioning or a metadata system like Apache Atlas) are helpful for tracking large volumes of data.

Having a model registry puts structure around the handoff between data scientists and engineering teams. When a model in production produces erroneous output, registries make it easy to determine which model is causing the issue and roll back to a previous version of the model if necessary. Without a model registry, you might run the risk of deleting or losing track of the previous model, making rollback tedious or impossible. Model registries also enable auditing of model predictions.

Some data scientists may resist incorporating model registries into their workflows, citing the inconvenience of having to register models during their training jobs. Bypassing the model-registration step should be discouraged as a discipline and disallowed by policy. It is easy to justify a registry requirement on the grounds of streamlined handoff and auditing, and data scientists usually come to find that registering models can simplify their bookkeeping as they experiment.

Good model-registry tools make tracking of models virtually effortless for data scientists and engineering teams; in many cases, it can be automated in the background or handled with a single API call from model training code.

Model registries come in many shapes and sizes to fit different organizations based on their unique needs.  Common options fall into a few categories:

  • Cloud-provider registries such as Sagemaker Model Registry or Azure Model Registry.  These tools are great for organizations that are committed to a single cloud provider.
  • Open-source registries like MLflow, which enable customization across many environments and technology stacks. Some of these tools might also integrate with external registries; for instance, MLflow can integrate with Sagemaker Model Registry.
  • Registries incorporated into high-end data-science platforms such as Dataiku DSS or DataRobot. These tools work great if your data scientists want to use them and your organization is willing to pay extra for simple and streamlined ML pipelines.

Feature Stores

Feature stores can make it easier to track what data is being used for ML predictions, but also help data scientists and ML engineers reuse features for multiple models. A feature store provides a repository for data scientists to keep track of features they have extracted or developed for models. In other words, if a data scientist retrieves data for a model (or engineers a new feature based on some existing features), they can commit that to the feature store. Once a feature is in the feature store, it can be reused to train new models – not just by the data scientist who created it, but by anyone within your organization who trains models.

The intent of a feature store is not only to allow data scientists iterate quickly by reusing past work, but also to accelerate the work for productionizing models. If features are committed to a feature store, your engineering teams can more easily incorporate the associated logic into the production pipeline. When it’s time to deploy a new model that uses the same feature, there won’t be any additional work to code up new calculations.  

Feature stores work the best for organizations that have commonly used data entities that are applicable to many different models or applications. Take, for example, a retailer with many e-commerce customers – most of that company’s ML models will be used to predict customer behavior and trends.  In that case, it makes a lot of sense to build a feature store around the customer entity. Every time a data scientist creates a new feature to better represent customers, it can be committed to the feature store for any ML model making predictions about customers. 

Another good reason to use feature stores is for batch-scoring scenarios. If you are scoring multiple models on large batches of data (rather than one-off/real-time) then it makes sense to pre-compute the features. The pre-computed features can be stored for reuse rather than being recalculated for every model.

MLOps Pipeline

More efficient pipelines are constructed in combination with DevOps.  Here are outlined steps:

  1. Establish Version Control
  2. Implement CI/CD pipeline
  3. Implement proper logging, centralized log stash, retrieval and querying the logs.
  4. Monitor
  5. Iterate for continuous improvement


Developing an ML production pipeline that delivers business value is extremely challenging and can be mitigated with right deployment of resources, tools, personnel, expertise, and best practices. Remember to keep it simple, iterate it to continuously improve till it meets necessary business value.


Algorithmic Portfolio Management

Photo by Ramón Salinero on Unsplash

Bhanu Nallagonda

When everything goes algorithmic nowadays, why not Portfolio Management?

“Algorithmic Portfolio Management” gets a few thousand results on Google, compared to about 9 million results for “Algorithmic Trading” and in LinkedIn training, one gets zero results as on date!

In algorithmic trading or algo trading for short, preprogrammed algorithms or set of processes execute the trades. Its volumes have steadily increased over years, reaching about 60-80% of the total trading volumes depending on the markets, higher in advanced equity and forex markets and with about 40-50% of trading volume being generated in commodity markets. It also increases volatility and certain risks with millions to billions of market value getting wiped off within minutes and then recovering.

The top reasons for using algo trading are – ease of use, improved trader productivity, consistency of execution performance, lower costs/commissions, better monitoring and high speed/lower latency. Money management fund managers use algo trading to implement their investment decisions. There are traditional strategies such as mean reversion, price or earnings momentum, value and multi-factor or combination of multiple strategies and machine learning based ones such as artificial neural networks, k-NN and Bayes etc.

One specific trend over the years has been diminishing alpha and it is increasingly becoming difficult for actively managed funds to beat their benchmark indices, after expenses. ETFs are making a comeback or gaining mind and market share in the recent years. In the US, passive ETFs have attracted more investments than passive mutual funds. In order to keep the growing tendency of operational and management costs going up, there is an increasing need to leverage technology to be more efficient and effective.

Then there are quant funds, in which securities to invest are chosen through quantitative analysis based on numerical data and without any subjective intervention. While their cost of management is lower as fund managers’ efforts and interventions are much lower, their performance has not been consistent over long time.

So, how is Algorithmic Portfolio Management different from algo trading and is there a case for it to be similarly popular going forward, in this algorithm driven world? It is likely to be so, and let us look at it, along with the causes and trends that would drive it up in future.

Robo-advisory services, which provide algorithmic financial planning services to individuals after collecting their information, have been getting popular. They started with passive indexing strategies and moved onto more sophisticated optimization with variants of modern portfolio theory, tax loss harvesting and retirement planning.

With the advent of ever-increasing computational power and availability of broader and deeper data, Machine Learning brings in more sophistication to the algorithms. Machine Learning (ML) and Artificial Intelligence (AI) make analysis of new forms of data such as unstructured ones, hitherto not practical. While absence of investors’ human biases and subjective judgements are touted as advantages, AI/ML models can have their own biases depending on the data fed to them, deficiencies and limitations of the algorithms used and may even reflect the biases and preferences of algorithm constructors.

In Algorithmic Portfolio Management, a portfolio of assets and sub-assets need to be managed for better risk adjusted returns. That is a key difference when compared to algo trading, which is more of one dimensional, focused on single security at a time. So, the key aspects of Algorithmic Portfolio Management are:

  • Asset Allocation
  • Portfolio Construction
  • Portfolio Execution
  • Performance Monitoring and Evaluation
  • Rebalancing

Asset allocation is the single biggest factor that determines a large percentage of the returns or variance in the returns of the portfolio over long periods. Efficiency Frontier can optimize the portfolio for low risk and high expected return or vice versa. Diversification with negatively or low correlated securities lower the standard deviation or the risk. Monte Carlo simulation is used for risk analysis by producing distributions of possible outcomes. Apart from these age old and traditional techniques, principal component analysis can be used for feature selection i.e. to choose the parameters and aspects that matter and ML algorithms can be used for better optimization. While there are diminishing benefits with greater diversification, machine driven algorithmic approach can make management of higher number of securities more effective and easier than with human based processes.

Portfolio construction involves careful selection of securities for better risk adjusted returns. Algorithmic frameworks including macro and micro level decisions can be used for greater alignment of investment objectives and risk profiles.

Portfolio execution in terms of buying and selling securities can leverage algo trading for lower market impact and better outcomes. Higher percentage of larger ticket size trades of institutions tend to use algo trading more than for smaller trades.

With the availability real time and near real time data and computational power, portfolio performance monitoring and evaluation can be more frequent and thus triggering effective rebalancing in near real time based on market data for more optimal returns.

Passive rebalancing including calendar and/or percentage-based rebalancing is used for robo-advisory approaches. Algorithmic management can bring in more sophisticated and optimization for active and dynamic asset allocation and rebalancing driven by them. Dynamic asset allocation is not driven by fixed percentage allocations, but involves a more nuanced approach of changing the securities and their constitution based on analysis or algorithmic output.

In conclusion, fund managers are expected to leverage algorithmic portfolio management to complement subjective decisions, to reduce the costs of management and for greater alpha, though portfolios may not be driven entirely by algorithms in near future.