If you know, you know—deploying an ML model is just the beginning. Even the best models can fail in dynamic real-world environments without proper monitoring of ML models in production, leading to wasted resources and missed opportunities. Why? Because real-world data is never the warm bath of curated historical data the model was trained on, and conditions change in real time. Monitoring brings order to this uncertainty, triggering prepared responses to internal issues such as accuracy drops, pipeline failures, or latency spikes.
More than just troubleshooting, ML monitoring drives purpose. It aligns metrics with the project’s goals, ensures the model delivers value, and makes the processes under the hood explainable, which is of the utmost importance for observability. Monitoring also uncovers insights, revealing growth opportunities or areas for experimentation. Oftentimes, we call those areas—“challenges.”
Equally important, monitoring benefits all stakeholders by making efforts measurable and impactful. It ensures observability, detects inefficiencies, and keeps the entire IT ecosystem running smoothly. In short, monitoring is the key to long-term ML success.
With this article, you’ll understand the basics of ML model monitoring, pick up best practices applicable to your project, and learn from the experience of other companies.
In machine learning, the production phase begins at the point of highest value delivery—the model is fresh and crispy, works as expected, and offers the best UX. Once everything is deployed, the project starts interacting with real-world data, users, and systems, where its true impact happens. The everyday workload influences the model and degrades its performance over time, as the data it consumes changes in real time.
This phenomenon, known as model decay, occurs when the data distribution shifts, user behavior evolves, or external factors change. Recent research found that 91% of the ML models studied showed a significant decline in performance quality over time, a clear indication of model decay.
For example, a recommendation system for e-commerce might perform well during the holiday season but falter as customer preferences shift afterward. Monitoring allows teams to detect these changes early, enabling retraining or fine-tuning to keep the model relevant and effective.
Proper model monitoring makes these changes visible and controllable, letting the core functions deliver value, remain reliable, and adapt to changing conditions.
Your project is a large ecosystem with a machine learning core, data pipelines, APIs, infrastructure, code, management interface, data storage, and user interactions. Any issue in this ecosystem can impact the model’s performance.
For instance, a data pipeline failure might feed incomplete or corrupted data into the model, leading to inaccurate predictions. Similarly, infrastructure bottlenecks can slow down inference times, frustrating users.
Monitoring the entire ecosystem—not just the model—ensures that all components work harmoniously.
You can find a reliable example of robust monitoring systems in production environments in Uber's real-time monitoring practices. They use advanced monitoring tools to track the health of their machine learning models, data pipelines, and infrastructure.
Uber's Michelangelo platform is designed to monitor model performance, detect anomalies, and ensure data consistency across its ecosystem. These features ensure that their models, such as ETA predictions or surge pricing algorithms, remain accurate and reliable in real time.
In production, the correct monitoring setup becomes the single source of truth for all team members. Whether it’s data scientists, MLOps engineers, or business stakeholders, everyone relies on monitoring dashboards and alerts to understand how the model is performing.
For example, suppose an ML fraud detection system starts flagging an unusually high number of transactions. In that case, monitoring can help pinpoint whether the issue lies in the model, the data, or external factors like a spike in fraudulent activity.
Tools like Prometheus, Grafana, and specialized ML monitoring platforms like Arize AI or Evidently AI provide actionable insights that keep teams aligned and informed.
Monitoring is consequential in industries where ML models directly impact critical decisions. In healthcare, a diagnostic model might perform well initially but could degrade as new medical data becomes available. Without monitoring, this decay could lead to misdiagnoses.
There’s a story for that as well! The Epic Sepsis Model (ESM) was designed to predict the onset of sepsis in hospitalized patients. Initially, it performed well during its development and early deployment phases. However, studies later revealed that its performance degraded significantly over time due to changes in patient populations, clinical practices, and data distributions.
The research found that the ESM had a sensitivity of only 33%, meaning it missed two-thirds of sepsis cases while also generating a high rate of false alarms. This degradation in performance underscores the critical need for continuous monitoring and recalibration of diagnostic models in healthcare.
Several challenges can creep in and compromise performance when deploying ML models in production. One of the foremost challenges is data distribution changes, which we’ve mentioned above. Training-serving skew is another critical factor. The environment used during training can differ significantly from the live production environment. These differences may result in a skew between what the model learned and how it performs during inference. This is why aligning the training pipeline with the serving infrastructure is key to maintaining expected performance.
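One common way to reduce training-serving skew, shown here only as a minimal sketch, is to bundle preprocessing and the model into a single scikit-learn Pipeline so that exactly the same transformations run at training time and at inference time.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# One artifact holds both the preprocessing and the model, so serving cannot
# silently apply different scaling than training did.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# model.fit(X_train, y_train)     # training (X_train, y_train are placeholders)
# model.predict(X_live)           # serving applies the identical, fitted scaler
```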
Model readiness is a must-have attribute; even a well-functioning system requires thorough pre-deployment testing to ensure it meets all performance and safety standards once it is live. Equally important is maintaining the health of the data pipeline. Inconsistent data ingestion, delayed processing, or integration bottlenecks can trigger cascading failures that adversely affect the model.
Closely related is the issue of model ownership in production. The lines between development and operations often blur, leading to confusion about who should monitor, update, or troubleshoot the whole ML ecosystem. Clear responsibility and robust documentation are essential for smooth transitions and accountability.
Model and concept drift further complicate matters as the underlying relationships in data shift over time.
Regular evaluation cycles and adaptive retraining become imperative to counterbalance these drifts. In addition, many production-grade models are black boxes—high-performing yet opaque. While they can yield impressive results, their lack of transparency often demands additional interpretability tools to help diagnose issues quickly.
Another emerging challenge is the presence of concerted adversaries. Malicious actors can subtly manipulate inputs to degrade performance, a phenomenon well-documented in adversarial machine learning research.
An underperforming system might not necessarily be the model’s fault; sometimes, infrastructure and resource limitations hinder performance. Finally, cases of extreme events or outliers, along with persistent data quality issues, can disrupt output in unexpected ways.
Let’s reveal the key aspects of monitoring ML models.
So, your machine learning monitoring starts with the very first session between business and technical stakeholders on business goals and basic vital metrics. You can only track and improve model performance after selecting the right metric for it.
Choose a metric that means the same throughout your entire ML ecosystem, is simple to understand, is trackable in real-time, and can trigger alerts for quick problem resolution.
Rework (if you haven’t yet) your business goals into ML KPIs that you can analyze and check against.
| Business KPI sounds like | ML KPI might look like | Actual technical conditions of the ML KPI |
|---|---|---|
| Execute the ML function that will bring the Expected Value. | How fast/precise should the ML model provide valuable results for the end user above the Expected Value threshold? | For UX and the “fast” part: latency. For the “precise” part: depending on what your project does, there should always be a distribution of model outputs, from acceptable to bad ones. |
Based on the objective (say, a fraud detection model), the following metrics are relevant (but not limited to); a short code sketch for computing them follows this list:
1. Precision: Measures the proportion of flagged transactions that are actually fraudulent. High precision ensures fewer false positives, which is critical for user trust.
2. Recall (Sensitivity): Measures the proportion of actual fraudulent transactions that the model successfully identifies. High recall ensures it catches most fraud cases.
3. F1 Score: A harmonic mean of precision and recall, useful when balancing both metrics.
4. False Positive Rate (FPR): Tracks the proportion of legitimate transactions incorrectly flagged as fraud. This is critical for monitoring user experience.
5. Latency: Measures the time taken for the model to process a transaction. Low latency is essential for real-time fraud detection in payment systems.
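To make this concrete, here is a minimal sketch of computing the classification metrics above with scikit-learn. The arrays and the latency note are illustrative placeholders, not data from a real fraud system.

```python
# Hypothetical labels: 1 = fraudulent transaction, 0 = legitimate.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])   # actual outcomes
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])   # model decisions

precision = precision_score(y_true, y_pred)    # share of flagged transactions that are fraud
recall = recall_score(y_true, y_pred)          # share of fraud cases the model catches
f1 = f1_score(y_true, y_pred)                  # harmonic mean of precision and recall

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)                           # legitimate transactions flagged as fraud

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} fpr={fpr:.2f}")

# Latency is measured around the prediction call itself, e.g. with time.perf_counter()
# before and after model.predict(...) (model and features are placeholders here).
```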
These metrics, together with the more technical ones, only scratch the surface of the functional and operational monitoring levels.
These levels, combined with various tools, will shape your own style of model performance monitoring.
The actions and techniques to apply vary depending on where in the ecosystem the monitoring surfaces an issue.
Solutions to apply to data quality challenges:
Data drift occurs when there is a significant shift in the distribution of input data between the training phase and the production environment. This gradual change in data patterns can degrade model performance over time, though it typically happens more slowly compared to issues related to data quality.
Machine learning model monitoring at the feature level proves invaluable during analysis when trying to understand its performance and behavior. Feature drift can be tracked by observing changes in the statistical characteristics of individual feature values over time, such as mean, standard deviation, frequency, and other key metrics.
Data drift detection techniques
You can analyze distribution changes using distance metrics and statistical tests to identify data drift. Key metrics for comparing historical and current features include mean, standard deviation, minimum and maximum values, and correlation. Tests like Kullback–Leibler divergence, Kolmogorov-Smirnov statistics, Population Stability Index (PSI), and Hellinger distance are commonly used for continuous features. Methods such as the chi-squared test, entropy, cardinality, or frequency analysis are effective for categorical features.
Some platforms offer built-in monitoring tools for outlier detection using machine learning and unsupervised techniques. When working with datasets containing a large number of features, dimensionality reduction methods like PCA can help simplify the data before applying statistical tests.
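To make a couple of these tests concrete, here is a minimal sketch using SciPy: a Kolmogorov-Smirnov test for one continuous feature and a chi-squared test for one categorical feature. The synthetic samples and the 0.05 threshold are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # feature values at training time
current = rng.normal(loc=0.3, scale=1.0, size=5_000)     # slightly shifted production values

# Continuous feature: two-sample Kolmogorov-Smirnov test.
ks_stat, ks_pvalue = stats.ks_2samp(reference, current)
if ks_pvalue < 0.05:
    print(f"Continuous feature drifted (KS statistic={ks_stat:.3f})")

# Categorical feature: compare observed category counts against the counts
# expected from the reference window, scaled to the current volume.
ref_counts = np.array([700, 200, 100])                   # e.g. categories A, B, C
cur_counts = np.array([550, 300, 150])
expected = ref_counts / ref_counts.sum() * cur_counts.sum()
chi2_stat, chi2_pvalue = stats.chisquare(cur_counts, f_exp=expected)
if chi2_pvalue < 0.05:
    print(f"Categorical feature drifted (chi-squared={chi2_stat:.3f})")
```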
| Platform | Key features | Drift detection methods | Best for |
|---|---|---|---|
| Evidently AI | Open-source Python library, real-time dashboards, 20+ statistical tests | Statistical tests, distance metrics, data quality checks | Teams needing flexible, open-source monitoring |
| Arize AI | Real-time monitoring, unified platform, custom dashboards | Data drift, concept drift, performance tracking | Enterprise-scale ML operations |
| WhyLabs | Automated monitoring, data pipeline checks, collaborative tools | Automated drift detection, quality metrics, anomaly detection | Teams seeking automated monitoring |
| IBM Watson OpenScale | Multi-framework support, fairness monitoring, explainability tools | Distribution analysis, bias detection, feature drift | Large enterprises with diverse ML stacks |
| PhiMonitor | Python library, batch processing, customizable metrics | Jensen-Shannon divergence, Wasserstein distance, overfitting detection | Technical teams needing detailed metrics |
| Seldon Core | Kubernetes native, real-time monitoring, Prometheus integration | Data drift, concept drift, performance metrics | Kubernetes-based deployments |
| TFDV | Schema validation, distribution analysis, TensorFlow integration | Statistical validation, distribution tests, schema checks | TensorFlow-based workflows |
| Databricks/MLflow | Experiment tracking, cloud-based, scalable solution | Kolmogorov-Smirnov tests, statistical analysis, performance tracking | Data-intensive operations |
| Neptune.ai | Experiment tracking, visualization tools, collaboration features | Metric tracking, distribution analysis, performance monitoring | Research and development teams |
| Grafana/Prometheus | Custom dashboards, flexible metrics, system monitoring | Custom metric tracking, alert systems, time-series analysis | Teams with existing Prometheus setup |
What to do with data drift
The most practical approach is to trigger an alert and notify the service owner when significant data drift is detected. You can use an orchestration tool to initiate a retraining job with production data. If the distribution change is substantial, building a new model using the updated data may be necessary.
In many cases, the new production data alone may not be sufficient for retraining or starting all over again. You can combine the new data with historical training data to address this. During retraining, you can assign higher weights to features that have experienced significant drift, ensuring the model adapts effectively to the changes.
If you are fortunate enough to have sufficient new production data for the task, you can proceed to build one or more challenger models. They can be deployed and tested using techniques like shadow testing or A/B testing, as in the sketch below.
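Here is a hypothetical sketch of the shadow-testing idea: the champion model keeps serving users, the challenger scores the same requests in the background, and both predictions are logged for offline comparison. Class and method names are illustrative, assuming both models expose a scikit-learn-style predict().

```python
import logging
from dataclasses import dataclass
from typing import Any

logger = logging.getLogger("shadow_test")

@dataclass
class ShadowRouter:
    champion: Any    # current production model
    challenger: Any  # retrained candidate, not yet visible to users

    def predict(self, request_id: str, features: list[float]) -> float:
        champion_pred = self.champion.predict([features])[0]
        try:
            challenger_pred = self.challenger.predict([features])[0]
        except Exception:
            challenger_pred = None               # the shadow path must never break serving
        logger.info("request=%s champion=%s challenger=%s",
                    request_id, champion_pred, challenger_pred)
        return champion_pred                     # only the champion's answer reaches the user
```

The logged pairs can later be joined with ground truth to decide whether the challenger is safe to promote.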
For these deviations, all stat methods mentioned above are helpful. And some more to try:
Concerted adversaries target your system through intentional adversarial attacks, often by introducing misleading examples to cause errors and unreliable results. These attacks, though rare, pose a significant safety risk to machine learning applications in production and require constant monitoring.
Reactions to outliers and adversarial attacks
In critical applications, speed is key. Quickly detecting, analyzing, addressing adversarial threats, retraining the model, and redeploying it can be crucial for business success.
Need more on this part? Dive deeper into model drift theory here.
Detecting model drift
All data drift’s statistical methods are applied here as well. Other specific methods are as follows:
What to do with model/concept drift
Monitoring model output in production is crucial for assessing performance and meeting business KPIs. The key focus is aligning predictions with business metrics to evaluate success.
Model evaluation metrics
Evaluating models in production relies on predefined metrics like accuracy, AUC, precision, or RMSE, depending on the task (classification, regression, etc.). These metrics compare predictions to ground truth labels representing the correct real-world outcomes. For example, in an ad-click prediction model, the ground truth is whether a user actually clicked the ad. Real-time feedback makes this comparison straightforward, but in cases like loan approvals, where outcomes take months or years, a more complex feedback loop is required.
Challenges with ground truth
The model itself can sometimes influence ground truth. For instance, a loan approval model might predict repayment likelihood, but it’s impossible to confirm if rejected applicants would have repaid. This makes ground truth an imperfect measure in some scenarios.
Scoring models with ground truth
When ground truth is available, model predictions are logged alongside actual outcomes. A monitoring system collects this data, linking predictions to ground truth events, and calculates performance metrics like accuracy or RMSE. This process often involves real-time systems, human annotators, or external labeling services for complex tasks.
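A rough sketch of this scoring step, assuming predictions and ground truth events land in two hypothetical tables keyed by a prediction ID:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Logged at inference time by the serving system (illustrative data).
predictions = pd.DataFrame({
    "prediction_id": [1, 2, 3, 4],
    "predicted_click": [1, 0, 1, 1],
})
# Collected later, once the real outcome is known.
ground_truth = pd.DataFrame({
    "prediction_id": [1, 2, 3, 4],
    "actual_click": [1, 0, 0, 1],
})

scored = predictions.merge(ground_truth, on="prediction_id", how="inner")
accuracy = accuracy_score(scored["actual_click"], scored["predicted_click"])
print(f"accuracy over {len(scored)} matched events: {accuracy:.2f}")
```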
Scoring models without ground truth
When ground truth is unavailable or unreliable, prediction drift becomes the performance proxy. Monitoring platforms log the system’s predictions and track their distribution over time. Statistical metrics like Hellinger Distance, Kullback-Leibler Divergence, or Population Stability Index help detect shifts in prediction patterns, ensuring alignment with business KPIs.
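As an illustration of a PSI-style check on prediction drift, here is a minimal sketch; the binning scheme, the synthetic score distributions, and the 0.2 rule of thumb are assumptions, not a fixed standard.

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Compare the bucketed distribution of current scores against a reference window."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    eps = 1e-6                                    # avoids log(0) and division by zero
    ref_share = ref_counts / ref_counts.sum() + eps
    cur_share = cur_counts / cur_counts.sum() + eps
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))

rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 5, size=10_000)    # prediction scores at deployment time
current_scores = rng.beta(2, 3, size=10_000)      # today's prediction scores
psi = population_stability_index(reference_scores, current_scores)
print(f"PSI={psi:.3f}")   # values above ~0.2 are often treated as significant drift
```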
System performance monitoring for ML models in production
Monitoring system performance answers critical questions: Is uptime sufficient? Are requests processed quickly? Are resources optimized? Can the system handle code changes and scale effectively? Identifying limitations is crucial for improvement.
System performance impacts model efficiency: high latency in predictions slows the entire system. Key metrics include prediction latency, throughput, error rates, and resource utilization.
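As a minimal sketch of exporting one such metric from a Python prediction service, assuming the prometheus_client library is available (the metric and function names are illustrative):

```python
from prometheus_client import Histogram, start_http_server

PREDICTION_LATENCY = Histogram(
    "model_prediction_latency_seconds",
    "Time spent producing a single model prediction",
)

@PREDICTION_LATENCY.time()      # records the duration of every call as a histogram sample
def predict(features):
    # Call the actual model here; a constant stands in for the real prediction.
    return 0.0

if __name__ == "__main__":
    start_http_server(8000)     # exposes /metrics for Prometheus to scrape
    predict([1.0, 2.0, 3.0])
```

Grafana can then chart these histograms and alert on latency percentiles, in line with the tools mentioned earlier.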
Infrastructure reliability underpins ML success. Monitor cluster uptime, machine status, and request distribution across prediction services. While not a primary focus for ML engineers, understanding system reliability enhances overall performance.
Healthy pipelines are vital. Data pipeline issues degrade quality, while model pipeline failures disrupt retraining and deployment. Collaboration with DataOps ensures alignment between model expectations and pipeline outputs.
Data pipeline metrics: ingestion consistency, processing delays, and integration bottlenecks.
Model pipeline metrics: retraining and deployment health.
Cost and SLAs
Track hosting costs, including storage, compute, and inference expenses. Cloud providers like AWS and Google Cloud offer tools for budget tracking and alerts. For on-premise systems, analyze cost-heavy components to optimize spending. Monitor service-level agreements (SLAs) to ensure performance thresholds are met.
Monitoring vs. Observability
Monitoring collects metrics, detects issues, and triggers alerts. Observability connects these metrics to identify root causes and improve system quality. While monitoring gathers data, observability provides actionable insights.
Depending on the maturity level of your ML infrastructure, you’ll be interested in monitoring different metrics at different “depths” of your ecosystem.
When rolling out a new model version, log predictions from both the current and the new model to compare their performance side by side before fully deploying the new one.
Aside from the examples mentioned above, we’d like to look beyond the corporate giants to show that monitoring ML models is accessible to market players of different sizes.
DoorDash uses machine learning to optimize delivery times and predict demand. Their official engineering blog described how they detected performance degradation in their models due to unexpected changes in customer ordering patterns and external factors like weather.
| Approach | Expectations | Adoption | Parity between training and production |
|---|---|---|---|
| Unit test | Pass/fail status | Opt-in | Assumes training data will match production data |
| Monitoring | Trends distribution | Out-of-the-box | Does not assume that training data will match production data |
They identified issues early by monitoring key metrics such as prediction accuracy and delivery time variance, and retrained their models to adapt, preventing significant disruptions in their logistics operations.
Lyft employs machine learning to optimize ride matching and pricing. In their engineering blog, Lyft described how they use real-time monitoring to track model performance metrics like latency and accuracy.
When they detected performance degradation due to external factors like seasonal demand changes, they retrained their models to ensure reliability and efficiency in their services.
Capital One employs machine learning models to detect and prevent fraudulent transactions. In their official engineering blog, they describe how they monitor these models for data drift, concept drift, and performance degradation.
By implementing real-time monitoring and automated retraining pipelines, they ensure their fraud detection systems remain accurate and adaptive to new fraud patterns, preventing financial losses and maintaining customer trust.
Keep your model eternally young with full-fledged monitoring and actionable data insights!