Main Concept
MLOps (Machine Learning Operations) is the set of practices that make machine learning models reliable, repeatable, and maintainable in production. It applies the discipline of software operations (DevOps) to the ML lifecycle — ensuring that models can be deployed, monitored, and improved continuously without manual intervention at every step.
Context
Building a model that works in a notebook is very different from running that model reliably in production for millions of users. MLOps bridges that gap. Without MLOps practices, ML projects tend to be fragile, hard to reproduce, and slow to improve.
Key Idea
MLOps is not about building models — it is about keeping models working reliably over time.
Analogy: A car vs a car with a maintenance plan
Building an ML model is like buying a car. It works great on day one. MLOps is the maintenance plan — oil changes, tire rotations, diagnostic checks — that keeps the car running well for years. Without it, the car gradually degrades until it breaks down at the worst possible moment.
The Core MLOps Concepts (Exam Guide)
Experimentation
Tracking and comparing different model versions, hyperparameter combinations, and dataset variations to identify what works best. Without experimentation tracking, you cannot reproduce a good result or understand why one model outperformed another.
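To make the idea concrete, here is a minimal sketch of what experiment tracking records. It assumes a hypothetical in-memory `ExperimentLog` class and made-up run names, parameters, and scores. It is not the SageMaker Experiments API, only an illustration of logging parameters, dataset version, and metrics together so that the best run can be found and reproduced.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One training run: what was tried and how it scored."""
    name: str
    params: dict
    dataset_version: str
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory tracker; a real one would persist its runs."""

    def __init__(self):
        self.runs = []

    def log(self, run):
        self.runs.append(run)

    def best(self, metric):
        # Pick the run with the highest value for the given metric.
        return max(self.runs, key=lambda r: r.metrics[metric])


# Illustrative values only, not real results.
log = ExperimentLog()
log.log(Run("run-a", {"lr": 0.1, "depth": 3}, "v1", {"accuracy": 0.88}))
log.log(Run("run-b", {"lr": 0.01, "depth": 5}, "v1", {"accuracy": 0.91}))
log.log(Run("run-c", {"lr": 0.01, "depth": 5}, "v2", {"accuracy": 0.93}))

best_run = log.best("accuracy")
print(best_run.name, best_run.params, best_run.dataset_version)
```

Because `run-b` and `run-c` share hyperparameters and differ only in dataset version, the log also shows why one outperformed the other, which is the point of tracking all three dimensions.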
Repeatable Processes
Automating the ML pipeline so that training, evaluation, and deployment can be reproduced reliably — not just once by one person. If your process only works on one specific laptop with one specific setup, it is not production-ready.
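One way to picture a repeatable process is a pipeline where every step is a plain function of its inputs and all randomness comes from an explicit seed. The sketch below assumes a toy dataset and a simple least-squares line fit; the function names are hypothetical and stand in for real training and evaluation steps.

```python
import random


def make_data(seed, n=200):
    """Generate a toy dataset deterministically from a seed."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


def train(xs, ys):
    """Fit y = a*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


def evaluate(model, xs, ys):
    """Mean absolute error of the fitted line on the given data."""
    a, b = model
    return sum(abs((a * x + b) - y) for x, y in zip(xs, ys)) / len(xs)


def run_pipeline(seed):
    """Data, training, and evaluation chained with no hidden state."""
    xs, ys = make_data(seed)
    model = train(xs, ys)
    return {"model": model, "mae": evaluate(model, xs, ys)}


first = run_pipeline(seed=42)
second = run_pipeline(seed=42)
print(first == second)  # same seed and same code give the same result
```

Anyone who runs `run_pipeline(seed=42)` with the same code gets the same model and the same score, which is the property a one-off notebook session usually lacks.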
Scalable Systems
Building infrastructure that handles increasing data volumes and prediction requests without manual intervention. A model that works for 100 users needs to scale to 1,000,000 users without rebuilding everything.
Managing Technical Debt
Avoiding shortcuts that make the system fragile over time. In ML, technical debt accumulates when models are deployed without proper monitoring, documentation, or retraining pipelines — small problems compound until the system fails.
Achieving Production Readiness
Ensuring the model meets quality, performance, and reliability standards before deployment. A model that achieves 95% accuracy in a notebook is not automatically ready for production — it needs validation, load testing, and fallback plans.
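A production-readiness check can be thought of as a gate that a candidate model must pass before deployment. The following is a minimal sketch under assumed requirements: the threshold values, dictionary keys, and the `readiness_report` function are all hypothetical, chosen only to mirror the validation, load testing, and fallback items mentioned above.

```python
def readiness_report(candidate, requirements):
    """Return the list of failed checks; an empty list means ready."""
    failures = []
    if candidate["accuracy"] < requirements["min_accuracy"]:
        failures.append("accuracy below threshold")
    if candidate["p95_latency_ms"] > requirements["max_p95_latency_ms"]:
        failures.append("latency above threshold")
    if not candidate["has_fallback"]:
        failures.append("no fallback plan")
    return failures


# Assumed requirements and an assumed candidate, for illustration only.
requirements = {"min_accuracy": 0.90, "max_p95_latency_ms": 200}
candidate = {"accuracy": 0.95, "p95_latency_ms": 340, "has_fallback": False}

failures = readiness_report(candidate, requirements)
print(failures)
```

Here the candidate clears the accuracy bar but still fails the gate on latency and on the missing fallback, which illustrates why notebook accuracy alone does not make a model production-ready.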
Model Monitoring
Continuously tracking model performance in production to detect degradation, drift, or unexpected behavior early. Monitoring is what transforms a one-time deployment into a living, maintained system.
Key Idea: Why monitoring is critical
Without monitoring → model degrades silently, users notice bad predictions before you do.
With monitoring → you detect drift and quality issues early and fix them before users are impacted.
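As an illustration of what "detecting drift" can mean in practice, the sketch below compares the distribution of one input feature at training time against the same feature in production, using the Population Stability Index (PSI). This is one common drift statistic, not necessarily what SageMaker Model Monitor computes. The sample values, bin edges, and the 0.2 alert threshold are assumptions for the example.

```python
import math
from bisect import bisect_right


def psi(baseline, current, edges):
    """Population Stability Index of `current` vs `baseline` over fixed bins."""
    n_bins = len(edges) - 1

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Locate the bin; values outside the edges fall into the end bins.
            idx = min(max(bisect_right(edges, v) - 1, 0), n_bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Assumed feature values (for example, customer age), for illustration only.
edges = [0, 30, 50, 100]
training_ages = [20, 25, 30, 35, 40, 45, 50, 55]
production_ages = [40, 45, 50, 55, 60, 65, 70, 75]

score = psi(training_ages, production_ages, edges)
ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff, tune per use case
print(score > ALERT_THRESHOLD)
```

An identical distribution gives a PSI of zero, and the score grows as production data moves away from the training data, so an alarm on this value flags drift before prediction quality visibly drops.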
Model Retraining
Periodically updating the model with fresh data to keep it accurate as the real world evolves. Retraining is the response to model drift — the mechanism that closes the loop between production predictions and improved future models.
Analogy: A weather forecaster
A weather forecaster does not use last year’s data to predict today’s weather. They continuously incorporate the latest observations to keep their predictions accurate. MLOps does the same for ML models — keeps them fed with fresh data so they stay relevant.
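The retraining idea can be sketched as a simple trigger: measure error on recent production data, and refit only when it crosses a threshold. Everything here is a toy assumption: the "model" is just the mean of its training data, and the batches and the threshold of 2.0 are invented for illustration.

```python
def train(values):
    """Toy 'model': predicts the mean of its training data."""
    return sum(values) / len(values)


def mean_abs_error(model, values):
    """Average absolute gap between the prediction and observed values."""
    return sum(abs(model - v) for v in values) / len(values)


def maybe_retrain(model, recent, threshold):
    """Retrain on recent data only when monitored error exceeds threshold."""
    if mean_abs_error(model, recent) > threshold:
        return train(recent), True
    return model, False


# Deployed model, fit on older data.
model = train([10, 11, 9, 10])

# Assumed production batches over time; the second one reflects a shift.
batches = [[10, 9, 11], [14, 15, 16], [15, 14, 16]]

history = []
for batch in batches:
    model, retrained = maybe_retrain(model, batch, threshold=2.0)
    history.append(retrained)

print(history, model)
```

The model is left alone while its error stays low, refit once the data shifts, and then left alone again because the refreshed model matches the new conditions.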
The MLOps Feedback Loop
Deploy model
↓
Monitor performance in production
↓
Detect drift or degradation
↓
Collect new production data
↓
Retrain model with fresh data
↓
Evaluate → deploy updated model
↓
(repeat continuously)
AWS Services for MLOps
| Service | Role |
|---|---|
| Amazon SageMaker | End-to-end ML platform — training, tuning, deployment |
| SageMaker Model Monitor | Automated monitoring for drift and data quality |
| SageMaker Pipelines | Automates and orchestrates the ML workflow |
| SageMaker Experiments | Tracks and compares model versions and runs |
| Amazon CloudWatch | Operational metrics, alarms, and logging |
| Amazon SageMaker Clarify | Bias detection and model explainability |
Exam Scope
MLOps appears explicitly in Domain 1, Task Statement 1.3. You are not expected to implement MLOps pipelines; you only need to understand the concepts and identify the relevant AWS services. Focus on the seven core concepts listed in the exam guide: experimentation, repeatable processes, scalable systems, managing technical debt, production readiness, model monitoring, and model retraining.
Exam Domain
- Domain 1, Task Statement 1.3: “Understand fundamental concepts of ML operations (MLOps) (for example, experimentation, repeatable processes, scalable systems, managing technical debt, achieving production readiness, model monitoring, model re-training).”
- Domain 1, Task Statement 1.3: “Identify relevant AWS services for each stage of an ML pipeline.”