Main Concept

MLOps (Machine Learning Operations) is the set of practices that make machine learning models reliable, repeatable, and maintainable in production. It applies the discipline of software operations (DevOps) to the ML lifecycle — ensuring that models can be deployed, monitored, and improved continuously without manual intervention at every step.

Context

Building a model that works in a notebook is very different from running that model reliably in production for millions of users. MLOps bridges that gap. Without MLOps practices, ML projects tend to be fragile, hard to reproduce, and slow to improve.

Key Idea

MLOps is not about building models — it is about keeping models working reliably over time.

Analogy: A car vs a car with a maintenance plan

Building an ML model is like buying a car. It works great on day one. MLOps is the maintenance plan — oil changes, tire rotations, diagnostic checks — that keeps the car running well for years. Without it, the car gradually degrades until it breaks down at the worst possible moment.

The Core MLOps Concepts (Exam Guide)

Experimentation

Tracking and comparing different model versions, hyperparameter combinations, and dataset variations to identify what works best. Without experimentation tracking, you cannot reproduce a good result or understand why one model outperformed another.
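As a rough illustration of what experiment tracking records, the sketch below logs each run's hyperparameters, dataset version, and metrics, then looks up the best run. The `Run` and `ExperimentLog` names, the parameter values, and the accuracy figures are all hypothetical; this is not the API of SageMaker Experiments or any other tool.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One training run: which inputs were used and what score resulted."""
    run_id: int
    params: dict
    dataset_version: str
    metrics: dict


@dataclass
class ExperimentLog:
    """Hypothetical in-memory tracker; a real setup would persist its runs."""
    runs: list = field(default_factory=list)

    def log_run(self, params, dataset_version, metrics):
        run = Run(len(self.runs) + 1, dict(params), dataset_version, dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric):
        # Highest value wins; assumes the metric is "higher is better".
        return max(self.runs, key=lambda r: r.metrics[metric])


log = ExperimentLog()
# Illustrative numbers only, not real results.
log.log_run({"learning_rate": 0.1, "depth": 3}, "data-v1", {"accuracy": 0.88})
log.log_run({"learning_rate": 0.01, "depth": 5}, "data-v1", {"accuracy": 0.91})
log.log_run({"learning_rate": 0.01, "depth": 5}, "data-v2", {"accuracy": 0.93})

best_run = log.best("accuracy")
print(best_run.run_id, best_run.params, best_run.dataset_version)
```

Because runs 2 and 3 share hyperparameters and differ only in dataset version, the log shows that the improvement came from the data and not from tuning.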

Repeatable Processes

Automating the ML pipeline so that training, evaluation, and deployment can be reproduced reliably — not just once by one person. If your process only works on one specific laptop with one specific setup, it is not production-ready.
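One way to picture a repeatable process: every setting lives in a single configuration, randomness comes from a seeded generator, and the whole pipeline is one function call. The sketch below makes those assumptions; the synthetic data, the `CONFIG` values, and the function names are invented for illustration.

```python
import hashlib
import json
import random

# Every setting that affects the result lives here (hypothetical values).
CONFIG = {"seed": 42, "n_samples": 200, "train_fraction": 0.8}


def config_fingerprint(config):
    """Short stable hash, so a result can be tied to its exact settings."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_pipeline(config):
    rng = random.Random(config["seed"])  # seeded, local generator: no hidden state
    # Step 1: "load" data (synthetic here): y = 2x + noise.
    xs = [rng.uniform(0, 10) for _ in range(config["n_samples"])]
    data = [(x, 2 * x + rng.gauss(0, 1)) for x in xs]
    # Step 2: split into training and held-out sets.
    rng.shuffle(data)
    cut = int(len(data) * config["train_fraction"])
    train, test = data[:cut], data[cut:]
    # Step 3: train a least-squares slope through the origin.
    slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    # Step 4: evaluate with mean absolute error on the held-out set.
    mae = sum(abs(y - slope * x) for x, y in test) / len(test)
    return {"config": config_fingerprint(config), "slope": slope, "mae": mae}


first = run_pipeline(CONFIG)
second = run_pipeline(CONFIG)
print(first == second)  # identical results when nothing depends on hidden state
```

Changing any value in `CONFIG` changes the fingerprint, so a stored result always records which settings produced it.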

Scalable Systems

Building infrastructure that handles increasing data volumes and prediction requests without manual intervention. A model that works for 100 users needs to scale to 1,000,000 users without rebuilding everything.

Managing Technical Debt

Avoiding shortcuts that make the system fragile over time. In ML, technical debt accumulates when models are deployed without proper monitoring, documentation, or retraining pipelines — small problems compound until the system fails.

Achieving Production Readiness

Ensuring the model meets quality, performance, and reliability standards before deployment. A model that achieves 95% accuracy in a notebook is not automatically ready for production — it needs validation, load testing, and fallback plans.
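A readiness check can be thought of as a gate that a candidate model must pass before deployment. The sketch below is one possible shape for such a gate; the check names, threshold values, and candidate figures are assumptions chosen for illustration.

```python
def readiness_report(candidate, thresholds):
    """Return a dict of check name -> passed (bool) for a candidate model."""
    return {
        "accuracy": candidate["holdout_accuracy"] >= thresholds["min_accuracy"],
        "latency": candidate["p95_latency_ms"] <= thresholds["max_p95_latency_ms"],
        "load_test": candidate["load_test_error_rate"] <= thresholds["max_error_rate"],
        "fallback": candidate["has_fallback"],
    }


# Hypothetical standards a team might agree on before any deployment.
THRESHOLDS = {"min_accuracy": 0.90, "max_p95_latency_ms": 200, "max_error_rate": 0.01}

# Hypothetical candidate: accurate in the notebook, but slow under load.
candidate = {
    "holdout_accuracy": 0.95,
    "p95_latency_ms": 340,
    "load_test_error_rate": 0.002,
    "has_fallback": True,
}

report = readiness_report(candidate, THRESHOLDS)
ready = all(report.values())
failed = [name for name, ok in report.items() if not ok]
print(ready, failed)
```

In this example the model clears the accuracy bar yet is still blocked, because the latency check fails.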

Model Monitoring

Continuously tracking model performance in production to detect degradation, drift, or unexpected behavior early. Monitoring is what transforms a one-time deployment into a living, maintained system.

Key Idea: Why monitoring is critical

  • Without monitoring → model degrades silently, users notice bad predictions before you do.

  • With monitoring → you detect drift and quality issues early and fix them before users are impacted.
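To make drift detection concrete, the sketch below compares the distribution of one numeric feature at training time with its distribution in production using the Population Stability Index (PSI). This is one common statistic, chosen here for illustration; it is not a description of how SageMaker Model Monitor works internally. The data is synthetic, and the 0.1 and 0.25 cut-offs are a frequently cited rule of thumb, treated here as an assumption.

```python
import math
import random


def psi(baseline, production, n_bins=10):
    """Population Stability Index between two samples of one numeric feature.

    Assumes the baseline sample is not constant (bin width must be > 0).
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the baseline range are clamped into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = bin_shares(baseline), bin_shares(production)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


rng = random.Random(0)
baseline = [rng.gauss(50, 10) for _ in range(5000)]  # feature at training time
same = [rng.gauss(50, 10) for _ in range(5000)]      # production, no drift
shifted = [rng.gauss(58, 10) for _ in range(5000)]   # production, mean has moved

stable_score = psi(baseline, same)
drift_score = psi(baseline, shifted)
# Rule of thumb (assumption): below 0.1 is stable, above 0.25 is significant drift.
print(round(stable_score, 3), round(drift_score, 3))
```

A monitoring job would run a check like this on a schedule and raise an alert when the score crosses the chosen threshold, well before users start reporting bad predictions.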

Model Retraining

Periodically updating the model with fresh data to keep it accurate as the real world evolves. Retraining is the response to model drift — the mechanism that closes the loop between production predictions and improved future models.

Analogy: A weather forecaster

A weather forecaster does not use last year’s data to predict today’s weather. They continuously incorporate the latest observations to keep their predictions accurate. MLOps does the same for ML models — keeps them fed with fresh data so they stay relevant.

The MLOps Feedback Loop

  1. Deploy model
  2. Monitor performance in production
  3. Detect drift or degradation
  4. Collect new production data
  5. Retrain model with fresh data
  6. Evaluate → deploy updated model

(repeat continuously)
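The loop above can be sketched as a small simulation. Everything here is a toy assumption: the "model" is just the mean of recent values, the real-world quantity rises by 0.5 per day, and the window size and error threshold are arbitrary. The point is only to show the deploy, monitor, detect, retrain, evaluate cycle running unattended.

```python
import random

rng = random.Random(7)


def train(values):
    """'Model' = the mean of the training values (a stand-in for real training)."""
    return sum(values) / len(values)


def true_signal(day):
    # Assumed world: the quantity being predicted rises by 0.5 per day, plus noise.
    return 100 + 0.5 * day + rng.gauss(0, 1)


def mean_abs_error(model, values):
    return sum(abs(v - model) for v in values) / len(values)


WINDOW = 7        # days of production data used per monitoring check
MAX_ERROR = 5.0   # hypothetical alert threshold on mean absolute error

model = train([true_signal(d) for d in range(-WINDOW, 0)])  # deploy initial model
recent = []
retrain_days = []

for day in range(60):
    recent.append(true_signal(day))            # collect new production data
    recent = recent[-WINDOW:]
    error = mean_abs_error(model, recent)      # monitor performance
    if len(recent) == WINDOW and error > MAX_ERROR:   # detect degradation
        candidate = train(recent)              # retrain with fresh data
        # Evaluate: for brevity this reuses the training window; a real
        # pipeline would score the candidate on separate held-out data.
        if mean_abs_error(candidate, recent) < error:
            model = candidate                  # deploy updated model
            retrain_days.append(day)

print(retrain_days)
```

Because the underlying signal keeps moving, the model is retrained several times over the 60 simulated days, each time after the monitored error crosses the threshold.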

AWS Services for MLOps

Service | Role
Amazon SageMaker | End-to-end ML platform: training, tuning, deployment
SageMaker Model Monitor | Automated monitoring for drift and data quality
SageMaker Pipelines | Automates and orchestrates the ML workflow
SageMaker Experiments | Tracks and compares model versions and runs
Amazon CloudWatch | Operational metrics, alarms, and logging
Amazon SageMaker Clarify | Bias detection and model explainability

Exam Scope

MLOps appears explicitly in Domain 1, Task Statement 1.3. You are not expected to implement MLOps pipelines, only to understand the concepts and identify the relevant AWS services. Focus on the seven core concepts listed in the exam guide: experimentation, repeatable processes, scalable systems, managing technical debt, production readiness, model monitoring, and model retraining.

Exam Domain

  • Domain 1, Task Statement 1.3: “Understand fundamental concepts of ML operations (MLOps) (for example, experimentation, repeatable processes, scalable systems, managing technical debt, achieving production readiness, model monitoring, model re-training).”
  • Domain 1, Task Statement 1.3: “Identify relevant AWS services for each stage of an ML pipeline.”