Main Concept

Exploratory data analysis (EDA) is the phase of the ML lifecycle where data scientists explore and understand the raw data before building any model. The goal is to understand the shape, quality, distribution, and relationships within the data in order to make informed decisions about feature engineering and model selection.

Context

EDA happens early in the ML project lifecycle — after data collection and before feature engineering and model training. The insights from EDA directly inform which features to use, which to discard, and whether the data is sufficient for the intended ML task.

Key Activities

Statistical Analysis

  • Compute basic statistics: mean, median, min, max, standard deviation.
  • Identify missing values, outliers, and data quality issues.
  • Understand class distribution — are the classes balanced or imbalanced?
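The checks above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed workflow: the `rows` dataset, the `summarize` and `class_distribution` helpers, and the column meanings are all hypothetical, and only the standard library is used so the example runs anywhere. In practice a dataframe library would typically do the same work.

```python
import statistics
from collections import Counter

# Hypothetical toy dataset: each row is (age, label); None marks a missing value.
rows = [
    (34, "no"), (29, "no"), (None, "yes"), (41, "no"),
    (38, "no"), (120, "no"), (31, "yes"), (27, "no"),
]

def summarize(values):
    """Return basic statistics for a numeric column, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
    }

def class_distribution(labels):
    """Return the share of each class label as a fraction of all rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

age_stats = summarize([age for age, _ in rows])
label_shares = class_distribution([label for _, label in rows])

print(age_stats)
print(label_shares)
```

In this toy data the mean age (about 45.7) sits well above the median (34), which hints at the outlier value 120. One age is missing, and the 75/25 label split shows the classes are imbalanced.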

Data Visualization

  • Plot distributions to understand the shape of the data.
  • Identify patterns, trends, and anomalies visually.
  • Common charts: histograms, scatter plots, box plots.
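To make the charts above more concrete, the sketch below computes the numbers that a histogram and a box plot are drawn from. It is an illustration under stated assumptions: the `values` column and the `histogram_counts` and `iqr_outliers` helpers are hypothetical, and the 1.5 × IQR rule is the common box-plot whisker convention rather than the only possible outlier definition. A plotting library would normally render these as charts; here the histogram is printed as text so the example needs only the standard library.

```python
import statistics

# Hypothetical numeric column with one suspiciously large value.
values = [27, 29, 31, 34, 38, 41, 120]
BIN_WIDTH = 20

def histogram_counts(data, bin_width):
    """Count values per fixed-width bin, keyed by each bin's lower edge."""
    counts = {}
    for v in data:
        lower = (v // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

def iqr_outliers(data):
    """Flag values more than 1.5 * IQR outside the quartiles (box-plot whisker rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in data if v < low or v > high]

hist = histogram_counts(values, BIN_WIDTH)
outliers = iqr_outliers(values)

# Text rendering of the histogram: one '#' per observation in the bin.
for lower, n in hist.items():
    print(f"{lower:>3}-{lower + BIN_WIDTH - 1:<3} {'#' * n}")
print("outliers:", outliers)
```

Most values fall in the 20-39 bin, with a single isolated observation in the 120-139 bin. The same observation is flagged by the whisker rule, which is the kind of anomaly a histogram or box plot makes visible at a glance.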

Why EDA Matters for Model Quality

Tip

Poor EDA → wrong features → poor model performance
Good EDA → right features → better model performance

EDA directly feeds into feature engineering — you cannot engineer good features from data you do not understand.

AWS Services

| Service | Role in EDA |
| --- | --- |
| Amazon SageMaker Data Wrangler | Visual data exploration, transformation, and analysis |
| Amazon QuickSight | Business intelligence and data visualization |
| AWS Glue DataBrew | Visual data preparation and profiling |
| Amazon S3 | Central storage for raw datasets |

Exam Domain

  • Domain 1, Task Statement 1.3: “Describe components of an ML pipeline (for example, data collection, exploratory data analysis [EDA], data pre-processing, feature engineering…).”
  • Domain 1, Task Statement 1.3: “Identify relevant AWS services and features for each stage of an ML pipeline (for example, SageMaker, Amazon SageMaker Data Wrangler…).”

References