Main Concept

Training data is the foundation of any ML or AI model. The quality and characteristics of training data directly determine model performance. As the saying goes: garbage in, garbage out — if your training data is poor, your model will be poor.

Training data is arguably the most critical stage in building a good model. Before choosing algorithms or frameworks, you must first understand what data you have and what you’re trying to predict.

Data Types by Structure

Labeled Data (Supervised Learning)

  • Data with known correct answers/labels
  • Example: photos of cats labeled “cat”, photos of dogs labeled “dog”
  • Used for: Classification, regression
  • More expensive to obtain (requires human labeling)
  • Use case: Supervised Learning, where the model is trained to map inputs to known outputs

Unlabeled Data (Unsupervised Learning)

  • Data without pre-assigned labels
  • Example: raw customer purchase histories
  • Used for: Clustering, dimensionality reduction, anomaly detection
  • Cheaper to obtain but harder to use
  • Use case: Unsupervised Learning where the model tries to find patterns or structure the data

Structured Data

  • Organized in rows and columns (databases, spreadsheets)
  • Example: customer age, income, purchase history
  • Machine learning algorithms work well with this
  • Traditional ML (SVM, XGBoost) excels here
  • One type of structure data is Time Series Data, data points collected or recorded at successive point times.

Unstructured Data

  • No predefined format (text, images, audio, video)
  • Example: customer reviews, photos, videos
  • Requires special handling and deep learning
  • GenAI models work with this

Data Characteristics That Matter (AIF-C01)

CharacteristicImpact
SizeMore data generally = better performance (but quality > quantity)
DiversityRepresentative of real-world scenarios reduces bias
BalanceUnbalanced classes (e.g., 99% class A, 1% class B) skews predictions
CleanlinessMissing values, outliers, errors degrade model performance
RecencyStale data may not reflect current patterns
BiasBiased training data produces biased models

The Data Preparation Pipeline

Training data must be prepared before it can be used. See Data Preparation for the specific steps: curation, labeling, quality checks, and bias assessment.

AIF-C01 Exam Relevance

The exam explicitly asks about:

  • Identifying data types (labeled/unlabeled, structured/unstructured)
  • Understanding how data characteristics affect model selection
  • Recognizing when data is insufficient, biased, or unrepresentative
  • Distinguishing between pre-training data and fine-tuning data

Exam tip: A question about “why did my model fail?” often comes down to data quality, not algorithm choice.