20260420T2331-training-data

Main Concept

Training data is the foundation of any ML or AI model. The quality and characteristics of training data directly determine model performance. As the saying goes: garbage in, garbage out — if your training data is poor, your model will be poor.

Training data is arguably the most critical stage in building a good model. Before choosing algorithms or frameworks, you must first understand what data you have and what you’re trying to predict.

Data Types by Structure

Labeled Data (Supervised Learning)

Data with known correct answers/labels
Example: photos of cats labeled “cat”, photos of dogs labeled “dog”
Used for: Classification, regression
More expensive to obtain (requires human labeling)
Use case: Supervised Learning, where the model is trained to map inputs to known outputs

Unlabeled Data (Unsupervised Learning)

Data without pre-assigned labels
Example: raw customer purchase histories
Used for: Clustering, dimensionality reduction, anomaly detection
Cheaper to obtain but harder to use
Use case: Unsupervised Learning where the model tries to find patterns or structure the data

Structured Data

Organized in rows and columns (databases, spreadsheets)
Example: customer age, income, purchase history
Machine learning algorithms work well with this
Traditional ML (SVM, XGBoost) excels here
One type of structure data is Time Series Data, data points collected or recorded at successive point times.

Unstructured Data

No predefined format (text, images, audio, video)
Example: customer reviews, photos, videos
Requires special handling and deep learning
GenAI models work with this

Data Characteristics That Matter (AIF-C01)

Characteristic	Impact
Size	More data generally = better performance (but quality > quantity)
Diversity	Representative of real-world scenarios reduces bias
Balance	Unbalanced classes (e.g., 99% class A, 1% class B) skews predictions
Cleanliness	Missing values, outliers, errors degrade model performance
Recency	Stale data may not reflect current patterns
Bias	Biased training data produces biased models

The Data Preparation Pipeline

Training data must be prepared before it can be used. See Data Preparation for the specific steps: curation, labeling, quality checks, and bias assessment.

AIF-C01 Exam Relevance

The exam explicitly asks about:

Identifying data types (labeled/unlabeled, structured/unstructured)
Understanding how data characteristics affect model selection
Recognizing when data is insufficient, biased, or unrepresentative
Distinguishing between pre-training data and fine-tuning data

Exam tip: A question about “why did my model fail?” often comes down to data quality, not algorithm choice.

🌿💻 The Packets Garden

Explorer

20260420T2331-training-data

Main Concept