Main Concept

Feature engineering is the process (or art) of using domain knowledge to identify, within raw data, which variables, or which combinations of variables, best represent the phenomena that are relevant to the problem at hand. The goal is to provide the model with features that carry real, meaningful information, so it can find significant patterns and make better predictions from new input data.

Context

Feature engineering is a core step in the ML development lifecycle, sitting between raw data collection and model training. The quality of features directly determines a model's ability to learn: no algorithm compensates for poorly represented data.

Domain knowledge is essential: an engineer who understands the domain (e.g., automotive engines, clinical medicine, financial markets) knows which variables and combinations reveal meaningful signals, without which the model only sees noise.

In deep learning, neural networks can learn internal features automatically from raw data (the basis of embeddings), which reduces but does not eliminate the need for thoughtful feature design.

Core Techniques

| Technique | Description | Example |
|---|---|---|
| Feature Creation | Derive new variables from existing ones | price / sqft from price and area |
| Feature Selection | Keep only the variables most predictive | Drop color when predicting house price |
| Feature Transformation | Rescale or encode data for algorithm compatibility | Normalize numerical values; encode categories |
| Feature Extraction | Pull meaningful information from raw data | Derive age from birth_date |
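
A minimal sketch of three of the techniques in the table (creation, extraction, selection), using only the Python standard library. The record layout and the helper names (`price_per_sqft`, `age_in_years`, `select_features`) are assumptions made for illustration, not part of any particular library.

```python
from datetime import date


def price_per_sqft(price, sqft):
    """Feature creation: combine two existing variables into a ratio."""
    return price / sqft


def age_in_years(birth_date, today):
    """Feature extraction: derive age in whole years from a raw birth_date."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def select_features(row, keep):
    """Feature selection: keep only the named columns of a record."""
    return {name: row[name] for name in keep}


# Hypothetical listing record; the field names are illustrative only.
listing = {"price": 300000, "sqft": 1500, "bedrooms": 3, "color": "blue"}
listing["price_per_sqft"] = price_per_sqft(listing["price"], listing["sqft"])

# Drop color (and the raw price) and keep the remaining columns.
# Caveat: a ratio built from price should not be a model input
# when price itself is the prediction target.
selected = select_features(listing, keep=["sqft", "bedrooms", "price_per_sqft"])

print(selected)
print(age_in_years(date(1990, 8, 20), today=date(2024, 8, 19)))
```

The reference date is passed in explicitly (`today`) so the age calculation stays reproducible instead of depending on when the code runs.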

Structured vs. Unstructured Data

Structured (Tabular)

  • Input: spreadsheet with rows and columns
  • Tasks: create ratios, select relevant columns, normalize scales
  • Example: predicting house prices using size, location, and bedrooms; dropping color, keeping price_per_sqft
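
The "normalize scales" task, together with category encoding, can be sketched as follows. This is one common choice (min-max scaling and one-hot encoding) written in plain Python; the function names and the sample values are assumptions for illustration.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode categories as 0/1 indicator columns, one per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


# Hypothetical column values for three houses.
sizes = [1000, 1500, 2000]
locations = ["suburb", "city", "suburb"]

scaled = min_max_scale(sizes)
encoded, columns = one_hot(locations)

print(scaled)   # [0.0, 0.5, 1.0]
print(columns)  # ['city', 'suburb']
print(encoded)  # [[0, 1], [1, 0], [0, 1]]
```

In practice the minimum, maximum, and category list would be computed on the training data and reused unchanged at inference time.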

Unstructured (Text, Images)

  • Input: raw text or pixels
  • Tasks: convert to numerical representations the model can process
  • Example: customer review → TF-IDF vectors or word embeddings; image → edge features via CNNs
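
To make the text case concrete, here is a rough sketch of turning reviews into TF-IDF vectors. It assumes the basic textbook definition (term frequency times log(N / document frequency)) and whitespace tokenization; production libraries typically use smoothed or normalized variants, so their numbers will differ. The function name `tf_idf` and the sample reviews are illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, vectors), with one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    vectors = []
    for tokens in tokenized:
        if not tokens:
            # An empty document has no terms; represent it as all zeros.
            vectors.append([0.0] * len(vocabulary))
            continue
        counts = Counter(tokens)
        vectors.append([
            (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical customer reviews.
reviews = ["great product great price", "terrible product"]
vocab, vectors = tf_idf(reviews)

print(vocab)
for vector in vectors:
    print([round(weight, 3) for weight in vector])
```

A term that appears in every document ("product" here) gets a weight of zero, which is the point of the IDF factor: words common to all documents do not help tell them apart.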

Example

A spam classifier might use these features extracted from raw emails:

  • Number of words in ALL CAPS
  • Contains the word "free"
  • Sender is unknown

None of these exist as a column in raw email data; they are engineered from it.
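
The three features above could be computed along these lines. This is a sketch under stated assumptions: the email is already split into subject, body, and sender; "unknown sender" means the address is absent from a caller-supplied set of lowercase addresses; and single-letter words such as "I" are not counted as ALL CAPS. The function name and sample email are hypothetical.

```python
def spam_features(subject, body, sender, known_senders):
    """Engineer three simple features from the raw fields of one email."""
    words = f"{subject} {body}".split()
    # Strip surrounding punctuation so "FREE!!!" is treated as "FREE".
    cleaned = [w.strip(".,!?;:\"'()") for w in words]
    return {
        "all_caps_word_count": sum(
            1 for w in cleaned if len(w) > 1 and w.isalpha() and w.isupper()
        ),
        "contains_free": any(w.lower() == "free" for w in cleaned),
        "sender_unknown": sender.lower() not in known_senders,
    }


# Hypothetical email and address book.
features = spam_features(
    subject="WIN a FREE prize!!!",
    body="Click now, it is free.",
    sender="promo@example.com",
    known_senders={"alice@example.com"},
)
print(features)
# {'all_caps_word_count': 2, 'contains_free': True, 'sender_unknown': True}
```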

AWS Service

Amazon SageMaker Feature Store: a managed repository to store, share, and reuse features across teams and models, ensuring consistency between training and inference.


Exam Domain (AIF-C01)

Domain 1: Fundamentals of AI and ML

  • Task Statement 1.3: Describe the ML development lifecycle. Feature engineering is an explicit step between data preparation and model training.

References