Main Concept

A quick reference to the internal components of a Transformer. For AIF-C01, the two most important to understand conceptually are Self-Attention and Positional Encoding — the rest matter for implementation but are unlikely to be tested.

Components

Self-Attention

The core innovation of the Transformer. Each word (token) looks at every other word in the sequence and decides how much attention to pay to it. This is how the model understands context — for example, linking a pronoun (“it”) back to the noun it refers to several sentences earlier.
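This "every token weighs every other token" idea can be sketched numerically. The following is a minimal numpy illustration of scaled dot-product attention, simplified beyond any real implementation: it skips the learned query/key/value projections and just attends over raw toy embeddings.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence of token vectors.

    x: (seq_len, d) array. For simplicity the queries, keys, and values
    are the embeddings themselves; a real layer learns separate projections.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # similarity of every token to every other
    # Softmax: each row becomes a probability distribution over the sequence
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights  # context-mixed vectors, attention map

# Three toy 2-d token embeddings
tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out, w = self_attention(tokens)
print(np.round(w, 2))  # row i = how much token i attends to each token
```

Each row of `w` sums to 1, so every token's new representation is a weighted blend of the whole sequence — which is exactly how "it" can pull in information from the noun it refers to.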

Positional Encoding

Transformers process all tokens in parallel, which means they have no built-in sense of word order. Positional encoding adds information about each token’s position in the sequence so the model knows “this word came first, that one came third.”
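One common scheme (the sinusoidal encoding from the original Transformer paper) can be sketched in a few lines of numpy; the exact formula is not exam material, but seeing it makes clear that each position gets a unique, fixed vector that is simply added to the token embedding.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]        # position index per row
    i = np.arange(d_model)[None, :]          # embedding dimension per column
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(seq_len=4, d_model=8)
# Injecting order: identical words at different positions now differ.
# embeddings_with_order = token_embeddings + pe
```

Because the pattern differs at every position, the model can distinguish "this word came first" from "that one came third" even though all tokens are processed in parallel.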

Multi-Head Attention

Runs multiple self-attention processes in parallel, each potentially capturing a different type of relationship (e.g., one focused on grammar, another on meaning). The outputs of all heads are then concatenated back into a single representation.
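The "split, attend separately, recombine" pattern can be sketched as follows — a simplified numpy illustration that slices the embedding into per-head chunks and omits the learned per-head and output projections a real layer would use.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, n_heads):
    """Split the embedding into n_heads chunks, attend in each, concatenate.

    Each head sees only its own slice, so each can learn a different
    type of relationship between tokens.
    """
    seq_len, d = x.shape
    d_head = d // n_heads
    heads = []
    for h in range(n_heads):
        xh = x[:, h * d_head:(h + 1) * d_head]    # this head's slice
        w = softmax(xh @ xh.T / np.sqrt(d_head))  # head-specific attention map
        heads.append(w @ xh)
    return np.concatenate(heads, axis=-1)         # back to (seq_len, d)

x = np.random.default_rng(0).normal(size=(3, 8))
out = multi_head_attention(x, n_heads=2)
```

The output has the same shape as the input, so multi-head attention is a drop-in replacement for single-head attention in the stack.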

Feed-Forward Networks

Standard neural network layers applied after the attention step to further process the information.
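Concretely, this is a small two-layer network applied to each token's vector independently. A minimal numpy sketch (dimensions and random weights are illustrative, not from any spec):

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward block: expand, apply ReLU, project back.

    Runs on each token's vector independently, after the attention step.
    """
    hidden = np.maximum(0, x @ w1 + b1)  # ReLU non-linearity
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
d, d_ff = 4, 16  # the hidden layer is typically wider than the model dimension
x = rng.normal(size=(3, d))
out = feed_forward(x,
                   rng.normal(size=(d, d_ff)), np.zeros(d_ff),
                   rng.normal(size=(d_ff, d)), np.zeros(d))
```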

Residual Connections & Layer Normalization

Technical mechanisms that stabilize training and allow the model to be very deep (many layers) without degrading performance. Not exam-relevant at the conceptual level.
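For the curious, the mechanics fit in a few lines: add the sub-layer's output back onto its input (the residual), then normalize each token vector. This numpy sketch uses the post-norm arrangement and omits the learned scale/shift parameters real layer normalization includes.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Add the sub-layer's output to its input, then normalize."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(3, 4))
out = residual_block(x, lambda v: 0.5 * v)  # stand-in for attention or feed-forward
```

The residual path lets gradients flow straight through many stacked layers, which is what allows very deep Transformers to train without degrading.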


For AIF-C01: Know that Self-Attention is what makes Transformers powerful, and that Positional Encoding solves the word-order problem created by parallel processing. The other components are supporting infrastructure.