Main Concept

Embeddings are numerical representations of real-world objects—such as words, images, or audio—stored as a list of numbers called a vector. Instead of treating data as simple text or pixels, embeddings capture the “meaning” or relationships between items by mapping them into a multidimensional space.

Context

  • Computers only understand numbers; however, human language and visual patterns are complex and full of nuances. Embeddings convert complex, unstructured data into a format that machine learning models can process efficiently.
  • For instance, embeddings allow LLMs to recognize that the words “king” and “queen” are more closely related than “king” and “apple.”
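The "king"/"queen" vs. "king"/"apple" relationship can be sketched with cosine similarity over toy vectors. The 3-dimensional values below are invented purely for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math

# Toy 3-dimensional vectors -- invented for illustration only;
# real embedding models learn much higher-dimensional values.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(vectors["king"], vectors["queen"]))  # close to 1.0
print(cosine_similarity(vectors["king"], vectors["apple"]))  # much lower
```

Because "king" and "queen" point in nearly the same direction, their similarity is close to 1, while "king" and "apple" score far lower.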

Key Aspects

  • The embedding step comes after tokenization: each token (identified by its Token ID) is passed through an embedding model, which converts it into a multidimensional vector.
  • Vectors possess a high dimensionality to capture a vast array of features from a single token, such as semantic meaning, syntactic role, sentiment, and more.
  • By utilizing multidimensional vectors, models can identify similarities between words (or images) through mathematical operations between vectors (such as cosine similarity).
  • Embeddings are stored in specialized vector databases (such as Amazon OpenSearch or Amazon Aurora PostgreSQL with pgvector), which are optimized for efficient storage and high-speed similarity searches.
  • Words with a semantic relationship have similar embeddings; for example, “dog” and “puppy,” or “cat” and “kitten.”
  • Embeddings are what allow vector databases to perform similarity search.
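The tokenization-then-lookup pipeline described above can be sketched as a table lookup. The vocabulary, the 4-dimensional embedding matrix, and the whitespace tokenizer below are all simplifying assumptions; in a real model the matrix is learned during training and tokenization is subword-based.

```python
# Minimal sketch of the token-ID -> vector lookup step, assuming a
# tiny made-up vocabulary and invented 4-dimensional embedding values.
vocab = {"i": 0, "love": 1, "my": 2, "cat": 3}

# Embedding matrix: row i holds the vector for token ID i.
embedding_matrix = [
    [0.10, -0.20, 0.30, 0.05],   # "i"
    [0.12, -0.54, 0.88, 0.40],   # "love"
    [0.05,  0.01, 0.10, 0.02],   # "my"
    [0.70,  0.60, 0.20, 0.90],   # "cat"
]

def embed(sentence):
    """Tokenize (naive whitespace split), map words to IDs, look up vectors."""
    token_ids = [vocab[word] for word in sentence.lower().split()]
    return [embedding_matrix[tid] for tid in token_ids]

vectors = embed("I love my cat")
print(len(vectors))   # 4 tokens -> 4 vectors
print(vectors[1])     # the vector for "love"
```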

Applications

  • Semantic Search
  • Recommendation Engines
  • Retrieval Augmented Generation
  • Anomaly Detection
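Semantic search, the first application above, reduces to ranking stored vectors by similarity to a query vector. The in-memory "store" and its 3-dimensional vectors below are invented stand-ins; a vector database such as OpenSearch or pgvector performs the same ranking at scale with approximate-nearest-neighbor indexes.

```python
import math

# Toy semantic search: rank stored items by cosine similarity to a query.
# Keys and vector values are invented for illustration.
store = {
    "dog care tips":     [0.9, 0.1, 0.2],
    "puppy training":    [0.8, 0.2, 0.3],
    "stock market news": [0.1, 0.9, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def search(query_vec, k=2):
    """Return the k stored keys most similar to the query vector."""
    ranked = sorted(store, key=lambda key: cosine(store[key], query_vec),
                    reverse=True)
    return ranked[:k]

# A query vector that, by construction, points toward the dog-related items.
print(search([0.85, 0.15, 0.25]))
```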

Examples

  • Take the word “love”. Its vector representation captures that it is an emotion, a noun (or verb depending on context), carries a positive valence, and is semantically related to “affection” or “caring.”
  • Text: The sentence “I love my cat” is converted into a vector like [0.12, -0.54, 0.88…]. A similar sentence like “I adore my kitten” would result in a vector that is mathematically very close to the first one.
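The "I love my cat" / "I adore my kitten" example can be made concrete by mean-pooling word vectors into a sentence vector. Both the word vectors and the pooling strategy below are simplifying assumptions; real sentence encoders are far more sophisticated, but the core idea that similar sentences land close together in vector space is the same.

```python
import math

# Invented word vectors: synonyms get deliberately similar values.
word_vecs = {
    "i":      [0.10, 0.00, 0.10],
    "love":   [0.90, 0.10, 0.80],
    "adore":  [0.85, 0.15, 0.75],
    "my":     [0.00, 0.10, 0.00],
    "cat":    [0.20, 0.90, 0.30],
    "kitten": [0.25, 0.85, 0.35],
}

def sentence_vector(sentence):
    """Average the word vectors -- simple 'mean pooling'."""
    vecs = [word_vecs[w] for w in sentence.lower().split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

v1 = sentence_vector("I love my cat")
v2 = sentence_vector("I adore my kitten")
print(cosine(v1, v2))  # very close to 1.0
```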
