Main Concept
Embeddings are numerical representations of real-world objects—such as words, images, or audio—stored as a list of numbers called a vector. Instead of treating data as simple text or pixels, embeddings capture the “meaning” or relationships between items by mapping them into a multidimensional space.
Context
- Computers only understand numbers; however, human language and visual patterns are complex and full of nuances. Embeddings convert complex, unstructured data into a format that machine learning models can process efficiently.
- For instance, embeddings allow LLMs to recognize that the words “king” and “queen” are more closely related than “king” and “apple.”
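The king/queen/apple relationship above can be sketched with cosine similarity over small made-up vectors. The embeddings below are illustrative placeholders, not the output of any real model (which would use hundreds of dimensions):

```python
import math

# Hypothetical 4-dimensional embeddings; the numbers are illustrative only.
embeddings = {
    "king":  [0.80, 0.65, 0.10, 0.20],
    "queen": [0.75, 0.70, 0.15, 0.22],
    "apple": [0.10, 0.05, 0.90, 0.60],
}

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "king" is closer to "queen" than to "apple" in this toy space.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

With real embeddings from a trained model, the same comparison holds: related words score near 1.0, unrelated words score much lower.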
Key Aspects
- The embedding step comes after tokenization: each token (identified by its token ID) is passed through an embedding model, which converts it into a multidimensional vector.
- Vectors possess a high dimensionality to capture a vast array of features from a single token, such as semantic meaning, syntactic role, sentiment, and more.
- By utilizing multidimensional vectors, models can identify similarities between words (or images) through mathematical operations between vectors (such as cosine similarity).
- Embeddings are stored in specialized vector databases (such as Amazon OpenSearch or Amazon Aurora with pgvector), which are optimized for efficient storage and high-speed similarity searches.
- Words with a semantic relationship have similar embeddings, for example: "dog" and "puppy", or "cat" and "kitten".
- Embeddings allow vector databases to do similarity search.
Applications
- Semantic Search
- Recommendation Engines
- Retrieval Augmented Generation
- Anomaly Detection
Examples
- Take the word “love”. Its vector representation captures that it is an emotion, a noun (or verb depending on context), carries a positive valence, and is semantically related to “affection” or “caring.”
- Text: The sentence "I love my cat" is converted into a vector like [0.12, -0.54, 0.88…]. A similar sentence such as "I adore my kitten" would result in a vector that is mathematically very close to the first one.
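The sentence example above can be made concrete. The vectors below are hypothetical 3-dimensional stand-ins (a real sentence-embedding model outputs hundreds of dimensions); an unrelated sentence is added to show the contrast:

```python
import math

# Hypothetical sentence embeddings, truncated to 3 dimensions for illustration.
love_my_cat    = [0.12, -0.54, 0.88]
adore_kitten   = [0.15, -0.50, 0.85]   # similar meaning -> nearby vector
stock_market   = [-0.70, 0.60, -0.10]  # unrelated meaning -> distant vector

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(cosine_similarity(love_my_cat, adore_kitten))   # close to 1.0
print(cosine_similarity(love_my_cat, stock_market))   # low (negative here)
```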
Related Concepts:
Links:
References