Main Concept
Amazon Polly is the opposite of Amazon Transcribe: it converts text into realistic speech using deep learning. It lets you build applications that talk, generating natural-sounding audio from written text with no ML expertise required.
Key Idea
Input → written text.
Output → realistic spoken audio.
Opposite of → Amazon Transcribe (which goes speech → text).
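The text-in, audio-out flow above can be sketched with boto3, the AWS SDK for Python. This is a minimal sketch rather than anything the exam asks for. It assumes boto3 is installed and AWS credentials are configured; the voice "Joanna", the neural engine, and the file name speech.mp3 are illustrative choices, not values from these notes.

```python
def build_request(text, voice_id="Joanna", engine="neural"):
    """Build the keyword arguments for Polly's SynthesizeSpeech call."""
    return {
        "Text": text,            # input: written text
        "OutputFormat": "mp3",   # output: audio (other formats such as pcm exist)
        "VoiceId": voice_id,     # illustrative voice choice (assumption)
        "Engine": engine,        # illustrative engine choice (assumption)
    }


def synthesize_to_file(text, path="speech.mp3"):
    """Send text to Polly and save the returned audio (needs AWS access)."""
    import boto3  # imported here so build_request works without the SDK

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_request(text))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
    return path


request = build_request("Hello from Amazon Polly.")
```

Only synthesize_to_file talks to AWS; build_request just assembles the parameters, so it can be inspected without an account.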
Core Advanced Features
Lexicons
Define how Polly should pronounce specific words or abbreviations — teaching it to say something different from what is literally written.
Example
Written text: “AWS” → Polly says: “Amazon Web Services”
Written text: “W3C” → Polly says: “World Wide Web Consortium”
Use case: technical content, brand names, or acronyms that should be spoken in full rather than spelled out.
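Polly lexicons are XML documents in the W3C Pronunciation Lexicon Specification (PLS) format, where a grapheme is the written form and an alias is what should be spoken instead. The sketch below generates a stripped-down lexicon for the two examples above. The helper names, the lexicon name passed to upload_lexicon, and the en-US language are assumptions for illustration; uploading requires boto3 and AWS credentials.

```python
from xml.sax.saxutils import escape

PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, lang="en-US"):
    """Return a PLS document mapping each written form to its spoken alias."""
    lexemes = "".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(written)}</grapheme>\n"
        f"    <alias>{escape(spoken)}</alias>\n"
        "  </lexeme>\n"
        for written, spoken in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}" '
        f'alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}"
        "</lexicon>\n"
    )


def upload_lexicon(name, aliases):
    """Store the lexicon in Polly under a short name (needs AWS access)."""
    import boto3  # imported here so build_lexicon works without the SDK

    boto3.client("polly").put_lexicon(Name=name, Content=build_lexicon(aliases))


lexicon_xml = build_lexicon({
    "AWS": "Amazon Web Services",
    "W3C": "World Wide Web Consortium",
})
```

Once a lexicon is stored, it is applied by passing its name in the LexiconNames list of a SynthesizeSpeech request.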
SSML — Speech Synthesis Markup Language
A markup language that gives Polly precise instructions on HOW to pronounce text — pauses, emphasis, whispering, abbreviations, and more. You wrap text in SSML tags instead of sending plain text.
Key Idea
Plain text → Polly decides how to read it.
SSML → YOU control exactly how it is read (pauses, emphasis, whisper, pronunciation).
Example
SSML input:
<speak>Hello <break time="2s"/> how are you?</speak>
Polly output: says “Hello” → pauses 2 seconds → says “how are you?”
Without SSML, Polly would read the sentence straight through without the pause. Note that SSML sent to Polly must be wrapped in a <speak> root tag.
Use case: audio content that requires precise pacing — audiobooks, announcements, interactive voice response (IVR) systems.
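To make the plain text versus SSML distinction concrete, here is a small sketch that builds the pause example and the matching request parameters. The function names and the voice "Joanna" are hypothetical; the point it illustrates is that the request must set TextType to "ssml", otherwise Polly treats the tags as plain text.

```python
from xml.sax.saxutils import escape


def with_pause(before, after, seconds=2):
    """Wrap two text fragments in SSML with a pause between them."""
    return (
        f"<speak>{escape(before)} "
        f'<break time="{seconds}s"/> '
        f"{escape(after)}</speak>"
    )


def ssml_request(ssml, voice_id="Joanna"):
    """Keyword arguments for SynthesizeSpeech when the input is SSML."""
    return {
        "Text": ssml,
        "TextType": "ssml",      # tells Polly to interpret the tags
        "OutputFormat": "mp3",
        "VoiceId": voice_id,     # illustrative voice choice (assumption)
    }


ssml = with_pause("Hello", "how are you?")
print(ssml)  # <speak>Hello <break time="2s"/> how are you?</speak>
```

Escaping the text fragments keeps characters such as & or < in the content from being mistaken for markup.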
Voice Engines
Polly offers multiple voice engine types, from oldest to newest:
Standard → basic text-to-speech, older generation
Neural → more natural-sounding, uses neural networks
Long-form → optimized for longer content like articles and books
Generative → newest, most human-like voices
Key Idea
The newer the engine, the more human-like the voice.
Generative is the most advanced and most natural-sounding.
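In the API, the engine is selected with the Engine parameter, whose values are "standard", "neural", "long-form", and "generative". Not every voice is offered on every engine, so a caller may want the newest engine that a given voice supports. The sketch below shows one way to do that, using the oldest-to-newest ordering from these notes; the helper names are hypothetical, and the lookup ignores result pagination for brevity.

```python
ENGINES_OLDEST_TO_NEWEST = ["standard", "neural", "long-form", "generative"]


def newest_supported_engine(supported_engines):
    """Pick the newest engine (per the notes' ordering) that a voice supports."""
    for engine in reversed(ENGINES_OLDEST_TO_NEWEST):
        if engine in supported_engines:
            return engine
    raise ValueError("voice supports none of the known engines")


def supported_engines_for(voice_id):
    """Look up a voice's engines via DescribeVoices (needs AWS access)."""
    import boto3  # imported here so the pure helper works without the SDK

    voices = boto3.client("polly").describe_voices()["Voices"]
    for voice in voices:
        if voice["Id"] == voice_id:
            return voice["SupportedEngines"]
    raise LookupError(f"unknown voice: {voice_id}")


choice = newest_supported_engine(["standard", "neural"])
print(choice)  # neural
```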
Speech Marks
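Metadata that tells you exactly WHEN in the audio a specific word or sentence starts (a time offset in milliseconds), along with where that word or sentence sits in the input text. Polly returns speech marks as JSON; you request them for the same text you synthesize, in a request separate from the one that returns the audio.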
Example
Use case 1: lip-syncing — a video game character’s mouth movements need to match the exact timing of each word in the audio.
Use case 2: word highlighting — an e-learning application highlights each word on screen as it is spoken, in real time.
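For the word-highlighting use case, speech marks arrive as one JSON object per line, each with a time in milliseconds, a type, and the text value. The sketch below parses such a stream into a timeline an application could use to highlight words. The sample data is illustrative (the timings are made up) and the helper names are hypothetical; a real request would set OutputFormat to "json" and list the wanted SpeechMarkTypes.

```python
import json


def speech_marks_request(text, voice_id="Joanna"):
    """Keyword arguments for a SynthesizeSpeech call that returns speech marks."""
    return {
        "Text": text,
        "OutputFormat": "json",                   # marks, not audio
        "SpeechMarkTypes": ["sentence", "word"],
        "VoiceId": voice_id,                      # illustrative choice (assumption)
    }


def parse_speech_marks(raw):
    """Parse a speech mark stream: one JSON object per non-empty line."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def word_timeline(marks):
    """Return (time_ms, word) pairs for highlighting words as they are spoken."""
    return [(mark["time"], mark["value"]) for mark in marks if mark["type"] == "word"]


# Illustrative sample in the shape Polly returns; the timings are invented.
sample = "\n".join([
    '{"time": 0, "type": "sentence", "start": 0, "end": 23, "value": "Mary had a little lamb."}',
    '{"time": 6, "type": "word", "start": 0, "end": 4, "value": "Mary"}',
    '{"time": 373, "type": "word", "start": 5, "end": 8, "value": "had"}',
])
timeline = word_timeline(parse_speech_marks(sample))
print(timeline)  # [(6, 'Mary'), (373, 'had')]
```

For lip-syncing, the "viseme" speech mark type plays the same role, giving the timing of mouth shapes instead of words.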
Common Exam Scenarios
Key Idea: When the answer is Amazon Polly
“Convert article text to audio for a podcast-style feature” → Amazon Polly.
“Build an application that reads content aloud to visually impaired users” → Amazon Polly.
“Generate voice responses for an IVR (Interactive Voice Response) phone system” → Amazon Polly.
“Control exactly how abbreviations are pronounced in generated audio” → Lexicons.
“Add precise pauses and emphasis to synthesized speech” → SSML.
“Sync mouth animations to spoken audio in a game character” → Speech Marks.
Exam Scope
You will not be asked how to implement Polly. You need to:
- Know what Polly does (text-to-speech).
- Know it is the opposite of Amazon Transcribe.
- Recognize the four advanced features: Lexicons, SSML, Voice Engines, Speech Marks.
- Match each feature to its use case in a scenario question.
- Know that Generative is the newest and most human-like voice engine.
Exam Domain
- Domain 1, Task Statement 1.2: “Explain the capabilities of AWS managed AI/ML services (for example, Amazon Polly).”
- Domain 1, Task Statement 1.2: “Identify examples of real-world AI applications (for example, speech recognition).”