Amazon Polly

Main Concept

Amazon Polly is the opposite of Amazon Transcribe — it converts text into realistic speech using deep learning. It allows you to build applications that talk, generating natural-sounding audio from written text without any ML expertise required.

Key Idea

Input → written text.

Output → realistic spoken audio.

Opposite of → Amazon Transcribe (which goes speech → text).
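The text-in, audio-out flow can be sketched with the AWS SDK for Python (boto3). This is a minimal illustration, not something the exam requires. The helper names, the default voice "Joanna", and the MP3 output format are assumptions chosen for the example. The actual call needs boto3, AWS credentials, and a configured region.

```python
def build_speech_request(text, voice_id="Joanna", output_format="mp3"):
    """Return keyword arguments for Polly's SynthesizeSpeech operation.

    The voice and format defaults are illustrative choices.
    """
    return {"Text": text, "VoiceId": voice_id, "OutputFormat": output_format}


def synthesize_to_file(text, path):
    """Send the request to Polly and save the returned audio stream.

    boto3 is imported here so the request builder above can be used
    without the SDK or credentials installed.
    """
    import boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_speech_request(text))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())


if __name__ == "__main__":
    # Inspect the request without calling AWS.
    print(build_speech_request("Hello from Amazon Polly."))
```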

Core Advanced Features

Lexicons

Define how Polly should pronounce specific words or abbreviations — teaching it to say something different from what is literally written.

Example

Written text: “AWS” → Polly says: “Amazon Web Services”

Written text: “W3C” → Polly says: “World Wide Web Consortium”

Use case: technical content, brand names, or acronyms that should be spoken in full rather than spelled out.
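Polly lexicons are XML documents in the W3C Pronunciation Lexicon Specification (PLS) format. Each entry pairs the written form (grapheme) with the words to speak instead (alias). The sketch below builds such a document for the two examples above. The function names and the lexicon name "acronyms" are hypothetical, and the upload step assumes boto3 and AWS credentials are available.

```python
from xml.sax.saxutils import escape


def build_lexicon(aliases, language="en-US"):
    """Build a PLS lexicon mapping written forms to spoken aliases."""
    lexemes = []
    for written, spoken in aliases.items():
        lexemes.append(
            "  <lexeme>\n"
            f"    <grapheme>{escape(written)}</grapheme>\n"
            f"    <alias>{escape(spoken)}</alias>\n"
            "  </lexeme>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<lexicon version="1.0"\n'
        '    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"\n'
        f'    alphabet="ipa" xml:lang="{language}">\n'
        + "\n".join(lexemes)
        + "\n</lexicon>"
    )


def upload_lexicon(name, content):
    """Store the lexicon in Polly (needs boto3 and AWS credentials)."""
    import boto3

    boto3.client("polly").put_lexicon(Name=name, Content=content)


if __name__ == "__main__":
    print(build_lexicon({
        "AWS": "Amazon Web Services",
        "W3C": "World Wide Web Consortium",
    }))
```

Once a lexicon is stored, a synthesis request applies it by name through the LexiconNames parameter, for example LexiconNames=["acronyms"].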

SSML — Speech Synthesis Markup Language

A markup language that gives Polly precise instructions on HOW to pronounce text — pauses, emphasis, whispering, abbreviations, and more. You wrap text in SSML tags instead of sending plain text.

Key Idea

Plain text → Polly decides how to read it.

SSML → YOU control exactly how it is read (pauses, emphasis, whisper, pronunciation).

Example

SSML input: <speak>Hello <break time="2s"/> how are you?</speak>

Polly output: says “Hello” → pauses 2 seconds → says “how are you?”

Without SSML it would read everything continuously without the pause.

Use case: audio content that requires precise pacing — audiobooks, announcements, interactive voice response (IVR) systems.
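As a rough sketch of how the pause example could be sent to Polly: the text is wrapped in a <speak> root element and the request sets TextType to "ssml" so the tags are interpreted rather than read aloud. The helper names and the default voice are assumptions for illustration.

```python
from xml.sax.saxutils import escape


def ssml_with_pause(before, after, seconds):
    """Wrap two phrases in <speak> with a timed <break> between them."""
    return (
        f"<speak>{escape(before)} "
        f'<break time="{seconds}s"/> '
        f"{escape(after)}</speak>"
    )


def build_ssml_request(ssml, voice_id="Joanna"):
    """Keyword arguments for SynthesizeSpeech with SSML input."""
    return {
        "Text": ssml,
        "TextType": "ssml",  # without this, Polly treats the input as plain text
        "VoiceId": voice_id,
        "OutputFormat": "mp3",
    }


if __name__ == "__main__":
    ssml = ssml_with_pause("Hello", "how are you?", 2)
    print(ssml)
    print(build_ssml_request(ssml))
```

Escaping the phrases keeps characters such as & or < in the source text from breaking the SSML markup.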

Voice Engines

Polly offers multiple voice engine types, from oldest to newest:

Standard   → basic text-to-speech, older generation
Neural     → more natural-sounding, uses neural networks
Long-form  → optimized for longer content like articles and books
Generative → newest, most human-like voices

Key Idea

The newer the engine, the more human-like the voice.

Generative is the most advanced and most natural-sounding.
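In the API, the engine is chosen per request through the Engine parameter of SynthesizeSpeech. The sketch below validates the engine name and builds the request. The helper name is hypothetical, and the voice ID is left to the caller because not every voice is available on every engine.

```python
# Engine names accepted by SynthesizeSpeech, oldest generation first.
ENGINES = ("standard", "neural", "long-form", "generative")


def build_engine_request(text, engine, voice_id):
    """Keyword arguments for SynthesizeSpeech with an explicit engine.

    Raises ValueError for a name that is not one of the four engines.
    Whether the chosen voice supports the engine is not checked here.
    """
    if engine not in ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {ENGINES}")
    return {
        "Text": text,
        "Engine": engine,
        "VoiceId": voice_id,
        "OutputFormat": "mp3",
    }


if __name__ == "__main__":
    print(build_engine_request("Welcome back.", "neural", "Joanna"))
```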

Speech Marks

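Metadata that tells you exactly WHEN in the audio each sentence, word, or viseme (mouth shape) begins, as a time offset in milliseconds. Polly returns speech marks as JSON in a separate request from the audio itself, so an application makes one call for the audio and one for the marks, using the same text and voice.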

Example

Use case 1: lip-syncing — a video game character’s mouth movements need to match the exact timing of each word in the audio.

Use case 2: word highlighting — an e-learning application highlights each word on screen as it is spoken, in real time.
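Speech marks are requested by setting OutputFormat to "json" and listing the wanted types in SpeechMarkTypes. The response is a stream with one JSON object per line. The sketch below parses such a stream into word timings that a highlighter could use. The sample data is made up for illustration and is not real Polly output, and the helper names are hypothetical.

```python
import json

# Illustrative speech-mark lines; "time" is milliseconds from the start
# of the audio, "start"/"end" are offsets into the input text.
SAMPLE_MARKS = "\n".join([
    '{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}',
    '{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}',
    '{"time":373,"type":"word","start":5,"end":8,"value":"had"}',
])


def build_marks_request(text, voice_id="Joanna"):
    """Keyword arguments for a speech-mark request (JSON, not audio)."""
    return {
        "Text": text,
        "VoiceId": voice_id,
        "OutputFormat": "json",
        "SpeechMarkTypes": ["sentence", "word"],
    }


def parse_speech_marks(raw):
    """Parse a line-delimited JSON speech-mark stream into dicts."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def word_timings(marks):
    """Return (word, start_time_ms) pairs for word-type marks only."""
    return [(m["value"], m["time"]) for m in marks if m["type"] == "word"]


if __name__ == "__main__":
    for word, start_ms in word_timings(parse_speech_marks(SAMPLE_MARKS)):
        print(f"{start_ms:>5} ms  {word}")
```

For lip-syncing, the same approach would request the "viseme" mark type and filter on it instead of "word".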

Common Exam Scenarios

Key Idea: When the answer is Amazon Polly

“Convert article text to audio for a podcast-style feature” → Amazon Polly.

“Build an application that reads content aloud to visually impaired users” → Amazon Polly.

“Generate voice responses for an IVR (Interactive Voice Response) phone system” → Amazon Polly.

“Control exactly how abbreviations are pronounced in generated audio” → Lexicons.

“Add precise pauses and emphasis to synthesized speech” → SSML.

“Sync mouth animations to spoken audio in a game character” → Speech Marks.

Exam Scope

You will not be asked how to implement Polly. You need to:

Know what Polly does (text-to-speech).
Know it is the opposite of Amazon Transcribe.
Recognize the four advanced features: Lexicons, SSML, Voice Engines, Speech Marks.
Match each feature to its use case in a scenario question.
Know that Generative is the newest and most human-like voice engine.

Exam Domain

Domain 1, Task Statement 1.2: “Explain the capabilities of AWS managed AI/ML services (for example, Amazon Polly).”
Domain 1, Task Statement 1.2: “Identify examples of real-world AI applications (for example, speech recognition).”
