Main Concept

Amazon Polly is the counterpart to Amazon Transcribe: it converts text into realistic speech using deep learning. It lets you build applications that talk, generating natural-sounding audio from written text with no ML expertise required.

Key Idea

  • Input → written text.

  • Output → realistic spoken audio.

  • Opposite of → Amazon Transcribe (which goes speech → text).

Core Advanced Features

Lexicons

Define how Polly should pronounce specific words or abbreviations — teaching it to say something different from what is literally written.

Example

Written text: “AWS” → Polly says: “Amazon Web Services”
Written text: “W3C” → Polly says: “World Wide Web Consortium”

Use case: technical content, brand names, or acronyms that should be spoken in full rather than spelled out.
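For reference only (the exam does not test implementation), here is a minimal sketch of what a lexicon can look like in practice. Polly lexicons are written in the W3C Pronunciation Lexicon Specification (PLS) XML format. The helper name build_lexicon and the lexicon name "acronyms" are made up for illustration, and the upload step is shown only as a comment because it needs AWS credentials.

```python
from xml.sax.saxutils import escape


def build_lexicon(aliases, lang="en-US"):
    """Return a PLS lexicon document mapping written forms to spoken aliases.

    `aliases` is a dict such as {"W3C": "World Wide Web Consortium"}.
    """
    lexemes = "\n".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(written)}</grapheme>\n"
        f"    <alias>{escape(spoken)}</alias>\n"
        "  </lexeme>"
        for written, spoken in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<lexicon version="1.0"\n'
        '    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"\n'
        '    alphabet="ipa"\n'
        f'    xml:lang="{lang}">\n'
        f"{lexemes}\n"
        "</lexicon>\n"
    )


lexicon_xml = build_lexicon({"W3C": "World Wide Web Consortium"})
print(lexicon_xml)

# Uploading and using the lexicon would look roughly like this with boto3
# (not run here; "acronyms" is a hypothetical lexicon name):
#   polly.put_lexicon(Name="acronyms", Content=lexicon_xml)
#   polly.synthesize_speech(..., LexiconNames=["acronyms"])
```

The grapheme is the text as written and the alias is what should be spoken instead, which mirrors the “W3C” example above.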

SSML — Speech Synthesis Markup Language

A markup language that gives Polly precise instructions on HOW to pronounce text — pauses, emphasis, whispering, abbreviations, and more. You wrap text in SSML tags instead of sending plain text.

Key Idea

  • Plain text → Polly decides how to read it.

  • SSML → YOU control exactly how it is read (pauses, emphasis, whisper, pronunciation).

Example

SSML input: <speak>Hello <break time="2s"/> how are you?</speak>
Polly output: says “Hello” → pauses 2 seconds → says “how are you?”

Without the <break> tag, Polly would read the sentence straight through with no pause.

Use case: audio content that requires precise pacing — audiobooks, announcements, interactive voice response (IVR) systems.
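As an illustration of the pause example, the sketch below assembles an SSML string in Python. The helper name with_pause is hypothetical. It assumes the input is later sent to Polly with the text type set to SSML rather than plain text, and it escapes characters such as "&" so the markup stays valid.

```python
from xml.sax.saxutils import escape


def with_pause(before, after, seconds):
    """Wrap two text fragments in SSML with a pause between them."""
    return (
        f"<speak>{escape(before)} "
        f'<break time="{seconds}s"/> '
        f"{escape(after)}</speak>"
    )


ssml = with_pause("Hello", "how are you?", 2)
print(ssml)  # <speak>Hello <break time="2s"/> how are you?</speak>
```

The whole input sits inside a single <speak> element; the <break> tag is what produces the pause.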

Voice Engines

Polly offers multiple voice engine types, from oldest to newest:

Standard   → basic text-to-speech, older generation
Neural     → more natural-sounding, uses neural networks
Long-form  → optimized for longer content like articles and books
Generative → newest, most human-like voices

Key Idea

  • The newer the engine, the more human-like the voice.

  • Generative is the most advanced and most natural-sounding.
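To make the engine choice concrete, here is a hedged sketch of how an engine might be selected when calling Polly from Python with boto3. The helper names build_request and synthesize are made up for illustration. The voice "Joanna" is only an example: which voices support which engine varies, so check the AWS voice list before relying on a combination. The network call is kept in a separate function that is not executed here, because it needs boto3 and AWS credentials.

```python
ENGINES = ("standard", "neural", "long-form", "generative")


def build_request(text, voice_id, engine="neural", ssml=False):
    """Build keyword arguments for a Polly speech synthesis request."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {ENGINES}")
    return {
        "Text": text,
        "TextType": "ssml" if ssml else "text",
        "VoiceId": voice_id,
        "Engine": engine,
        "OutputFormat": "mp3",
    }


def synthesize(request):
    """Send the request to Polly and return the audio bytes (not run here)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**request)
    return response["AudioStream"].read()


request = build_request("Hello from Polly.", "Joanna", engine="neural")
print(request["Engine"])  # neural
```

Switching engines is then a one-argument change, for example engine="generative", provided the chosen voice supports it.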

Speech Marks

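Metadata that tells you WHEN in the audio each sentence, word, or viseme (mouth shape) begins, given as a millisecond offset from the start of the audio, along with where that item sits in the input text. Speech marks are requested in JSON output format, separately from the audio, so an application typically makes one request for the audio and one for the speech marks.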

Example

Use case 1: lip-syncing — a video game character’s mouth movements need to match the exact timing of each word in the audio.

Use case 2: word highlighting — an e-learning application highlights each word on screen as it is spoken, in real time.
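The word-highlighting use case can be sketched as follows. Polly returns speech marks as one JSON object per line; the sample below follows that shape, but its values are invented for illustration and are not real Polly output. The helper name word_schedule is hypothetical.

```python
import json

# Illustrative speech marks (made-up timings), one JSON object per line.
# "time" is milliseconds from the start of the audio; "start" and "end"
# are offsets of the item within the input text.
SAMPLE = """\
{"time": 0, "type": "sentence", "start": 0, "end": 12, "value": "Hello world."}
{"time": 5, "type": "word", "start": 0, "end": 5, "value": "Hello"}
{"time": 420, "type": "word", "start": 6, "end": 11, "value": "world"}
"""


def word_schedule(raw):
    """Return (time_ms, word) pairs telling a UI when to highlight each word."""
    marks = [json.loads(line) for line in raw.splitlines() if line.strip()]
    return [(m["time"], m["value"]) for m in marks if m["type"] == "word"]


for time_ms, word in word_schedule(SAMPLE):
    print(f"{time_ms:>5} ms  highlight {word!r}")
```

A lip-sync application would work the same way but filter for viseme marks instead of word marks.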

Common Exam Scenarios

Key Idea: When the answer is Amazon Polly

  • “Convert article text to audio for a podcast-style feature” → Amazon Polly.

  • “Build an application that reads content aloud to visually impaired users” → Amazon Polly.

  • “Generate voice responses for an IVR (Interactive Voice Response) phone system” → Amazon Polly.

  • “Control exactly how abbreviations are pronounced in generated audio” → Lexicons.

  • “Add precise pauses and emphasis to synthesized speech” → SSML.

  • “Sync mouth animations to spoken audio in a game character” → Speech Marks.

Exam Scope

You will not be asked how to implement Polly. You need to:

  • Know what Polly does (text-to-speech).
  • Know it is the opposite of Amazon Transcribe.
  • Recognize the four advanced features: Lexicons, SSML, Voice Engines, Speech Marks.
  • Match each feature to its use case in a scenario question.
  • Know that Generative is the newest and most human-like voice engine.

Exam Domain

  • Domain 1, Task Statement 1.2: “Explain the capabilities of AWS managed AI/ML services (for example, Amazon Polly).”
  • Domain 1, Task Statement 1.2: “Identify examples of real-world AI applications (for example, speech recognition).”