Speak
Generate speech audio from text using a Bland TTS voice.
POST
Speak
Overview
Synthesizes text into a WAV audio file using a Bland TTS voice. Set stream: true to receive the audio as a chunked response while it’s being generated.
Requires a Bland TTS voice (BTTS, BTTS_V2, or BTTS_V3).
Every generation is automatically stored and retrievable via List TTS Generations and Get TTS Generation.
Pricing
Text-to-speech pricing scales with plan:

| Plan | Rate |
|---|---|
| Start | $0.02 per 100 characters |
| Build | $0.02 per 150 characters |
| Scale | $0.02 per 200 characters |
| Enterprise | $0.02 per 400 characters |
The cost of each request is returned in the x-cost response header.
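To make the table concrete, here is a minimal sketch that estimates the cost of a request before sending it. It assumes billing is linear in the character count (the table does not say whether partial blocks are rounded up) and that every character of the text is billable. The function name `estimate_cost_usd` is hypothetical; the x-cost response header remains the authoritative figure.

```python
from decimal import Decimal

# Characters covered by each $0.02 unit, copied from the pricing table.
CHARS_PER_UNIT = {"Start": 100, "Build": 150, "Scale": 200, "Enterprise": 400}
UNIT_PRICE_USD = Decimal("0.02")


def estimate_cost_usd(text: str, plan: str) -> Decimal:
    """Estimate synthesis cost for `text` on `plan`.

    Assumption: cost is pro-rated per character with no rounding to
    whole blocks. The real billing granularity is not documented here.
    """
    if plan not in CHARS_PER_UNIT:
        raise ValueError(f"unknown plan: {plan!r}")
    cost = UNIT_PRICE_USD * len(text) / CHARS_PER_UNIT[plan]
    return cost.quantize(Decimal("0.0001"))


if __name__ == "__main__":
    sample = "a" * 1000
    for plan in CHARS_PER_UNIT:
        print(plan, estimate_cost_usd(sample, plan))
```

Using `Decimal` rather than `float` keeps the currency arithmetic exact, which matters when comparing an estimate against a billed amount.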
Headers
Your API key for authentication.
Body Parameters
The text to synthesize. Maximum 5,000 characters per request.
Supports pause markers in the form <|N|>, where N is a duration between 0.1 and 10.0 seconds, for example: "Welcome to Bland. <|0.8|> Let's get started."
ID of the Bland TTS voice to use. Pass either the voice UUID from List Voices or a curated voice name (for example willow, juniper, valentine_experimental).
Audio container/sample rate of the response.
- pcm_8000: 8 kHz PCM16 mono (telephony)
- pcm_16000: 16 kHz PCM16 mono
- pcm_24000: 24 kHz PCM16 mono
- pcm_44100: 44.1 kHz PCM16 mono (default, studio)
- ulaw_8000: 8 kHz u-law mono (telephony)
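The text limits and the pause-marker syntax described above can be checked on the client before a request is sent. The sketch below is one possible way to do that; the helper names `pause` and `validate_text` are hypothetical, and it assumes the 5,000-character limit counts the marker characters too, which the description does not state.

```python
import re

MAX_CHARS = 5000                  # documented per-request maximum
MIN_PAUSE, MAX_PAUSE = 0.1, 10.0  # documented pause range in seconds
PAUSE_RE = re.compile(r"<\|([^|>]*)\|>")


def pause(seconds: float) -> str:
    """Build a pause marker such as <|0.8|>."""
    if not MIN_PAUSE <= seconds <= MAX_PAUSE:
        raise ValueError(f"pause must be {MIN_PAUSE}-{MAX_PAUSE} s, got {seconds}")
    return f"<|{float(seconds)}|>"


def validate_text(text: str) -> list[float]:
    """Check length and pause markers; return the pause durations found."""
    if len(text) > MAX_CHARS:
        raise ValueError(f"text is {len(text)} chars, limit is {MAX_CHARS}")
    pauses = []
    for match in PAUSE_RE.finditer(text):
        try:
            value = float(match.group(1))
        except ValueError:
            raise ValueError(f"invalid pause marker {match.group(0)!r}") from None
        if not MIN_PAUSE <= value <= MAX_PAUSE:
            raise ValueError(f"pause out of range in {match.group(0)!r}")
        pauses.append(value)
    return pauses


if __name__ == "__main__":
    text = f"Welcome to Bland. {pause(0.8)} Let's get started."
    print(text)
    print(validate_text(text))
```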
When true, the response is sent with Transfer-Encoding: chunked as audio is generated. The first 44 bytes are a WAV header with placeholder sizes (0xFFFFFFFF). Subsequent chunks are PCM16 audio data. After the stream closes, the client backfills bytes 4-7 (the RIFF chunk size, which is the total file size minus 8) and bytes 40-43 (the data chunk size).
Language code for synthesis. Defaults to the voice’s primary language. Available languages depend on the voice’s underlying model (V2 and V3 voices support 17+ languages).
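The header backfill for streamed responses can be sketched as follows. The function takes the fully received stream (all chunks concatenated) and rewrites the two placeholder size fields; the name `backfill_wav_header` is hypothetical, and the sketch assumes the 44-byte header layout described above, with both sizes stored as little-endian 32-bit integers as in standard WAV files.

```python
import struct

WAV_HEADER_SIZE = 44  # header length stated for streamed responses


def backfill_wav_header(wav: bytes) -> bytes:
    """Return a copy of a streamed WAV with its size fields filled in.

    Bytes 4-7 hold the RIFF chunk size (total length minus 8) and
    bytes 40-43 hold the data chunk size (total length minus 44).
    """
    if len(wav) < WAV_HEADER_SIZE or wav[:4] != b"RIFF" or wav[8:12] != b"WAVE":
        raise ValueError("input does not start with a 44-byte WAV header")
    patched = bytearray(wav)
    struct.pack_into("<I", patched, 4, len(patched) - 8)
    struct.pack_into("<I", patched, 40, len(patched) - WAV_HEADER_SIZE)
    return bytes(patched)


if __name__ == "__main__":
    # Simulated stream: a header with placeholder sizes plus 1,000 bytes
    # of silent 24 kHz PCM16 mono audio, standing in for a real response.
    placeholder = b"\xff\xff\xff\xff"
    fmt = struct.pack("<IHHIIHH", 16, 1, 1, 24000, 48000, 2, 16)
    header = b"RIFF" + placeholder + b"WAVE" + b"fmt " + fmt + b"data" + placeholder
    fixed = backfill_wav_header(header + bytes(1000))
    print(struct.unpack_from("<I", fixed, 4)[0], struct.unpack_from("<I", fixed, 40)[0])
```

Until the sizes are patched, many players and audio libraries will misread the file length, so this step is worth doing before saving a streamed response to disk.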
Voice consistency control.
- For BTTS (V1) voices, a float between 0.0 and 1.0. Higher values produce more consistent output.
- For BTTS_V2 and BTTS_V3 voices, an integer between 1 and 32 (per_decode). Lower values produce more consistent output.
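Because the consistency control changes both type and direction between voice generations, a small client-side check can catch mistakes early. The sketch below is illustrative only: `validate_consistency` is a hypothetical helper, and it assumes the caller already knows which model (BTTS, BTTS_V2, or BTTS_V3) the chosen voice uses.

```python
def validate_consistency(voice_model: str, value):
    """Validate a consistency value against the documented ranges.

    BTTS (V1): float in [0.0, 1.0], higher is more consistent.
    BTTS_V2 / BTTS_V3: integer in [1, 32], lower is more consistent.
    """
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("consistency must be a number, not a bool")
    if voice_model == "BTTS":
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            raise ValueError("BTTS expects a float between 0.0 and 1.0")
        return float(value)
    if voice_model in ("BTTS_V2", "BTTS_V3"):
        if not isinstance(value, int) or not 1 <= value <= 32:
            raise ValueError(f"{voice_model} expects an integer between 1 and 32")
        return value
    raise ValueError(f"unknown voice model: {voice_model!r}")


if __name__ == "__main__":
    print(validate_consistency("BTTS", 0.7))
    print(validate_consistency("BTTS_V3", 8))
```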
Expressiveness control for BTTS (V1) voices only. Float between 0.0 and 1.0. Higher values produce more expressive speech.
Expressiveness boost flag for BTTS_V2 and BTTS_V3 voices. 0 (off) or 1 (on).
Response
Returns a binary WAV file with Content-Type: audio/x-wav. Inspect response headers for latency and cost.
Time in milliseconds from request to first audio byte (streaming) or full response (non-streaming).
Cost in USD for the synthesis, matching what was billed.
Always audio/x-wav on success.
Total bytes in the audio (non-streaming only).
chunked when stream: true.
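Given the size of the audio data and the output format that was requested, the playback duration follows from simple arithmetic, since every documented format is mono with a fixed sample width. The sketch below is a rough illustration; `duration_seconds` is a hypothetical helper, and `audio_bytes` is assumed to be the size of the sample data only, so any WAV header bytes should be subtracted first.

```python
# (sample rate in Hz, bytes per sample) for each documented format.
# PCM16 uses 2 bytes per sample; u-law uses 1 byte per sample.
FORMATS = {
    "pcm_8000": (8000, 2),
    "pcm_16000": (16000, 2),
    "pcm_24000": (24000, 2),
    "pcm_44100": (44100, 2),
    "ulaw_8000": (8000, 1),
}


def duration_seconds(audio_bytes: int, output_format: str) -> float:
    """Playback duration of mono audio data in the given format."""
    if output_format not in FORMATS:
        raise ValueError(f"unknown output format: {output_format!r}")
    sample_rate, bytes_per_sample = FORMATS[output_format]
    return audio_bytes / (sample_rate * bytes_per_sample)


if __name__ == "__main__":
    print(duration_seconds(88200, "pcm_44100"))  # 1.0
    print(duration_seconds(8000, "ulaw_8000"))   # 1.0
```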