POST /v1/speak
Speak
curl --request POST \
  --url https://api.bland.ai/v1/speak \
  --header 'Content-Type: application/json' \
  --header 'authorization: <authorization>' \
  --data '
{
  "text": "<string>",
  "voice_id": "<string>",
  "output_format": "<string>",
  "stream": true,
  "language": "<string>",
  "consistency": 123,
  "expressiveness": 123,
  "boost": 123
}
'
HTTP/1.1 200 OK
Content-Type: audio/x-wav
Content-Length: 98806
x-latency: 396
x-cost: 0.001

<WAV binary>

Overview

Synthesizes text into a WAV audio file using a Bland TTS voice. Set stream: true to receive the audio as a chunked response while it’s being generated.
Requires a Bland TTS voice (BTTS, BTTS_V2, or BTTS_V3).
Every generation is automatically stored and retrievable via List TTS Generations and Get TTS Generation.
For lower-latency streaming with chunked WAV-header backfill, use Stream Speech instead.
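As a rough Python counterpart to the curl request, the sketch below builds the same POST with the standard library. The helper names `build_speak_request` and `synthesize` are hypothetical, and the `authorization` header is assumed to carry the raw API key, as in the curl example.

```python
import json
import urllib.request

SPEAK_URL = "https://api.bland.ai/v1/speak"

def build_speak_request(api_key: str, text: str, voice_id: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/speak request.

    `options` may carry the optional body parameters documented below,
    e.g. output_format="pcm_16000" or stream=True.
    """
    body = {"text": text, "voice_id": voice_id, **options}
    return urllib.request.Request(
        SPEAK_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "authorization": api_key},
        method="POST",
    )

def synthesize(api_key: str, text: str, voice_id: str, **options) -> bytes:
    """Send the request and return the WAV bytes (performs a network call)."""
    request = build_speak_request(api_key, text, voice_id, **options)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling `synthesize("YOUR_API_KEY", "Hello there.", "willow")` and writing the result to a `.wav` file would be the typical use. Error handling is left out of this sketch.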

Pricing

Text-to-speech pricing scales with plan:
Plan          Rate
Start         $0.02 per 100 characters
Build         $0.02 per 150 characters
Scale         $0.02 per 200 characters
Enterprise    $0.02 per 400 characters
Some voices in the public library carry an additional per-character creator fee. The exact cost for a generation is returned in the x-cost response header.
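For budgeting, the plan rates can be turned into a rough estimate. This sketch assumes cost is prorated linearly per character and ignores creator fees. Neither assumption is confirmed here, so treat the x-cost header as the authoritative figure. The function name `estimate_cost` is hypothetical.

```python
# Characters covered by each $0.02 unit, taken from the pricing table.
CHARS_PER_UNIT = {"Start": 100, "Build": 150, "Scale": 200, "Enterprise": 400}
UNIT_PRICE_USD = 0.02

def estimate_cost(text: str, plan: str) -> float:
    """Estimate synthesis cost in USD, ASSUMING linear per-character proration."""
    return UNIT_PRICE_USD * len(text) / CHARS_PER_UNIT[plan]
```

Under that assumption, 300 characters on the Build plan come to about $0.04.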

Headers

authorization
string
required
Your API key for authentication.

Body Parameters

text
string
required
The text to synthesize. Maximum 5,000 characters per request. Supports pause markers in the form <|N|>, where N is a duration between 0.1 and 10.0 seconds. For example: "Welcome to Bland. <|0.8|> Let's get started."
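A client-side check can catch oversized text or out-of-range pause markers before a request is sent. The sketch below is one way to do it. The name `check_text` is hypothetical, and it assumes pause markers count toward the 5,000-character limit, which is not stated above.

```python
import re

MAX_CHARS = 5000
# Matches <|N|> and captures whatever sits between the bars.
PAUSE_RE = re.compile(r"<\|([^|>]*)\|>")

def check_text(text: str) -> list[float]:
    """Return the pause durations found in `text`; raise ValueError if invalid."""
    # ASSUMPTION: the limit applies to the whole string, markers included.
    if len(text) > MAX_CHARS:
        raise ValueError(f"text is {len(text)} characters; maximum is {MAX_CHARS}")
    pauses = []
    for match in PAUSE_RE.finditer(text):
        try:
            seconds = float(match.group(1))
        except ValueError:
            raise ValueError(f"pause marker {match.group(0)!r} is not a number") from None
        if not 0.1 <= seconds <= 10.0:
            raise ValueError(f"pause marker {match.group(0)!r} is outside 0.1-10.0 seconds")
        pauses.append(seconds)
    return pauses
```

For the example sentence above, `check_text` returns a single pause of 0.8 seconds.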
voice_id
string
required
ID of the Bland TTS voice to use. Pass either the voice UUID from List Voices or a curated voice name (for example willow, juniper, valentine_experimental).
output_format
string
default:"pcm_44100"
Audio encoding and sample rate of the response.
  • pcm_8000, 8 kHz PCM16 mono (telephony)
  • pcm_16000, 16 kHz PCM16 mono
  • pcm_24000, 24 kHz PCM16 mono
  • pcm_44100, 44.1 kHz PCM16 mono (default, studio)
  • ulaw_8000, 8 kHz u-law mono (telephony)
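The format names imply a fixed byte rate, so the audio duration can be estimated from the response size. The sketch below assumes a 44-byte WAV header for every format. That header size is only documented for streamed responses, and a u-law WAV header may be longer, so the result is approximate. `audio_seconds` is a hypothetical helper.

```python
# Bytes of audio per second: sample rate x bytes per sample (all formats are mono).
BYTES_PER_SECOND = {
    "pcm_8000": 8000 * 2,    # PCM16 = 2 bytes per sample
    "pcm_16000": 16000 * 2,
    "pcm_24000": 24000 * 2,
    "pcm_44100": 44100 * 2,
    "ulaw_8000": 8000 * 1,   # u-law = 1 byte per sample
}
WAV_HEADER_BYTES = 44  # ASSUMPTION: canonical 44-byte header for all formats

def audio_seconds(total_bytes: int, output_format: str = "pcm_44100") -> float:
    """Approximate playback duration for a response of `total_bytes` bytes."""
    payload = max(total_bytes - WAV_HEADER_BYTES, 0)
    return payload / BYTES_PER_SECOND[output_format]
```

Applied to the sample response (Content-Length: 98806, default format), this gives roughly 1.12 seconds of audio.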
stream
boolean
default:"false"
When true, the response is sent with Transfer-Encoding: chunked as audio is generated. The first 44 bytes are a WAV header whose size fields hold the placeholder 0xFFFFFFFF; subsequent chunks are PCM16 audio data. After the stream closes, the client backfills bytes 4-7 (RIFF chunk size: total bytes received minus 8) and bytes 40-43 (data chunk size: total bytes received minus 44).
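The header backfill for a fully received stream could look like the sketch below. It assumes the complete byte stream has been collected in memory and that the header is the 44-byte layout described above. `backfill_wav_header` is a hypothetical name. WAV size fields are little-endian unsigned 32-bit integers.

```python
import struct

WAV_HEADER_BYTES = 44

def backfill_wav_header(stream_bytes: bytes) -> bytes:
    """Replace the placeholder size fields in a fully received streamed WAV."""
    if len(stream_bytes) < WAV_HEADER_BYTES or stream_bytes[:4] != b"RIFF":
        raise ValueError("expected a RIFF/WAV stream with a 44-byte header")
    buf = bytearray(stream_bytes)
    # Bytes 4-7: RIFF chunk size = everything after the first 8 bytes.
    struct.pack_into("<I", buf, 4, len(buf) - 8)
    # Bytes 40-43: data chunk size = everything after the 44-byte header.
    struct.pack_into("<I", buf, 40, len(buf) - WAV_HEADER_BYTES)
    return bytes(buf)
```

After backfilling, the bytes can be written to disk and opened by players that reject the placeholder sizes.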
language
string
default:"en"
Language code for synthesis. Defaults to the voice’s primary language. Available languages depend on the voice’s underlying model (V2 and V3 voices support 17+ languages).
consistency
number
Voice consistency control.
  • For BTTS (V1) voices, a float between 0.0 and 1.0. Higher values produce more consistent output.
  • For BTTS_V2 and BTTS_V3 voices, an integer between 1 and 32 (per_decode). Lower values produce more consistent output.
expressiveness
number
Expressiveness control for BTTS (V1) voices only. Float between 0.0 and 1.0. Higher values produce more expressive speech.
boost
integer
Expressiveness boost flag for BTTS_V2 and BTTS_V3 voices. 0 (off) or 1 (on).
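Because consistency, expressiveness, and boost take different ranges depending on the voice model, a small client-side check can help avoid mixing them up. In this sketch the caller supplies the model family (BTTS, BTTS_V2, or BTTS_V3), which is not itself a request parameter. `tuning_params` is a hypothetical helper, and rejecting mismatched parameters locally is an assumption: how the API treats them is not stated above.

```python
def tuning_params(model, consistency=None, expressiveness=None, boost=None) -> dict:
    """Validate tuning parameters for a voice model family and return body fields."""
    params = {}
    if model == "BTTS":  # V1 voices: floats in [0.0, 1.0]
        if boost is not None:
            raise ValueError("boost applies to BTTS_V2 and BTTS_V3 voices only")
        for name, value in (("consistency", consistency), ("expressiveness", expressiveness)):
            if value is None:
                continue
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0 for BTTS voices")
            params[name] = float(value)
    elif model in ("BTTS_V2", "BTTS_V3"):
        if expressiveness is not None:
            raise ValueError("expressiveness applies to BTTS (V1) voices only")
        if consistency is not None:
            if not (isinstance(consistency, int) and 1 <= consistency <= 32):
                raise ValueError("consistency must be an integer from 1 to 32 for V2/V3 voices")
            params["consistency"] = consistency  # lower = more consistent
        if boost is not None:
            if boost not in (0, 1):
                raise ValueError("boost must be 0 (off) or 1 (on)")
            params["boost"] = int(boost)
    else:
        raise ValueError(f"unknown voice model: {model!r}")
    return params
```

The returned dictionary can be merged into the request body alongside text and voice_id.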

Response

Returns a binary WAV file with Content-Type: audio/x-wav. Inspect response headers for latency and cost.
x-latency
string
Time in milliseconds from request to first audio byte (streaming) or full response (non-streaming).
x-cost
string
Cost in USD for the synthesis, matching what was billed.
Content-Type
string
Always audio/x-wav on success.
Content-Length
string
Total bytes in the audio (non-streaming only).
Transfer-Encoding
string
chunked when stream: true.
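Since the header values arrive as strings, a client will usually convert them before logging or billing checks. The sketch below assumes the headers are available as a plain mapping of names to strings. `read_speak_headers` and the keys of its result are hypothetical.

```python
def read_speak_headers(headers) -> dict:
    """Convert the documented response headers into typed values."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    length = lowered.get("content-length")
    return {
        "latency_ms": float(lowered["x-latency"]),
        "cost_usd": float(lowered["x-cost"]),
        "streamed": lowered.get("transfer-encoding", "").lower() == "chunked",
        # Content-Length is only present on non-streaming responses.
        "content_length": int(length) if length is not None else None,
    }
```

With the headers from the sample response, this yields a latency of 396 ms, a cost of 0.001 USD, and a content length of 98806 bytes.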
