POST /v1/speak/stream
Stream Speech
curl --request POST \
  --url https://api.bland.ai/v1/speak/stream \
  --header 'Content-Type: application/json' \
  --header 'authorization: <authorization>' \
  --data '
{
  "text": "<string>",
  "voice_id": "<string>",
  "output_format": "<string>",
  "language": "<string>",
  "consistency": 123,
  "expressiveness": 123,
  "boost": 123
}
'
HTTP/1.1 200 OK
Content-Type: audio/x-wav
Transfer-Encoding: chunked
x-latency: 396
x-cost: 0.001

<chunked WAV with placeholder header sizes followed by PCM16 data>

Overview

Streams synthesized audio to the client as it is generated, using Transfer-Encoding: chunked. This endpoint is functionally equivalent to Speak with stream: true, but it is mounted as its own canonical endpoint for parity with our internal services and to avoid prefix conflicts with the non-streaming handler. Use it when time-to-first-byte matters (live previews, voice assistants, IVRs). Use Speak when you want the full audio buffered before it is delivered.
Requires a Bland TTS voice (BTTS, BTTS_V2, or BTTS_V3). Other voice services are not supported on this endpoint.
Every generation is automatically stored and retrievable via List TTS Generations and Get TTS Generation, the same as the non-streaming endpoint.
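
As a rough illustration of consuming the stream described above, the sketch below posts a request with Python's standard library and writes the audio to disk as it arrives. The helper names (build_request, iter_audio, stream_to_file), the 4096-byte read size, and the opener parameter are assumptions made for this example; only the URL, headers, and body fields come from this page.

```python
import json
import urllib.request
from typing import Any, BinaryIO, Callable, Iterator

API_URL = "https://api.bland.ai/v1/speak/stream"


def build_request(api_key: str, text: str, voice_id: str) -> urllib.request.Request:
    """Build the POST request with only the two required body fields."""
    body = json.dumps({"text": text, "voice_id": voice_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json", "authorization": api_key},
        method="POST",
    )


def iter_audio(response: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield audio bytes until the stream ends.

    read() may wait until chunk_size bytes have arrived, so a small
    chunk_size (an assumed value here) keeps playback latency low.
    """
    while True:
        chunk = response.read(chunk_size)
        if not chunk:
            break
        yield chunk


def stream_to_file(
    api_key: str,
    text: str,
    voice_id: str,
    path: str,
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> int:
    """Write the streamed WAV to path and return the number of bytes written.

    opener is injectable (hypothetical design choice) so the function can be
    exercised without a network call.
    """
    total = 0
    with opener(build_request(api_key, text, voice_id)) as response, open(path, "wb") as out:
        for chunk in iter_audio(response):
            out.write(chunk)
            total += len(chunk)
    return total
```

A call such as stream_to_file(api_key, "Hello there", voice_id, "out.wav") would save the audio incrementally. A live-playback client would hand each chunk from iter_audio to an audio device instead of a file.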

Headers

authorization
string
required
Your API key for authentication.

Body Parameters

text
string
required
The text to synthesize. Maximum 5,000 characters per request. Supports pause markers in the form <|N|> (0.1-10.0 seconds).
voice_id
string
required
ID of the Bland TTS voice to use. Pass either the voice UUID from List Voices or a curated voice name.
output_format
string
default:"pcm_44100"
Audio encoding and sample rate. One of:
  • pcm_8000, pcm_16000, pcm_24000, pcm_44100, ulaw_8000
language
string
Language code. Defaults to the voice’s primary language.
consistency
number
V1: float 0.0-1.0 (higher = more consistent). V2/V3: integer 1-32 (lower = more consistent).
expressiveness
number
V1 only. Float 0.0-1.0.
boost
integer
V2/V3 only. 0 or 1.
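
The constraints above can be checked on the client before a request is sent. The following is a minimal sketch, not an official validator: build_payload and check_pauses are hypothetical helpers, and it assumes that the 5,000-character limit counts pause markers and that N in <|N|> is a plain decimal number of seconds. The consistency and expressiveness ranges depend on the voice version, so they are passed through unchecked.

```python
import re
from typing import Any, Dict, Optional

MAX_TEXT_CHARS = 5000
OUTPUT_FORMATS = {"pcm_8000", "pcm_16000", "pcm_24000", "pcm_44100", "ulaw_8000"}
PAUSE_MARKER = re.compile(r"<\|([^|>]*)\|>")


def check_pauses(text: str) -> None:
    """Raise ValueError if any <|N|> marker is outside 0.1-10.0 seconds."""
    for match in PAUSE_MARKER.finditer(text):
        try:
            seconds = float(match.group(1))
        except ValueError:
            raise ValueError(f"pause marker is not a number: {match.group(0)}")
        if not 0.1 <= seconds <= 10.0:
            raise ValueError(f"pause out of range: {match.group(0)}")


def build_payload(
    text: str,
    voice_id: str,
    output_format: str = "pcm_44100",
    language: Optional[str] = None,
    consistency: Optional[float] = None,
    expressiveness: Optional[float] = None,
    boost: Optional[int] = None,
) -> Dict[str, Any]:
    """Return a request body, omitting optional fields that were not set."""
    # Assumption: the limit applies to the raw string, markers included.
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError("text exceeds 5,000 characters")
    check_pauses(text)
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format}")
    if boost is not None and boost not in (0, 1):
        raise ValueError("boost must be 0 or 1")

    payload: Dict[str, Any] = {
        "text": text,
        "voice_id": voice_id,
        "output_format": output_format,
    }
    optional = {
        "language": language,
        "consistency": consistency,        # range depends on voice version
        "expressiveness": expressiveness,  # V1 only
        "boost": boost,                    # V2/V3 only
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

For example, build_payload("Hello <|0.5|> world", voice_id) would pass validation, while a text containing <|20|> would raise ValueError before any request is made.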

Streaming response format

The response is a single WAV file delivered in chunks:
  1. First 44 bytes: a standard WAV header with placeholder sizes. Bytes 4-7 (RIFF chunk size) and bytes 40-43 (data chunk size) are both set to 0xFFFFFFFF because the final length is not yet known.
  2. Subsequent chunks: raw PCM16 audio data, written as it is synthesized.
  3. After the stream closes: the client patches the WAV header in place, setting bytes 4-7 to the total file size minus 8 and bytes 40-43 to the total data chunk size. Most decoders ignore the placeholder sizes and play the file without the patch, but tools that strictly validate the header need it.
Content-Type is audio/x-wav. No Content-Length header is sent.
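
The header patch described in the steps above can be sketched as follows for a stream that was saved to disk. patch_wav_file is a hypothetical helper, and it assumes the file holds exactly one 44-byte header followed only by audio data, so the data chunk size equals the file size minus 44.

```python
import os
import struct

WAV_HEADER_BYTES = 44


def patch_wav_file(path: str) -> None:
    """Overwrite the placeholder sizes in a saved stream with the real sizes."""
    total = os.path.getsize(path)
    if total < WAV_HEADER_BYTES:
        raise ValueError("file is shorter than a 44-byte WAV header")
    with open(path, "r+b") as f:
        header = f.read(WAV_HEADER_BYTES)
        # Sanity-check the layout before writing anything.
        if header[0:4] != b"RIFF" or header[8:12] != b"WAVE" or header[36:40] != b"data":
            raise ValueError("unexpected WAV header layout")
        # Bytes 4-7: RIFF chunk size = total file size minus 8 (little-endian).
        f.seek(4)
        f.write(struct.pack("<I", total - 8))
        # Bytes 40-43: data chunk size (assumed: everything after the header).
        f.seek(40)
        f.write(struct.pack("<I", total - WAV_HEADER_BYTES))
```

Run it once after the response has been fully written. Patching while the stream is still open would record sizes that are too small.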
x-latency
string
Time in milliseconds from request to first audio byte.
x-cost
string
Cost in USD for the synthesis.
Transfer-Encoding
string
Always chunked.
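
Because x-latency and x-cost arrive as strings, a client will usually convert them before logging or aggregating. A small sketch, assuming both values parse as plain decimal numbers (StreamMetrics and parse_metrics are hypothetical names):

```python
from typing import Mapping, NamedTuple


class StreamMetrics(NamedTuple):
    latency_ms: float  # time from request to first audio byte
    cost_usd: float    # cost of the synthesis


def parse_metrics(headers: Mapping[str, str]) -> StreamMetrics:
    """Convert the string response headers into numbers.

    Header names are matched case-insensitively, since HTTP clients
    differ in how they expose header casing.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return StreamMetrics(
        latency_ms=float(lowered["x-latency"]),
        cost_usd=float(lowered["x-cost"]),
    )
```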