Stream Speech
Generate speech audio with chunked transfer for the lowest possible time-to-first-byte.
POST
Overview
Streams synthesized audio to the client as it is generated, using Transfer-Encoding: chunked. Functionally equivalent to Speak with stream: true, but mounted as its own canonical endpoint for parity with our internal services and to avoid prefix conflicts with the non-streaming handler.
Use this endpoint when you care about time-to-first-byte (live preview, voice assistants, IVRs). Use Speak when you want to buffer the full audio before delivering it.
Requires a Bland TTS voice (BTTS, BTTS_V2, or BTTS_V3). Other voice services are not supported on this endpoint.
Every generation is automatically stored and retrievable via List TTS Generations and Get TTS Generation, the same as the non-streaming endpoint.
Headers
Your API key for authentication.
Body Parameters
The text to synthesize. Maximum 5,000 characters per request. Supports pause markers in the form <|N|> (0.1-10.0 seconds).
ID of the Bland TTS voice to use. Pass either the voice UUID from List Voices or a curated voice name.
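The text constraints above can be checked client-side before a request is sent. The following is a minimal sketch, not the service's own validation: the function name `check_tts_text` is hypothetical, and it assumes that the 5,000-character limit is measured on the raw string including any pause markers, which this page does not state.

```python
import re

MAX_CHARS = 5000  # per-request limit stated in the parameter description
PAUSE_RE = re.compile(r"<\|([^|>]*)\|>")  # matches <|N|> and captures N


def check_tts_text(text: str) -> list[str]:
    """Return a list of problems found in `text`; an empty list means OK."""
    problems = []
    # Assumption: the limit applies to the raw string, markers included.
    if len(text) > MAX_CHARS:
        problems.append(f"text is {len(text)} characters; limit is {MAX_CHARS}")
    for match in PAUSE_RE.finditer(text):
        marker = match.group(0)
        try:
            seconds = float(match.group(1))
        except ValueError:
            problems.append(f"pause marker {marker!r} is not numeric")
            continue
        if not 0.1 <= seconds <= 10.0:
            problems.append(f"pause marker {marker!r} is outside 0.1-10.0 seconds")
    return problems
```

For example, `check_tts_text("One moment. <|1.5|> Thanks for waiting.")` returns an empty list, while a marker such as `<|12|>` produces one reported problem.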
Audio container/sample rate.
pcm_8000, pcm_16000, pcm_24000, pcm_44100, ulaw_8000
Language code. Defaults to the voice’s primary language.
V1: float 0.0-1.0 (higher = more consistent). V2/V3: integer 1-32 (lower = more consistent).
V1 only. Float 0.0-1.0.
V2/V3 only. 0 or 1.
Streaming response format
The response is a single WAV file delivered in chunks:
- First 44 bytes: a standard WAV header with placeholder sizes. Bytes 4-7 (RIFF chunk size) and 40-43 (data chunk size) are both filled with 0xFFFFFFFF because the final length is not yet known.
- Subsequent chunks: raw PCM16 audio data, written as it is synthesized.
- After the stream closes, the client patches the WAV header in place: bytes 4-7 become the total file size minus 8, bytes 40-43 become the total data chunk size. Most decoders ignore the placeholder size and play the file fine without the patch, but tools that strictly validate the header will need it.
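The header patch described in the last step can be sketched as follows. This is an illustrative helper rather than official client code: the name `patch_wav_header` is hypothetical, and it assumes the whole response has already been collected into memory. WAV size fields are little-endian unsigned 32-bit integers, hence the `<I` format.

```python
import struct

HEADER_LEN = 44  # standard WAV header length described above


def patch_wav_header(wav: bytes) -> bytes:
    """Return a copy of `wav` with the RIFF and data sizes filled in."""
    if len(wav) < HEADER_LEN or wav[0:4] != b"RIFF" or wav[8:12] != b"WAVE":
        raise ValueError("expected a WAV stream with a 44-byte header")
    patched = bytearray(wav)
    # Bytes 4-7: total file size minus 8.
    struct.pack_into("<I", patched, 4, len(wav) - 8)
    # Bytes 40-43: size of the audio data that follows the header.
    struct.pack_into("<I", patched, 40, len(wav) - HEADER_LEN)
    return bytes(patched)
```

When the stream is written straight to disk instead, the same two offsets can be overwritten by reopening the file in `"r+b"` mode and seeking to bytes 4 and 40.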
Content-Type is audio/x-wav. No Content-Length header is sent.
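Because no Content-Length is sent, a client reads until the stream ends. For low-latency playback it is usually enough to skip the 44-byte header and hand the audio bytes to a player as they arrive. The sketch below assumes the caller already has an iterable of byte chunks from some HTTP client; `pcm_from_chunks` is a hypothetical helper name. Transport chunk boundaries are not guaranteed to line up with the header, so the header is buffered until all 44 bytes have been seen.

```python
from typing import Iterable, Iterator

HEADER_LEN = 44  # WAV header length described above


def pcm_from_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield audio bytes from a chunked WAV stream, skipping the header."""
    pending = bytearray()
    header_done = False
    for chunk in chunks:
        if header_done:
            if chunk:
                yield bytes(chunk)
            continue
        # Still collecting the header; it may span several chunks.
        pending += chunk
        if len(pending) >= HEADER_LEN:
            if pending[0:4] != b"RIFF":
                raise ValueError("stream does not start with a RIFF header")
            header_done = True
            rest = bytes(pending[HEADER_LEN:])
            if rest:
                yield rest
    # If the stream ends before 44 bytes arrive, nothing is yielded.
```

The `chunks` argument can be any byte iterator, for example the result of `iter_content()` on a `requests` response opened with `stream=True`.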
Time in milliseconds from request to first audio byte.
Cost in USD for the synthesis.
Always chunked.