A lightweight, provider-agnostic TypeScript SDK for text-to-speech. One API, 13 providers, zero lock-in. Runs in Node.js, Edge runtimes, and the browser.
Learn more at speechsdk.dev.
- **Universal** — `generateSpeech()` works across OpenAI, ElevenLabs, Deepgram, Cartesia, Hume, Google Gemini TTS, Fish Audio, Inworld, Murf, Resemble, fal, Mistral, and xAI.
- **Streaming** — `streamSpeech()` returns a standard `ReadableStream<Uint8Array>`.
- **Conversations** — `generateConversation()` produces multi-speaker audio, using native dialogue endpoints when available and stitching locally when not.
- **Word-level timestamps** — `timestamps: "on"` returns alignment, using the provider's native data or falling back to STT.
- **Volume normalization** — RMS-levels outputs to an absolute loudness target.
- **Audio tags & voice cloning** — `[laugh]`, `[sigh]`, emotion cues; reference-audio cloning where supported.
- Install · Quick start · Supported providers
- Streaming · Conversations · Timestamps
- Volume normalization · Audio tags · Voice cloning
- Custom configuration · API reference · Error handling · Development
```bash
npm install @speech-sdk/core
```

> [!TIP]
> Using an AI coding assistant? Add the speech-sdk skill to give it full knowledge of this library: `npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk`.
```ts
import { generateSpeech } from '@speech-sdk/core';

const result = await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: 'Hello from speech-sdk!',
  voice: 'alloy',
});

result.audio.uint8Array; // Uint8Array
result.audio.base64;     // string (lazy)
result.audio.mediaType;  // "audio/mpeg"
```

Pass a provider/model string, or just the provider name to use its default model. API keys are read from env vars automatically.
| Provider | Prefix | Default model | Env var |
|---|---|---|---|
| OpenAI | `openai` | `gpt-4o-mini-tts` | `OPENAI_API_KEY` |
| ElevenLabs | `elevenlabs` | `eleven_multilingual_v2` | `ELEVENLABS_API_KEY` |
| Deepgram | `deepgram` | `aura-2` | `DEEPGRAM_API_KEY` |
| Cartesia | `cartesia` | `sonic-3` | `CARTESIA_API_KEY` |
| Hume | `hume` | `octave-2` | `HUME_API_KEY` |
| Inworld | `inworld` | `inworld-tts-1.5-max` | `INWORLD_API_KEY` |
| Google Gemini TTS | `google` | `gemini-2.5-flash-preview-tts` | `GOOGLE_API_KEY` |
| Fish Audio | `fish-audio` | `s2-pro` | `FISH_AUDIO_API_KEY` |
| Murf | `murf` | `GEN2` | `MURF_API_KEY` |
| Resemble | `resemble` | `default` | `RESEMBLE_API_KEY` |
| fal | `fal-ai` | (user-specified) | `FAL_API_KEY` |
| Mistral | `mistral` | `voxtral-mini-tts-2603` | `MISTRAL_API_KEY` |
| xAI | `xai` | `grok-tts` | `XAI_API_KEY` |
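The resolution rule above (a full `provider/model` string, or a bare provider name that falls back to that provider's default model) can be sketched as follows; `parseModelId` and the abridged `DEFAULT_MODELS` map are illustrative names, not the SDK's internals:

```typescript
// Illustrative sketch of the "provider/model" resolution rule.
// Abridged map of per-provider defaults from the table above.
const DEFAULT_MODELS: Record<string, string> = {
  openai: 'gpt-4o-mini-tts',
  elevenlabs: 'eleven_multilingual_v2',
  deepgram: 'aura-2',
};

function parseModelId(id: string): { provider: string; model: string } {
  const slash = id.indexOf('/');
  if (slash === -1) {
    // Bare provider name: fall back to its default model.
    const model = DEFAULT_MODELS[id];
    if (!model) throw new Error(`Unknown provider: ${id}`);
    return { provider: id, model };
  }
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}
```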
Provider-specific parameters pass through via `providerOptions`, using each API's native field names.
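For instance, a hypothetical request forwarding ElevenLabs' own `voice_settings` field (a field from ElevenLabs' public API, not a key the SDK defines) might look like:

```typescript
// Hypothetical call shape: providerOptions is forwarded verbatim, so the keys
// use the provider's own field names (voice_settings is ElevenLabs').
const request = {
  model: 'elevenlabs/eleven_multilingual_v2',
  text: 'Hello!',
  voice: 'JBFqnCBsd6RMkjVDRZzb',
  providerOptions: {
    voice_settings: { stability: 0.5, similarity_boost: 0.75 },
  },
};
// await generateSpeech(request);
```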
`streamSpeech()` returns audio incrementally as a `ReadableStream<Uint8Array>`.
```ts
import { streamSpeech } from '@speech-sdk/core';

const { audio, mediaType } = await streamSpeech({
  model: 'cartesia/sonic-3',
  text: 'Streaming straight to the client.',
  voice: 'voice-id',
});

// Forward to an HTTP response:
return new Response(audio, { headers: { 'Content-Type': mediaType } });
```

> [!NOTE]
> Retries apply only until response headers arrive; mid-stream errors propagate to the consumer. Calling `streamSpeech()` on a non-streaming model throws `StreamingNotSupportedError`.
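Because the stream is a standard `ReadableStream<Uint8Array>`, buffering it (say, to write the audio to disk) needs only a generic helper. A sketch, not part of the SDK:

```typescript
// Generic helper (not part of the SDK): drain a ReadableStream<Uint8Array>
// into one contiguous buffer.
async function collectStream(stream: ReadableStream<Uint8Array>): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  const reader = stream.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done || value === undefined) break;
    chunks.push(value);
  }
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}
```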
`generateConversation()` produces a single multi-voice clip from an ordered array of turns, picking the best path automatically:
- Native dialogue — one provider with a multi-speaker endpoint (ElevenLabs v3, Gemini TTS, Hume Octave, Fish Audio S2-Pro, fal Dia). One API call, natural mix.
- Stitch fallback — multi-provider or no dialogue endpoint. Runs turns in parallel, RMS-levels each, inserts silence, returns a single WAV.
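The stitch path can be illustrated at the PCM level: lay each turn's samples into one buffer with `gapMs` of silence between turns. This is a sketch of the idea, not the SDK's implementation:

```typescript
// Illustration of the stitch path: concatenate per-turn PCM samples with
// gapMs of silence between turns (a zero-filled Float32Array is silence).
function stitchTurns(
  turns: Float32Array[],
  sampleRate: number,
  gapMs: number,
): Float32Array {
  const gap = Math.round((gapMs / 1000) * sampleRate);
  const total =
    turns.reduce((n, t) => n + t.length, 0) + gap * Math.max(0, turns.length - 1);
  const out = new Float32Array(total); // already zero-filled
  let offset = 0;
  turns.forEach((t, i) => {
    out.set(t, offset);
    offset += t.length + (i < turns.length - 1 ? gap : 0);
  });
  return out;
}
```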
```ts
import { generateConversation } from '@speech-sdk/core/conversation';

const result = await generateConversation({
  turns: [
    { model: 'openai/tts-1', voice: 'nova', text: "Hi, I'm hosted by OpenAI." },
    { model: 'elevenlabs/eleven_multilingual_v2', voice: 'JBFqnCBsd6RMkjVDRZzb', text: "And I'm hosted by ElevenLabs." },
    { model: 'hume/octave-2', voice: 'Kora', text: "I'm Hume Octave. Thanks for listening." },
  ],
});
```

Options: `gapMs` (default 300), `normalizeVolume` (default `true`), `volumeDbfs` (default -20), `maxConcurrency` (default 6), `maxRetries` (default 2), `timestamps`, `timestampProvider`, `apiKey`, `providerOptions`, `abortSignal`, `headers`. Per-turn overrides: `model`, `providerOptions` (stitch path only — throws `ConversationInputError` on native).
Native dialogue caps:
| Provider | Models | Voice constraints |
|---|---|---|
| ElevenLabs | `eleven_v3` | 1–10 voices, ≤ 2,000 chars |
| Google | `gemini-2.5-{flash,pro}-preview-tts`, `gemini-3.1-flash-tts-preview` | Exactly 2 voices |
| Hume | `octave-1`, `octave-2` | 1–4 voices |
| Fish Audio | `s2-pro` | 1–4 voices |
Pass `timestamps` to get word-level alignment. Timings are in seconds from the start of the audio.
```ts
const result = await generateSpeech({
  model: 'elevenlabs/eleven_multilingual_v2',
  text: 'Hello from speech-sdk!',
  voice: 'JBFqnCBsd6RMkjVDRZzb',
  timestamps: 'on',
});

result.timestamps;
// [
//   { text: "Hello", start: 0.00, end: 0.32 },
//   { text: "from", start: 0.36, end: 0.55 },
//   ...
// ]
```

| Mode | Behavior |
|---|---|
| `"auto"` (default) | Return timestamps only if the provider supplies them natively. Free. |
| `"on"` | Always return timestamps. Uses native alignment when available; otherwise transcribes the audio via STT (extra cost + latency). |
| `"off"` | Never return timestamps. |
On "on", the fallback defaults to OpenAI Whisper (openai/whisper-1, needs OPENAI_API_KEY). Override by constructing a ResolvedSTTModel via a factory and passing it as timestampProvider:
```ts
import { createOpenAISTT } from '@speech-sdk/core/stt/openai';

await generateSpeech({
  model: 'cartesia/sonic-3',
  text: 'Hello!',
  voice: 'voice-id',
  timestamps: 'on',
  timestampProvider: createOpenAISTT({ apiKey: process.env.MY_WHISPER_KEY })('whisper-1'),
});
```

Per-provider support:
| Provider | Timestamps |
|---|---|
| ElevenLabs (`eleven_v3`, `eleven_multilingual_v2`, `eleven_flash_v2`, `eleven_flash_v2_5`) | Native — returned in the TTS response, free on `"auto"` |
| Murf (`GEN2`) | Native — `wordDurations` returned in the TTS response, free on `"auto"` (FALCON streaming model has no native alignment) |
| Hume (`octave-2`) | Native — word alignment from the JSON `/v0/tts` endpoint, free on `"auto"` (`octave-1` has no native alignment) |
| Inworld (`inworld-tts-1.5-max`, `inworld-tts-1.5-mini`) | Native — `timestampInfo.wordAlignment` returned in the TTS response, free on `"auto"` (best on English/Spanish) |
| Cartesia (`sonic-3`, `sonic-2`) | Native — routed through `/tts/sse` with `add_timestamps: true`; merges interleaved `chunk` + `timestamps` events into audio + `WordTimestamp[]` |
| Resemble (`default`) | Native — `audio_timestamps` always returned by `/synthesize`; SDK aggregates grapheme-level timing into words (mirrors ElevenLabs aggregator) |
| All others (OpenAI, Deepgram, Google, Fish Audio, fal, Mistral, xAI) | No native alignment; `"on"` transcribes via the STT fallback, `"auto"` returns `undefined` |
`generateConversation` accepts the same options and returns a flat `WordTimestamp[]` across all turns — stitch-path timings are offset by cumulative turn duration plus gap.
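That offsetting can be sketched as follows; `offsetTimestamps` and the per-turn `durationSec` field are illustrative names, not the SDK's internals:

```typescript
interface WordTimestamp { text: string; start: number; end: number } // seconds

// Illustrative: shift each turn's word timings by the running total of all
// previous turns' durations plus the inter-turn gap.
function offsetTimestamps(
  turns: { durationSec: number; words: WordTimestamp[] }[],
  gapMs = 300,
): WordTimestamp[] {
  const out: WordTimestamp[] = [];
  let offset = 0;
  for (const turn of turns) {
    for (const w of turn.words) {
      out.push({ text: w.text, start: w.start + offset, end: w.end + offset });
    }
    offset += turn.durationSec + gapMs / 1000;
  }
  return out;
}
```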
Convert word-level timestamps into a caption file. SRT is the default; pass `format: 'vtt'` for WebVTT (required for HTML `<track>`).
```ts
import { generateSpeech, timestampsToCaptions } from '@speech-sdk/core';

const { timestamps } = await generateSpeech({
  model: 'elevenlabs/eleven_v3',
  text: 'Hello world. This is a test.',
  voice: 'JBFqnCBsd6RMkjVDRZzb',
  timestamps: 'on',
});

const srt = timestampsToCaptions(timestamps ?? []);
// 1
// 00:00:00,000 --> 00:00:01,200
// Hello world.
//
// 2
// 00:00:01,300 --> 00:00:02,800
// This is a test.

const vtt = timestampsToCaptions(timestamps ?? [], { format: 'vtt' });
// WEBVTT
//
// 1
// 00:00:00.000 --> 00:00:01.200
// Hello world.
//
// 2
// 00:00:01.300 --> 00:00:02.800
// This is a test.
```

Output follows the SubRip and W3C WebVTT conventions: comma-decimal (SRT) vs period-decimal (VTT) timestamps, sequential numeric cue IDs, blank-line cue separators with a trailing blank line, and HTML-escaped body text (`&`, `<`, `>`) on the VTT path.
Cues break on sentence boundaries (`.`, `!`, `?`), then subdivide long sentences by character count, cue duration, and soft comma breaks. Pass `CaptionsOptions` to customize `format`, `maxLineLength`, `maxLinesPerCue`, `maxCharsPerCue`, `maxCueDurationMs`, or `longPhraseCommaBreakChars`.
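The first stage of that cue-breaking, splitting on sentence-final punctuation, might look like this sketch (illustrative only; the library additionally subdivides long sentences by length, duration, and comma breaks):

```typescript
// Illustrative first stage of cue breaking: split text into sentences on
// ., !, ? followed by whitespace.
function splitSentences(text: string): string[] {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}
```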
Pass `volumeDbfs` to RMS-normalize to an absolute target loudness (must be ≤ 0; -20 is the broadcast/podcast convention).
```ts
const result = await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: 'Hello!',
  voice: 'alloy',
  volumeDbfs: -20,
});

result.audio.mediaType; // "audio/wav" — re-encoded after normalization
```

`generateConversation` normalizes by default. Pass `normalizeVolume: false` to skip. Throws `VolumeAdjustmentUnsupportedError` if the provider has no decodable PCM/WAV mode.
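The underlying math is simple: measure the signal's RMS level in dBFS (full scale = 1.0) and apply the linear gain that reaches the target. A sketch of the computation, not the SDK's code; it assumes a non-silent input:

```typescript
// Sketch of RMS normalization math (not the SDK's implementation).
// Full scale = 1.0, so rmsDbfs(signal) <= 0 for in-range audio.
function rmsDbfs(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return 20 * Math.log10(Math.sqrt(sum / samples.length));
}

function normalizeToDbfs(samples: Float32Array, targetDbfs: number): Float32Array {
  // Linear gain that moves the current RMS level to the target level.
  const gain = Math.pow(10, (targetDbfs - rmsDbfs(samples)) / 20);
  return samples.map((s) => s * gain);
}
```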
Bracket syntax `[tag]` adds expressive cues. Unsupported tags are stripped, with warnings in `result.warnings`.
```ts
await generateSpeech({
  model: 'elevenlabs/eleven_v3',
  text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
  voice: 'voice-id',
});
```

| Provider | Behavior |
|---|---|
| OpenAI (`gpt-4o-mini-tts`) | Mapped to the `instructions` field |
| ElevenLabs (`eleven_v3`) | Passed through natively |
| Google (`gemini-3.1-flash-tts-preview`) | Passed through natively |
| Cartesia (`sonic-3`) | Emotion tags → SSML; `[laughter]` passed through; unknown stripped |
| All others | Stripped with warnings |
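Stripping unsupported tags while collecting warnings might look like this sketch (illustrative; the SDK's actual tag list and matching rules may differ):

```typescript
// Illustrative sketch: remove bracketed audio tags the target provider does
// not support, collecting one warning per removed tag.
function stripUnsupportedTags(
  text: string,
  supported: Set<string>,
): { text: string; warnings: string[] } {
  const warnings: string[] = [];
  const cleaned = text.replace(/\[([a-z_ ]+)\]\s*/gi, (match, tag: string) => {
    if (supported.has(tag.toLowerCase())) return match; // keep supported tags
    warnings.push(`Unsupported audio tag removed: [${tag}]`);
    return '';
  });
  return { text: cleaned.trim(), warnings };
}
```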
Some providers support reference-audio cloning. Pass a `voice` object instead of a string.
```ts
import { createMistral } from '@speech-sdk/core/mistral';
import { createFal } from '@speech-sdk/core/fal-ai';

// Base64 reference:
await generateSpeech({
  model: createMistral()(),
  text: 'Hello!',
  voice: { audio: 'base64-encoded-audio...' },
});

// URL reference:
await generateSpeech({
  model: createFal()('fal-ai/f5-tts'),
  text: 'Hello!',
  voice: { url: 'https://example.com/reference.wav' },
});
```

Factory functions give you custom API keys, base URLs, or fetch implementations:
```ts
import { generateSpeech } from '@speech-sdk/core';
import { createOpenAI } from '@speech-sdk/core/openai';

const myOpenAI = createOpenAI({
  apiKey: 'sk-...',
  baseURL: 'https://my-proxy.com/v1',
});

await generateSpeech({
  model: myOpenAI('gpt-4o-mini-tts'),
  text: 'Hello!',
  voice: 'alloy',
});
```

```ts
generateSpeech({
  model: string | ResolvedModel,          // required
  text: string,                           // required
  voice: Voice,                           // required — string | { url } | { audio }
  providerOptions?: object,
  volumeDbfs?: number,                    // ≤ 0
  timestamps?: "on" | "auto" | "off",     // default "auto"
  timestampProvider?: ResolvedSTTModel,   // override the STT fallback
  maxRetries?: number,                    // default 2
  abortSignal?: AbortSignal,
  headers?: Record<string, string>,
}): Promise<SpeechResult>

interface SpeechResult {
  audio: { uint8Array: Uint8Array; base64: string; mediaType: string };
  metadata: { latencyMs: number; inputChars: number; provider: string; model: string; audioDurationMs?: number; ttfbMs?: number };
  timestamps?: WordTimestamp[];
  providerMetadata?: Record<string, unknown>;
  warnings?: string[];
}

interface WordTimestamp { text: string; start: number; end: number } // seconds
```

```ts
import { generateSpeech, ApiError } from '@speech-sdk/core';

try {
  await generateSpeech({ /* ... */ });
} catch (error) {
  if (error instanceof ApiError) {
    error.statusCode;   // 401, 429, 500, ...
    error.model;        // "openai/gpt-4o-mini-tts"
    error.responseBody;
  }
}
```

Retries 5xx and network errors with exponential backoff (p-retry); does not retry 4xx. Default is 2 retries; override via `maxRetries`.
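That retry policy can be sketched without p-retry as follows (illustrative, not the SDK's code):

```typescript
// Illustrative sketch of the retry policy: retry only errors the caller deems
// retriable (5xx / network), with exponential backoff, up to maxRetries times.
async function withRetries<T>(
  fn: () => Promise<T>,
  opts: {
    maxRetries?: number;   // default 2, matching the SDK's default
    baseDelayMs?: number;
    isRetriable?: (e: unknown) => boolean;
  } = {},
): Promise<T> {
  const { maxRetries = 2, baseDelayMs = 100, isRetriable = () => true } = opts;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (e) {
      if (attempt >= maxRetries || !isRetriable(e)) throw e;
      // Exponential backoff: base, 2x base, 4x base, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```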
```bash
pnpm install
pnpm test           # unit tests
pnpm run test:e2e   # e2e tests (requires provider API keys)
pnpm run typecheck
pnpm fix            # format + lint
```

E2E tests hit real provider APIs. Set the relevant keys in `.env` or export them. Set `SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos` to write conversation e2e audio to disk.
