A lightweight, provider-agnostic TypeScript SDK for text-to-speech. One API, 13 providers, zero lock-in. Runs in Node.js, Edge runtimes, and the browser.
Learn more at speechsdk.dev.
- **Universal** — `generateSpeech()` works across OpenAI, ElevenLabs, Deepgram, Cartesia, Hume, Google Gemini TTS, Fish Audio, Inworld, Murf, Resemble, fal, Mistral, and xAI.
- **Streaming** — `streamSpeech()` returns a standard `ReadableStream<Uint8Array>`.
- **Conversations** — `generateConversation()` produces multi-speaker audio, using native dialogue endpoints when available and stitching locally when not.
- **Word-level timestamps** — `timestamps: "on"` returns alignment, using the provider's native data or falling back to STT.
- **Volume normalization** — RMS-levels outputs to an absolute loudness target.
- **Audio tags & voice cloning** — `[laugh]`, `[sigh]`, emotion cues; reference-audio cloning where supported.
- Install · Quick start · Supported providers
- Streaming · Conversations · Timestamps
- Volume normalization · Audio tags · Voice cloning
- Custom configuration · API reference · Error handling · Development
```bash
npm install @speech-sdk/core
```

> [!TIP]
> Using an AI coding assistant? Add the speech-sdk skill to give it full knowledge of this library: `npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk`.
```ts
import { generateSpeech } from '@speech-sdk/core';

const result = await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: 'Hello from speech-sdk!',
  voice: 'alloy',
});

result.audio.uint8Array; // Uint8Array
result.audio.base64;     // string (lazy)
result.audio.mediaType;  // "audio/mpeg"
```

Pass a provider/model string, or just the provider name to use its default model. API keys are read from env vars automatically.
| Provider | Prefix | Default model | Env var |
|---|---|---|---|
| OpenAI | `openai` | `gpt-4o-mini-tts` | `OPENAI_API_KEY` |
| ElevenLabs | `elevenlabs` | `eleven_multilingual_v2` | `ELEVENLABS_API_KEY` |
| Deepgram | `deepgram` | `aura-2` | `DEEPGRAM_API_KEY` |
| Cartesia | `cartesia` | `sonic-3` | `CARTESIA_API_KEY` |
| Hume | `hume` | `octave-2` | `HUME_API_KEY` |
| Inworld | `inworld` | `inworld-tts-1.5-max` | `INWORLD_API_KEY` |
| Google Gemini TTS | `google` | `gemini-2.5-flash-preview-tts` | `GOOGLE_API_KEY` |
| Fish Audio | `fish-audio` | `s2-pro` | `FISH_AUDIO_API_KEY` |
| Murf | `murf` | `GEN2` | `MURF_API_KEY` |
| Resemble | `resemble` | `default` | `RESEMBLE_API_KEY` |
| fal | `fal-ai` | (user-specified) | `FAL_API_KEY` |
| Mistral | `mistral` | `voxtral-mini-tts-2603` | `MISTRAL_API_KEY` |
| xAI | `xai` | `grok-tts` | `XAI_API_KEY` |
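The resolution rule above (a full `provider/model` string, or a bare provider name that falls back to that provider's default model) can be sketched as follows; `parseModelId` and the abridged `DEFAULT_MODELS` map are illustrative names, not the SDK's internals:

```typescript
// Illustrative sketch of the "provider/model" resolution rule.
// Abridged map of per-provider defaults from the table above.
const DEFAULT_MODELS: Record<string, string> = {
  openai: 'gpt-4o-mini-tts',
  elevenlabs: 'eleven_multilingual_v2',
  deepgram: 'aura-2',
};

function parseModelId(id: string): { provider: string; model: string } {
  const slash = id.indexOf('/');
  if (slash === -1) {
    // Bare provider name: fall back to its default model.
    const model = DEFAULT_MODELS[id];
    if (!model) throw new Error(`Unknown provider: ${id}`);
    return { provider: id, model };
  }
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}
```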
Provider-specific parameters pass through via `providerOptions`, using each API's native field names.
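For instance, a hypothetical request forwarding ElevenLabs' own `voice_settings` field (a field from ElevenLabs' public API, not a key the SDK defines) might look like:

```typescript
// Hypothetical call shape: providerOptions is forwarded verbatim, so the keys
// use the provider's own field names (voice_settings is ElevenLabs').
const request = {
  model: 'elevenlabs/eleven_multilingual_v2',
  text: 'Hello!',
  voice: 'JBFqnCBsd6RMkjVDRZzb',
  providerOptions: {
    voice_settings: { stability: 0.5, similarity_boost: 0.75 },
  },
};
// await generateSpeech(request);
```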
`streamSpeech()` returns audio incrementally as a `ReadableStream<Uint8Array>`.
```ts
import { streamSpeech } from '@speech-sdk/core';

const { audio, mediaType } = await streamSpeech({
  model: 'cartesia/sonic-3',
  text: 'Streaming straight to the client.',
  voice: 'voice-id',
});

// Forward to an HTTP response:
return new Response(audio, { headers: { 'Content-Type': mediaType } });
```

> [!NOTE]
> Retries apply only until response headers arrive; mid-stream errors propagate to the consumer. Calling `streamSpeech()` on a non-streaming model throws `StreamingNotSupportedError`.
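Because the stream is a standard `ReadableStream<Uint8Array>`, buffering it (say, to write the audio to disk) needs only a generic helper. A sketch, not part of the SDK:

```typescript
// Generic helper (not part of the SDK): drain a ReadableStream<Uint8Array>
// into one contiguous buffer.
async function collectStream(stream: ReadableStream<Uint8Array>): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  const reader = stream.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done || value === undefined) break;
    chunks.push(value);
  }
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}
```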
`generateConversation()` produces a single multi-voice clip from an ordered array of turns, picking the best path automatically:
- Native dialogue — one provider with a multi-speaker endpoint (ElevenLabs v3, Gemini TTS, Hume Octave, Fish Audio S2-Pro, fal Dia). One API call, natural mix.
- Stitch fallback — multi-provider or no dialogue endpoint. Runs turns in parallel, RMS-levels each, inserts silence, returns a single WAV.
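The stitch path can be illustrated at the PCM level: lay each turn's samples into one buffer with `gapMs` of silence between turns. This is a sketch of the idea, not the SDK's implementation:

```typescript
// Illustration of the stitch path: concatenate per-turn PCM samples with
// gapMs of silence between turns (a zero-filled Float32Array is silence).
function stitchTurns(
  turns: Float32Array[],
  sampleRate: number,
  gapMs: number,
): Float32Array {
  const gap = Math.round((gapMs / 1000) * sampleRate);
  const total =
    turns.reduce((n, t) => n + t.length, 0) + gap * Math.max(0, turns.length - 1);
  const out = new Float32Array(total); // already zero-filled
  let offset = 0;
  turns.forEach((t, i) => {
    out.set(t, offset);
    offset += t.length + (i < turns.length - 1 ? gap : 0);
  });
  return out;
}
```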
```ts
import { generateConversation } from '@speech-sdk/core/conversation';

const result = await generateConversation({
  turns: [
    { model: 'openai/tts-1', voice: 'nova', text: "Hi, I'm hosted by OpenAI." },
    { model: 'elevenlabs/eleven_multilingual_v2', voice: 'JBFqnCBsd6RMkjVDRZzb', text: "And I'm hosted by ElevenLabs." },
    { model: 'hume/octave-2', voice: 'Kora', text: "I'm Hume Octave. Thanks for listening." },
  ],
});
```

Options: `gapMs` (default 300), `normalizeVolume` (default `true`), `volumeDbfs` (default -20), `maxConcurrency` (default 6), `maxRetries` (default 2), `timestamps`, `timestampProvider`, `apiKey`, `providerOptions`, `abortSignal`, `headers`. Per-turn overrides: `model`, `providerOptions` (stitch path only — throws `ConversationInputError` on native).
Native dialogue caps:
| Provider | Models | Voice constraints |
|---|---|---|
| ElevenLabs | `eleven_v3` | 1–10 voices, ≤ 2,000 chars |
| Google | `gemini-2.5-{flash,pro}-preview-tts`, `gemini-3.1-flash-tts-preview` | Exactly 2 voices |
| Hume | `octave-1`, `octave-2` | 1–4 voices |
| Fish Audio | `s2-pro` | 1–4 voices |
Pass `timestamps` to get word-level alignment. Timings are in seconds from the start of the audio.
```ts
const result = await generateSpeech({
  model: 'elevenlabs/eleven_multilingual_v2',
  text: 'Hello from speech-sdk!',
  voice: 'JBFqnCBsd6RMkjVDRZzb',
  timestamps: 'on',
});

result.timestamps;
// [
//   { text: "Hello", start: 0.00, end: 0.32 },
//   { text: "from", start: 0.36, end: 0.55 },
//   ...
// ]
```

| Mode | Behavior |
|---|---|
| `"auto"` (default) | Return timestamps only if the provider supplies them natively. Free. |
| `"on"` | Always return timestamps. Uses native alignment when available; otherwise transcribes the audio via STT (extra cost + latency). |
| `"off"` | Never return timestamps. |
On "on", the fallback defaults to OpenAI Whisper (openai/whisper-1, needs OPENAI_API_KEY). Override by constructing a ResolvedSTTModel via a factory and passing it as timestampProvider:
```ts
import { createOpenAISTT } from '@speech-sdk/core/stt/openai';

await generateSpeech({
  model: 'cartesia/sonic-3',
  text: 'Hello!',
  voice: 'voice-id',
  timestamps: 'on',
  timestampProvider: createOpenAISTT({ apiKey: process.env.MY_WHISPER_KEY })('whisper-1'),
});
```

Per-provider support:
| Provider | Timestamps |
|---|---|
| ElevenLabs (`eleven_v3`, `eleven_multilingual_v2`, `eleven_flash_v2`, `eleven_flash_v2_5`) | Native — returned in the TTS response, free on `"auto"` |
| Murf (`GEN2`) | Native — `wordDurations` returned in the TTS response, free on `"auto"` (FALCON streaming model has no native alignment) |
| Hume (`octave-2`) | Native — word alignment from the JSON `/v0/tts` endpoint, free on `"auto"` (`octave-1` has no native alignment) |
| Inworld (`inworld-tts-1.5-max`, `inworld-tts-1.5-mini`) | Native — `timestampInfo.wordAlignment` returned in the TTS response, free on `"auto"` (best on English/Spanish) |
| Cartesia (`sonic-3`, `sonic-2`) | Native — routed through `/tts/sse` with `add_timestamps: true`; merges interleaved `chunk` + `timestamps` events into audio + `WordTimestamp[]` |
| Resemble (`default`) | Native — `audio_timestamps` always returned by `/synthesize`; SDK aggregates grapheme-level timing into words (mirrors ElevenLabs aggregator) |
| All others (OpenAI, Deepgram, Google, Fish Audio, fal, Mistral, xAI) | No native alignment; `"on"` transcribes via the STT fallback, `"auto"` returns `undefined` |
`generateConversation` accepts the same options and returns a flat `WordTimestamp[]` across all turns — stitch-path timings are offset by cumulative turn duration plus gap.
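That offsetting can be sketched as follows; `offsetTimestamps` and the per-turn `durationSec` field are illustrative names, not the SDK's internals:

```typescript
interface WordTimestamp { text: string; start: number; end: number } // seconds

// Illustrative: shift each turn's word timings by the running total of all
// previous turns' durations plus the inter-turn gap.
function offsetTimestamps(
  turns: { durationSec: number; words: WordTimestamp[] }[],
  gapMs = 300,
): WordTimestamp[] {
  const out: WordTimestamp[] = [];
  let offset = 0;
  for (const turn of turns) {
    for (const w of turn.words) {
      out.push({ text: w.text, start: w.start + offset, end: w.end + offset });
    }
    offset += turn.durationSec + gapMs / 1000;
  }
  return out;
}
```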
Convert word-level timestamps into a caption file. SRT is the default; pass `format: 'vtt'` for WebVTT (required for HTML `<track>`).
```ts
import { generateSpeech, timestampsToCaptions } from '@speech-sdk/core';

const { timestamps } = await generateSpeech({
  model: 'elevenlabs/eleven_v3',
  text: 'Hello world. This is a test.',
  voice: 'JBFqnCBsd6RMkjVDRZzb',
  timestamps: 'on',
});

const srt = timestampsToCaptions(timestamps ?? []);
// 1
// 00:00:00,000 --> 00:00:01,200
// Hello world.
//
// 2
// 00:00:01,300 --> 00:00:02,800
// This is a test.

const vtt = timestampsToCaptions(timestamps ?? [], { format: 'vtt' });
// WEBVTT
//
// 1
// 00:00:00.000 --> 00:00:01.200
// Hello world.
//
// 2
// 00:00:01.300 --> 00:00:02.800
// This is a test.
```

Output follows the SubRip and W3C WebVTT conventions: comma-decimal (SRT) vs period-decimal (VTT) timestamps, sequential numeric cue IDs, blank-line cue separators with a trailing blank line, and HTML-escaped body text (`&`, `<`, `>`) on the VTT path.
Cues break on sentence boundaries (`.`, `!`, `?`), then subdivide long sentences by character count, cue duration, and soft comma breaks. Pass `CaptionsOptions` to customize `format`, `maxLineLength`, `maxLinesPerCue`, `maxCharsPerCue`, `maxCueDurationMs`, or `longPhraseCommaBreakChars`.
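The first stage of that cue-breaking, splitting on sentence-final punctuation, might look like this sketch (illustrative only; the library additionally subdivides long sentences by length, duration, and comma breaks):

```typescript
// Illustrative first stage of cue breaking: split text into sentences on
// ., !, ? followed by whitespace.
function splitSentences(text: string): string[] {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}
```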
Pass `volumeDbfs` to RMS-normalize to an absolute target loudness (must be ≤ 0; -20 is the broadcast/podcast convention).
```ts
const result = await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: 'Hello!',
  voice: 'alloy',
  volumeDbfs: -20,
});

result.audio.mediaType; // "audio/wav" — re-encoded after normalization
```

`generateConversation` normalizes by default. Pass `normalizeVolume: false` to skip. Throws `VolumeAdjustmentUnsupportedError` if the provider has no decodable PCM/WAV mode.
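The underlying math is simple: measure the signal's RMS level in dBFS (full scale = 1.0) and apply the linear gain that reaches the target. A sketch of the computation, not the SDK's code; it assumes a non-silent input:

```typescript
// Sketch of RMS normalization math (not the SDK's implementation).
// Full scale = 1.0, so rmsDbfs(signal) <= 0 for in-range audio.
function rmsDbfs(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return 20 * Math.log10(Math.sqrt(sum / samples.length));
}

function normalizeToDbfs(samples: Float32Array, targetDbfs: number): Float32Array {
  // Linear gain that moves the current RMS level to the target level.
  const gain = Math.pow(10, (targetDbfs - rmsDbfs(samples)) / 20);
  return samples.map((s) => s * gain);
}
```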
Bracket syntax `[tag]` adds expressive cues. Unsupported tags are stripped, with warnings in `result.warnings`.
```ts
await generateSpeech({
  model: 'elevenlabs/eleven_v3',
  text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
  voice: 'voice-id',
});
```

| Provider | Behavior |
|---|---|
| OpenAI (`gpt-4o-mini-tts`) | Mapped to the `instructions` field |
| ElevenLabs (`eleven_v3`) | Passed through natively |
| Google (`gemini-3.1-flash-tts-preview`) | Passed through natively |
| Cartesia (`sonic-3`) | Emotion tags → SSML; `[laughter]` passed through; unknown stripped |
| All others | Stripped with warnings |
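Stripping unsupported tags while collecting warnings might look like this sketch (illustrative; the SDK's actual tag list and matching rules may differ):

```typescript
// Illustrative sketch: remove bracketed audio tags the target provider does
// not support, collecting one warning per removed tag.
function stripUnsupportedTags(
  text: string,
  supported: Set<string>,
): { text: string; warnings: string[] } {
  const warnings: string[] = [];
  const cleaned = text.replace(/\[([a-z_ ]+)\]\s*/gi, (match, tag: string) => {
    if (supported.has(tag.toLowerCase())) return match; // keep supported tags
    warnings.push(`Unsupported audio tag removed: [${tag}]`);
    return '';
  });
  return { text: cleaned.trim(), warnings };
}
```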
Some providers support reference-audio cloning. Pass a `voice` object instead of a string.
```ts
import { createMistral } from '@speech-sdk/core/mistral';
import { createFal } from '@speech-sdk/core/fal-ai';

// Base64 reference:
await generateSpeech({
  model: createMistral()(),
  text: 'Hello!',
  voice: { audio: 'base64-encoded-audio...' },
});

// URL reference:
await generateSpeech({
  model: createFal()('fal-ai/f5-tts'),
  text: 'Hello!',
  voice: { url: 'https://example.com/reference.wav' },
});
```

Factory functions give you custom API keys, base URLs, or fetch implementations:
```ts
import { generateSpeech } from '@speech-sdk/core';
import { createOpenAI } from '@speech-sdk/core/openai';

const myOpenAI = createOpenAI({
  apiKey: 'sk-...',
  baseURL: 'https://my-proxy.com/v1',
});

await generateSpeech({
  model: myOpenAI('gpt-4o-mini-tts'),
  text: 'Hello!',
  voice: 'alloy',
});
```

```ts
generateSpeech({
  model: string | ResolvedModel,          // required
  text: string,                           // required
  voice: Voice,                           // required — string | { url } | { audio }
  providerOptions?: object,
  volumeDbfs?: number,                    // ≤ 0
  timestamps?: "on" | "auto" | "off",     // default "auto"
  timestampProvider?: ResolvedSTTModel,   // override the STT fallback
  maxRetries?: number,                    // default 2
  abortSignal?: AbortSignal,
  headers?: Record<string, string>,
}): Promise<SpeechResult>

interface SpeechResult {
  audio: { uint8Array: Uint8Array; base64: string; mediaType: string };
  metadata: { latencyMs: number; inputChars: number; provider: string; model: string; audioDurationMs?: number; ttfbMs?: number };
  timestamps?: WordTimestamp[];
  providerMetadata?: Record<string, unknown>;
  warnings?: string[];
}

interface WordTimestamp { text: string; start: number; end: number } // seconds
```

```ts
import { generateSpeech, ApiError } from '@speech-sdk/core';

try {
  await generateSpeech({ /* ... */ });
} catch (error) {
  if (error instanceof ApiError) {
    error.statusCode;   // 401, 429, 500, ...
    error.model;        // "openai/gpt-4o-mini-tts"
    error.responseBody;
  }
}
```

Retries 5xx and network errors with exponential backoff (p-retry); does not retry 4xx. Default is 2 retries; override via `maxRetries`.
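That retry policy can be sketched without p-retry as follows (illustrative, not the SDK's code):

```typescript
// Illustrative sketch of the retry policy: retry only errors the caller deems
// retriable (5xx / network), with exponential backoff, up to maxRetries times.
async function withRetries<T>(
  fn: () => Promise<T>,
  opts: {
    maxRetries?: number;   // default 2, matching the SDK's default
    baseDelayMs?: number;
    isRetriable?: (e: unknown) => boolean;
  } = {},
): Promise<T> {
  const { maxRetries = 2, baseDelayMs = 100, isRetriable = () => true } = opts;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (e) {
      if (attempt >= maxRetries || !isRetriable(e)) throw e;
      // Exponential backoff: base, 2x base, 4x base, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```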
```bash
pnpm install
pnpm test           # unit tests
pnpm run test:e2e   # e2e tests (requires provider API keys)
pnpm run typecheck
pnpm fix            # format + lint
```

E2E tests hit real provider APIs. Set the relevant keys in `.env` or export them. Set `SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos` to write conversation e2e audio to disk.
