Speech SDK


A lightweight, provider-agnostic TypeScript SDK for text-to-speech. One API, 13 providers, zero lock-in. Runs in Node.js, Edge runtimes, and the browser.


Learn more at speechsdk.dev.

Features

  • Universal — generateSpeech() works across OpenAI, ElevenLabs, Deepgram, Cartesia, Hume, Google Gemini TTS, Fish Audio, Inworld, Murf, Resemble, fal, Mistral, and xAI.
  • Streaming — streamSpeech() returns a standard ReadableStream<Uint8Array>.
  • Conversations — generateConversation() produces multi-speaker audio, using native dialogue endpoints when available and stitching locally when not.
  • Word-level timestamps — timestamps: "on" returns alignment, using the provider's native data or falling back to STT.
  • Volume normalization — RMS-levels outputs to an absolute loudness target.
  • Audio tags & voice cloning — [laugh], [sigh], emotion cues; reference-audio cloning where supported.

Install

npm install @speech-sdk/core

Tip

Using an AI coding assistant? Add the speech-sdk skill to give it full knowledge of this library: npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk.

Quick start

import { generateSpeech } from '@speech-sdk/core';

const result = await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: 'Hello from speech-sdk!',
  voice: 'alloy',
});

result.audio.uint8Array;  // Uint8Array
result.audio.base64;      // string (lazy)
result.audio.mediaType;   // "audio/mpeg"

Pass a provider/model string, or just the provider name to use its default model. API keys are read from env vars automatically.
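For example, the provider-only shorthand resolves to that provider's default model from the table below (a minimal sketch; assumes OPENAI_API_KEY is set in the environment):

```typescript
import { generateSpeech } from '@speech-sdk/core';

// 'openai' alone resolves to the provider's default model, gpt-4o-mini-tts.
const result = await generateSpeech({
  model: 'openai',
  text: 'Shorthand works too.',
  voice: 'alloy',
});
```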

Supported providers

Provider Prefix Default model Env var
OpenAI openai gpt-4o-mini-tts OPENAI_API_KEY
ElevenLabs elevenlabs eleven_multilingual_v2 ELEVENLABS_API_KEY
Deepgram deepgram aura-2 DEEPGRAM_API_KEY
Cartesia cartesia sonic-3 CARTESIA_API_KEY
Hume hume octave-2 HUME_API_KEY
Inworld inworld inworld-tts-1.5-max INWORLD_API_KEY
Google Gemini TTS google gemini-2.5-flash-preview-tts GOOGLE_API_KEY
Fish Audio fish-audio s2-pro FISH_AUDIO_API_KEY
Murf murf GEN2 MURF_API_KEY
Resemble resemble default RESEMBLE_API_KEY
fal fal-ai (user-specified) FAL_API_KEY
Mistral mistral voxtral-mini-tts-2603 MISTRAL_API_KEY
xAI xai grok-tts XAI_API_KEY

Provider-specific parameters pass through via providerOptions using each API's native field names.
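For instance, OpenAI's TTS API accepts native speed and instructions fields. The exact providerOptions shape isn't documented here; this sketch assumes a flat pass-through of those fields:

```typescript
import { generateSpeech } from '@speech-sdk/core';

// `speed` and `instructions` are OpenAI's own field names, forwarded untouched.
await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: 'A little faster, please.',
  voice: 'alloy',
  providerOptions: { speed: 1.15, instructions: 'Speak briskly.' },
});
```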

Streaming

streamSpeech() returns audio incrementally as a ReadableStream<Uint8Array>.

import { streamSpeech } from '@speech-sdk/core';

const { audio, mediaType } = await streamSpeech({
  model: 'cartesia/sonic-3',
  text: 'Streaming straight to the client.',
  voice: 'voice-id',
});

// Forward to an HTTP response:
return new Response(audio, { headers: { 'Content-Type': mediaType } });

Note

Retries apply only until response headers arrive; mid-stream errors propagate to the consumer. Calling streamSpeech() on a non-streaming model throws StreamingNotSupportedError.
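A typical pattern is to fall back to a buffered call when the model has no streaming endpoint (a sketch; assumes StreamingNotSupportedError is exported from the core entrypoint alongside ApiError):

```typescript
import { generateSpeech, streamSpeech, StreamingNotSupportedError } from '@speech-sdk/core';

const request = { model: 'openai/gpt-4o-mini-tts', text: 'Hello!', voice: 'alloy' };

try {
  const { audio, mediaType } = await streamSpeech(request);
  // ...pipe `audio` onward as it arrives
} catch (err) {
  if (err instanceof StreamingNotSupportedError) {
    // Fall back to a single buffered response.
    const result = await generateSpeech(request);
    // ...use result.audio.uint8Array
  } else {
    throw err;
  }
}
```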

Conversations

generateConversation() produces a single multi-voice clip from an ordered array of turns, picking the best path automatically:

  • Native dialogue — one provider with a multi-speaker endpoint (ElevenLabs v3, Gemini TTS, Hume Octave, Fish Audio S2-Pro, fal Dia). One API call, natural mix.
  • Stitch fallback — multi-provider or no dialogue endpoint. Runs turns in parallel, RMS-levels each, inserts silence, returns a single WAV.

import { generateConversation } from '@speech-sdk/core/conversation';

const result = await generateConversation({
  turns: [
    { model: 'openai/tts-1',                     voice: 'nova',                 text: "Hi, I'm hosted by OpenAI." },
    { model: 'elevenlabs/eleven_multilingual_v2', voice: 'JBFqnCBsd6RMkjVDRZzb', text: "And I'm hosted by ElevenLabs." },
    { model: 'hume/octave-2',                    voice: 'Kora',                 text: "I'm Hume Octave. Thanks for listening." },
  ],
});

Options: gapMs (default 300), normalizeVolume (default true), volumeDbfs (default -20), maxConcurrency (default 6), maxRetries (default 2), timestamps, timestampProvider, apiKey, providerOptions, abortSignal, headers. Per-turn overrides: model, providerOptions (stitch path only — throws ConversationInputError on native).
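Putting a few of those options together (values here are illustrative, not recommendations):

```typescript
import { generateConversation } from '@speech-sdk/core/conversation';

const result = await generateConversation({
  turns: [
    { model: 'openai/tts-1', voice: 'nova', text: 'First turn.' },
    { model: 'openai/tts-1', voice: 'onyx', text: 'Second turn.' },
  ],
  gapMs: 500,          // half a second of silence between turns (default 300)
  volumeDbfs: -18,     // level every turn to -18 dBFS (default -20)
  maxConcurrency: 2,   // throttle parallel TTS calls (default 6)
});
```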

Native dialogue caps:

Provider Models Voice constraints
ElevenLabs eleven_v3 1–10 voices, ≤ 2,000 chars
Google gemini-2.5-{flash,pro}-preview-tts, gemini-3.1-flash-tts-preview Exactly 2 voices
Hume octave-1, octave-2 1–4 voices
Fish Audio s2-pro 1–4 voices

Timestamps

Pass timestamps to get word-level alignment. Timings are in seconds from the start of the audio.

const result = await generateSpeech({
  model: 'elevenlabs/eleven_multilingual_v2',
  text: 'Hello from speech-sdk!',
  voice: 'JBFqnCBsd6RMkjVDRZzb',
  timestamps: 'on',
});

result.timestamps;
// [
//   { text: "Hello",  start: 0.00, end: 0.32 },
//   { text: "from",   start: 0.36, end: 0.55 },
//   ...
// ]

Mode Behavior
"auto" (default) Return timestamps only if the provider supplies them natively. Free.
"on" Always return timestamps. Uses native alignment when available; otherwise transcribes the audio via STT (extra cost + latency).
"off" Never return timestamps.

On "on", the fallback defaults to OpenAI Whisper (openai/whisper-1, needs OPENAI_API_KEY). Override by constructing a ResolvedSTTModel via a factory and passing it as timestampProvider:

import { createOpenAISTT } from '@speech-sdk/core/stt/openai';

await generateSpeech({
  model: 'cartesia/sonic-3',
  text: 'Hello!',
  voice: 'voice-id',
  timestamps: 'on',
  timestampProvider: createOpenAISTT({ apiKey: process.env.MY_WHISPER_KEY })('whisper-1'),
});

Per-provider support:

Provider Timestamps
ElevenLabs (eleven_v3, eleven_multilingual_v2, eleven_flash_v2, eleven_flash_v2_5) Native — returned in the TTS response, free on "auto"
Murf (GEN2) Native — wordDurations returned in the TTS response, free on "auto" (FALCON streaming model has no native alignment)
Hume (octave-2) Native — word alignment from the JSON /v0/tts endpoint, free on "auto" (octave-1 has no native alignment)
Inworld (inworld-tts-1.5-max, inworld-tts-1.5-mini) Native — timestampInfo.wordAlignment returned in the TTS response, free on "auto" (best on English/Spanish)
Cartesia (sonic-3, sonic-2) Native — routed through /tts/sse with add_timestamps: true; merges interleaved chunk + timestamps events into audio + WordTimestamp[]
Resemble (default) Native — audio_timestamps always returned by /synthesize; SDK aggregates grapheme-level timing into words (mirrors ElevenLabs aggregator)
All others (OpenAI, Deepgram, Google, Fish Audio, fal, Mistral, xAI) No native alignment; "on" transcribes via the STT fallback, "auto" returns undefined

generateConversation accepts the same options and returns a flat WordTimestamp[] across all turns — stitch-path timings are offset by cumulative turn duration + gap.
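The stitch-path offsetting amounts to shifting each turn's word timings by the total audio (plus gaps) that precedes it. A standalone sketch of that bookkeeping, not the SDK's internal code:

```typescript
interface WordTimestamp { text: string; start: number; end: number } // seconds

// Flatten per-turn timestamps into one timeline: each turn's words are shifted
// by the cumulative duration of all earlier turns plus the inter-turn gaps.
function offsetTurnTimestamps(
  turns: { durationSec: number; timestamps: WordTimestamp[] }[],
  gapMs = 300,
): WordTimestamp[] {
  const out: WordTimestamp[] = [];
  let offset = 0;
  for (const turn of turns) {
    for (const w of turn.timestamps) {
      out.push({ text: w.text, start: w.start + offset, end: w.end + offset });
    }
    offset += turn.durationSec + gapMs / 1000;
  }
  return out;
}
```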

Captions (SRT / WebVTT)

Convert word-level timestamps into a caption file. SRT is the default; pass format: 'vtt' for WebVTT (required for HTML <track>).

import { generateSpeech, timestampsToCaptions } from '@speech-sdk/core';

const { timestamps } = await generateSpeech({
  model: 'elevenlabs/eleven_v3',
  text: 'Hello world. This is a test.',
  voice: 'JBFqnCBsd6RMkjVDRZzb',
  timestamps: 'on',
});

const srt = timestampsToCaptions(timestamps ?? []);
// 1
// 00:00:00,000 --> 00:00:01,200
// Hello world.
//
// 2
// 00:00:01,300 --> 00:00:02,800
// This is a test.

const vtt = timestampsToCaptions(timestamps ?? [], { format: 'vtt' });
// WEBVTT
//
// 1
// 00:00:00.000 --> 00:00:01.200
// Hello world.
//
// 2
// 00:00:01.300 --> 00:00:02.800
// This is a test.

Output follows the SubRip and W3C WebVTT conventions: comma-decimal (SRT) vs period-decimal (VTT) timestamps, sequential numeric cue IDs, blank-line cue separators with a trailing blank line, and HTML-escaped body text (&, <, >) on the VTT path.

Cues break on sentence boundaries (., !, ?), then subdivide long sentences by character count, cue duration, and soft comma breaks. Pass CaptionsOptions to customize format, maxLineLength, maxLinesPerCue, maxCharsPerCue, maxCueDurationMs, or longPhraseCommaBreakChars.
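As a rough illustration of the sentence-boundary pass only (the SDK's splitter additionally applies the length, duration, and comma rules above), words can be grouped into cues wherever one ends in ., !, or ?:

```typescript
interface WordTimestamp { text: string; start: number; end: number }

// Group consecutive words into one cue per sentence, closing the cue when a
// word terminates with ., ! or ?.
function splitIntoSentenceCues(words: WordTimestamp[]): WordTimestamp[][] {
  const cues: WordTimestamp[][] = [];
  let current: WordTimestamp[] = [];
  for (const w of words) {
    current.push(w);
    if (/[.!?]$/.test(w.text)) {
      cues.push(current);
      current = [];
    }
  }
  if (current.length > 0) cues.push(current); // trailing partial sentence
  return cues;
}
```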

Volume normalization

Pass volumeDbfs to RMS-normalize to an absolute target loudness (must be ≤ 0; -20 is the broadcast/podcast convention).

const result = await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: 'Hello!',
  voice: 'alloy',
  volumeDbfs: -20,
});

result.audio.mediaType;  // "audio/wav" — re-encoded after normalization

generateConversation normalizes by default. Pass normalizeVolume: false to skip. Throws VolumeAdjustmentUnsupportedError if the provider has no decodable PCM/WAV mode.

Audio tags

Bracket syntax [tag] adds expressive cues. Unsupported tags are stripped with warnings in result.warnings.

await generateSpeech({
  model: 'elevenlabs/eleven_v3',
  text: '[laugh] Oh that is so funny! [sigh] But seriously though.',
  voice: 'voice-id',
});

Provider Behavior
OpenAI (gpt-4o-mini-tts) Mapped to the instructions field
ElevenLabs (eleven_v3) Passed through natively
Google (gemini-3.1-flash-tts-preview) Passed through natively
Cartesia (sonic-3) Emotion tags → SSML; [laughter] passed through; unknown stripped
All others Stripped with warnings
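When a provider has no tag support, the call still succeeds and the stripped tags surface in result.warnings:

```typescript
import { generateSpeech } from '@speech-sdk/core';

// Deepgram has no tag support, so [laugh] is stripped and a warning is emitted.
const result = await generateSpeech({
  model: 'deepgram/aura-2',
  text: '[laugh] That was unexpected.',
  voice: 'voice-id',
});

for (const warning of result.warnings ?? []) {
  console.warn(warning);
}
```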

Voice cloning

Some providers support reference-audio cloning. Pass a voice object instead of a string.

import { createMistral } from '@speech-sdk/core/mistral';
import { createFal } from '@speech-sdk/core/fal-ai';

// Base64 reference:
await generateSpeech({
  model: createMistral()(),
  text: 'Hello!',
  voice: { audio: 'base64-encoded-audio...' },
});

// URL reference:
await generateSpeech({
  model: createFal()('fal-ai/f5-tts'),
  text: 'Hello!',
  voice: { url: 'https://example.com/reference.wav' },
});

Custom configuration

Factory functions give you custom API keys, base URLs, or fetch implementations:

import { generateSpeech } from '@speech-sdk/core';
import { createOpenAI } from '@speech-sdk/core/openai';

const myOpenAI = createOpenAI({
  apiKey: 'sk-...',
  baseURL: 'https://my-proxy.com/v1',
});

await generateSpeech({
  model: myOpenAI('gpt-4o-mini-tts'),
  text: 'Hello!',
  voice: 'alloy',
});
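The factories also accept a custom fetch implementation, useful for logging or routing through a proxy (a sketch; assumes the option is named fetch and matches the standard fetch signature):

```typescript
import { createOpenAI } from '@speech-sdk/core/openai';

// Wrap the global fetch to log every outgoing TTS request.
const loggingFetch: typeof fetch = async (input, init) => {
  console.log('TTS request:', input.toString());
  return fetch(input, init);
};

const instrumented = createOpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  fetch: loggingFetch,
});
```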

API reference

generateSpeech({
  model: string | ResolvedModel,          // required
  text: string,                           // required
  voice: Voice,                           // required — string | { url } | { audio }
  providerOptions?: object,
  volumeDbfs?: number,                    // ≤ 0
  timestamps?: "on" | "auto" | "off",     // default "auto"
  timestampProvider?: ResolvedSTTModel,   // override the STT fallback
  maxRetries?: number,                    // default 2
  abortSignal?: AbortSignal,
  headers?: Record<string, string>,
}): Promise<SpeechResult>

interface SpeechResult {
  audio: { uint8Array: Uint8Array; base64: string; mediaType: string };
  metadata: { latencyMs: number; inputChars: number; provider: string; model: string; audioDurationMs?: number; ttfbMs?: number };
  timestamps?: WordTimestamp[];
  providerMetadata?: Record<string, unknown>;
  warnings?: string[];
}

interface WordTimestamp { text: string; start: number; end: number }  // seconds

Error handling

import { generateSpeech, ApiError } from '@speech-sdk/core';

try {
  await generateSpeech({ /* ... */ });
} catch (error) {
  if (error instanceof ApiError) {
    error.statusCode;    // 401, 429, 500, ...
    error.model;         // "openai/gpt-4o-mini-tts"
    error.responseBody;
  }
}
Error When
ApiError Provider returned non-2xx
NoSpeechGeneratedError Empty input (after tag stripping) or empty provider response
StreamingNotSupportedError streamSpeech() on a non-streaming model
VolumeAdjustmentUnsupportedError volumeDbfs with no decodable output mode
TimestampKeyMissingError timestamps: "on" fallback key missing
ConversationInputError / DialogueConstraintError / StitchUnsupportedError generateConversation validation / native caps / stitch incompatibility
SpeechSDKError Base class

Retries 5xx and network errors with exponential backoff (p-retry); does not retry 4xx. Default 2 retries; override via maxRetries.
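Combining the retry override with abortSignal bounds the total time spent across attempts (values illustrative):

```typescript
import { generateSpeech } from '@speech-sdk/core';

const controller = new AbortController();
setTimeout(() => controller.abort(), 30_000); // give up after 30s total

await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: 'Hello!',
  voice: 'alloy',
  maxRetries: 5,                  // retry 5xx/network errors up to 5 times
  abortSignal: controller.signal, // aborts the in-flight attempt and any retries
});
```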

Development

pnpm install
pnpm test              # unit tests
pnpm run test:e2e      # e2e tests (requires provider API keys)
pnpm run typecheck
pnpm fix               # format + lint

E2E tests hit real provider APIs. Set the relevant keys in .env or export them. Set SPEECH_SDK_E2E_OUTPUT_DIR=~/Downloads/convos to write conversation e2e audio to disk.
