An AI-powered pipeline that processes audio files to extract dialogue, perform speaker diarization, attribute lines to characters, and generate summaries.
Created to extract dialogue from Dungeons & Dragons gameplay sessions and summarize them, but applicable to similar workflows.
- Audio Transcription: Convert audio files to text using Whisper
- Speaker Diarization: Identify different speakers in audio
- Dialogue Alignment: Merge transcripts with speaker information
- Character Attribution: Map dialogue lines to characters using an LLM
- Summarization: Generate scene summaries and beat sheets
- Vector Search: Index and query processed content
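The Dialogue Alignment step can be pictured as assigning each transcript segment to the diarized speaker turn it overlaps most in time. The following is a minimal sketch under that assumption, not the code in `app/align.py`; the `Segment` and `Turn` types, the `align` function, and the `UNKNOWN` fallback label are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed span of audio (times in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """One diarized speaker turn (times in seconds)."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def align(segments, turns, unknown="UNKNOWN"):
    """Label each segment with the speaker whose turn overlaps it the most."""
    aligned = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        aligned.append(
            {"start": seg.start, "end": seg.end,
             "speaker": best_speaker, "text": seg.text}
        )
    return aligned
```

Segments that overlap no speaker turn keep the fallback label, which makes gaps in the diarization easy to spot in the aligned output.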
├─ README.md
├─ .env # tokens + config
├─ requirements.txt
├─ data/
│ ├─ audio/ # drop WAV/MP3 here
│ ├─ transcripts/ # whisper JSON + TXT
│ ├─ diarization/ # speaker turns (RTTM/JSON)
│ ├─ aligned/ # transcript merged with speakers
│ ├─ attributed/ # character-attributed dialogue
│ └─ summaries/ # scene summaries/beat sheets
├─ chroma/ # vector store
├─ app/
│ ├─ cli.py # Typer CLI entrypoint
│ ├─ asr_whisper.py # transcription
│ ├─ diarize.py # speaker diarization
│ ├─ align.py # align ASR segments ↔ speakers
│ ├─ attribute.py # map lines to Characters via LLM
│ ├─ summarize.py # scene/episode summaries
│ ├─ embed_index.py # Chroma ingest + query
│ ├─ prompts.py # prompt templates
│ └─ utils.py # ffmpeg, io helpers, chunking
- FFmpeg (required for audio processing):
  # Ubuntu/Debian
  sudo apt install ffmpeg
  # macOS
  brew install ffmpeg
- Ollama (required for LLM inference):
  # Install Ollama
  curl -fsSL https://ollama.ai/install.sh | sh
  # Pull the model (20B parameter model recommended)
  ollama pull gpt-oss:20b
- Hugging Face Token (required for speaker diarization):
  - Create an account at https://huggingface.co
  - Get a read-only API token from your settings
  - Accept the license for pyannote/speaker-diarization-3.1
- Establish a virtual environment:
  python3 -m venv ~/starfire_venv
  source ~/starfire_venv/bin/activate
  pip install -r requirements.txt
- Copy .env-default into a new file, .env, then configure environment variables:
  cp .env-default .env
  # Edit .env and set your HF_TOKEN
- Place audio files in data/audio/
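Before running the pipeline, it can help to confirm that a file in data/audio/ is already in the 16 kHz, mono, 16-bit PCM WAV format that the ffmpeg conversion command in this README produces. The sketch below uses only Python's standard `wave` module; the function name `is_pipeline_ready` is hypothetical and not part of `app/`, and treating those three properties as the full requirement is an assumption based on the conversion flags.

```python
import wave


def is_pipeline_ready(path, rate=16000, channels=1, sample_width=2):
    """Return True if `path` is a PCM WAV with the expected sample rate,
    channel count, and sample width in bytes (2 bytes = 16-bit)."""
    try:
        with wave.open(str(path), "rb") as wav:
            return (
                wav.getframerate() == rate
                and wav.getnchannels() == channels
                and wav.getsampwidth() == sample_width
            )
    except (wave.Error, EOFError):
        # Not a readable PCM WAV file (for example, an MP3 renamed to .wav).
        return False
```

For example, `is_pipeline_ready("data/audio/Session_090123_01.wav")` returning False suggests the file should be re-encoded with ffmpeg first.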
If you have MP3 files, convert them to the required WAV format:
# Batch convert all MP3 files to WAV (16kHz mono)
for f in data/audio/*.mp3; do
  ffmpeg -i "$f" -ac 1 -ar 16000 -c:a pcm_s16le "${f%.mp3}.wav"
done

Run the CLI tool:
python app/cli.py --help

Process a D&D session from start to finish:
# 1. Transcribe audio to text
python -m app.cli transcribe data/audio/Session_090123_01.wav
# 2. Identify speakers in the audio
python -m app.cli diarize data/audio/Session_090123_01.wav
# 3. Align transcript with speaker information
python -m app.cli align Session01

Create a roster.json file to map speakers to characters:
{
  "dm": "Luke (DM)",
  "players": [
    {"name": "Jerome", "character": "Aguiar", "notes": "Human fighter"},
    {"name": "Nancy", "character": "Juniper", "notes": "Elf magic user"},
    {"name": "Chris", "character": "Starble", "notes": "Dwarf fighter"}
  ],
  "known_npcs": [
    {"name": "Glade", "notes": "Member of the party but controlled by the DM"},
    {"name": "Starla", "notes": "High charisma, older human female, innate storyteller character, lives outside of Drexville"}
  ],
  "tone": "Low magic, survival-focused campaign"
}

# 4. Attribute dialogue lines to characters
python -m app.cli attribute Session01 roster.json
# 5. Generate scene summaries
python -m app.cli summarize Session01
# 6. Index content for vector search
python -m app.cli index Session01
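
Since the attribute step depends on roster.json, a quick sanity check of that file before running step 4 may save a failed LLM run. The sketch below infers a schema from the example roster in this README: it assumes `dm` and `players` are required and that `known_npcs` and `tone` are optional. The actual requirements of `app/attribute.py` may differ, and `validate_roster` and `character_names` are hypothetical helpers, not part of the project.

```python
import json


def validate_roster(roster):
    """Return a list of human-readable problems; an empty list means the
    roster matches the shape of the README example."""
    if not isinstance(roster, dict):
        return ["roster must be a JSON object"]
    problems = []
    if not isinstance(roster.get("dm"), str) or not roster["dm"].strip():
        problems.append("'dm' must be a non-empty string")
    players = roster.get("players")
    if not isinstance(players, list) or not players:
        problems.append("'players' must be a non-empty list")
        players = []
    for i, player in enumerate(players):
        for key in ("name", "character"):
            if not isinstance(player, dict) or not player.get(key):
                problems.append(f"players[{i}] is missing '{key}'")
    npcs = roster.get("known_npcs", [])  # assumed optional
    if not isinstance(npcs, list):
        problems.append("'known_npcs' must be a list")
        npcs = []
    for i, npc in enumerate(npcs):
        if not isinstance(npc, dict) or not npc.get("name"):
            problems.append(f"known_npcs[{i}] is missing 'name'")
    return problems


def character_names(roster):
    """Collect player characters and known NPCs, in roster order."""
    names = [p["character"] for p in roster.get("players", [])]
    names += [n["name"] for n in roster.get("known_npcs", [])]
    return names
```

A typical use would be `validate_roster(json.load(open("roster.json")))`, printing any returned problems before invoking the attribute command.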