GPU-accelerated Text-to-Speech and Speech-to-Text template for GPU CLI.
Run state-of-the-art TTS with voice cloning and STT transcription on cloud GPUs from your Mac.
| Service | Model | License | Why |
|---|---|---|---|
| TTS | Chatterbox by Resemble AI | MIT | Voice cloning from ~5s of audio, emotion control, 23 languages |
| STT | faster-whisper | MIT | 4x faster than Whisper, same accuracy, INT8/FP16 quantization |
Both are MIT-licensed, so commercial use is unrestricted.
```shell
# Start combined TTS + STT server on a cloud GPU
gpu run

# Or run just the TTS server
gpu run python tts_server.py

# Or run just the STT server
gpu run python stt_server.py
```

The server starts on port 8000 (forwarded to your machine via the gpu CLI). Open http://localhost:8000 for the web UI, or use the CLI below.
The web frontend is served at / and provides:
- Text input with synthesize button (TTS)
- Microphone recording or file upload (STT)
- Voice reference upload for cloning
- Exaggeration and CFG weight sliders
- Interaction history with audio playback
```shell
# Interactive REPL (type text for TTS, /stt to record)
python cli.py

# One-shot TTS
python cli.py tts "Hello world" --voice samples/ref.wav -o output.wav

# One-shot STT (from file)
python cli.py stt recording.wav

# One-shot STT (record from mic, needs sox)
python cli.py stt --duration 5

# Health check
python cli.py health
```

```shell
# Basic synthesis
curl -X POST http://localhost:8000/tts/tts \
  -F "text=Hello, this is a test." \
  -o output.wav

# With voice cloning (provide a 5-10s reference WAV)
curl -X POST http://localhost:8000/tts/tts \
  -F "text=Hello, this is a test." \
  -F "voice=@samples/reference.wav" \
  -F "exaggeration=0.6" \
  -F "cfg_weight=0.5" \
  -o output.wav
```

```shell
curl -X POST http://localhost:8000/stt/stt \
  -F "audio=@recording.wav"
```

Returns:
```json
{
  "language": "en",
  "language_probability": 0.98,
  "text": "The transcribed text appears here.",
  "segments": [
    {"start": 0.0, "end": 2.5, "text": "The transcribed text"},
    {"start": 2.5, "end": 4.1, "text": "appears here."}
  ]
}
```

All settings are configurable via environment variables. Set them in your shell or pass them through the startup command in gpu.jsonc.
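The STT response shown above is plain JSON, so it is easy to post-process; a quick sketch in Python using the sample payload verbatim (joining segment text and computing the clip duration):

```python
import json

# Sample /stt/stt response (copied from the example above)
payload = json.loads("""
{
  "language": "en",
  "language_probability": 0.98,
  "text": "The transcribed text appears here.",
  "segments": [
    {"start": 0.0, "end": 2.5, "text": "The transcribed text"},
    {"start": 2.5, "end": 4.1, "text": "appears here."}
  ]
}
""")

# Rebuild the transcript from segments and compute total duration
transcript = " ".join(s["text"] for s in payload["segments"])
duration = payload["segments"][-1]["end"] - payload["segments"][0]["start"]
print(transcript)  # The transcribed text appears here.
print(duration)    # 4.1
```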
| Variable | Default | Description |
|---|---|---|
| TTS_MODEL | chatterbox | chatterbox (standard) or chatterbox-turbo (faster, lower latency) |
| TTS_EXAGGERATION | 0.5 | Emotion intensity: 0 = monotone, 1 = very expressive |
| TTS_CFG_WEIGHT | 0.5 | Voice cloning fidelity (higher = closer to reference) |
| STT_MODEL | large-v3-turbo | Whisper model size: tiny, base, small, medium, large-v3, large-v3-turbo |
| STT_COMPUTE_TYPE | float16 | float16, int8, or float32 |
| STT_BEAM_SIZE | 5 | Beam size for decoding (higher = more accurate, slower) |
| PORT | 8000 | Server port |

```shell
gpu run TTS_MODEL=chatterbox-turbo python server.py
```

```
gpu-tts/
  gpu.jsonc         # GPU CLI config (GPU selection, environment, ports)
  config.py         # All configurable settings (env vars)
  tts_server.py     # TTS API (Chatterbox)
  stt_server.py     # STT API (faster-whisper)
  server.py         # Combined TTS + STT server (serves web UI)
  cli.py            # CLI for testing TTS/STT from terminal
  index.html        # Web UI frontend
  requirements.txt  # Python dependencies
  samples/          # Drop voice reference WAVs here
```
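config.py presumably follows the standard env-var-with-default pattern; a sketch of what that looks like (variable names and defaults are taken from the table above, but the real file may structure things differently):

```python
import os

# Read each setting from the environment, falling back to the
# documented default when the variable is unset.
TTS_MODEL = os.environ.get("TTS_MODEL", "chatterbox")
TTS_EXAGGERATION = float(os.environ.get("TTS_EXAGGERATION", "0.5"))
TTS_CFG_WEIGHT = float(os.environ.get("TTS_CFG_WEIGHT", "0.5"))
STT_MODEL = os.environ.get("STT_MODEL", "large-v3-turbo")
STT_COMPUTE_TYPE = os.environ.get("STT_COMPUTE_TYPE", "float16")
STT_BEAM_SIZE = int(os.environ.get("STT_BEAM_SIZE", "5"))
PORT = int(os.environ.get("PORT", "8000"))
```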
- Place a 5-10 second WAV of the target voice in `samples/`
- Pass it as the `voice` field when calling `/tts/tts`
- Adjust `cfg_weight` (0-1) to control how closely the output matches the reference
- Adjust `exaggeration` (0-1) to control emotional expressiveness
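Before uploading, it can be worth confirming a reference clip actually falls in the 5-10 second window; a stdlib-only sketch (the silent demo clip and the `ref_check.wav` filename are placeholders for illustration, not part of the template):

```python
import wave

def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Demo: write a 7-second silent mono 16 kHz clip, then check its length
with wave.open("ref_check.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000 * 7)

d = wav_duration("ref_check.wav")
print(round(d, 2))   # 7.0
print(5 <= d <= 10)  # True
```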
- GPU: RTX 4090 (24GB VRAM) at ~$0.44/hr
- Both Chatterbox (~8GB) and faster-whisper large-v3-turbo (~6GB) fit comfortably
- Auto-stops after 10 minutes idle
If Chatterbox doesn't fit your needs, these also have permissive licenses:
| Model | License | Notes |
|---|---|---|
| Kokoro (82M) | Apache 2.0 | Extremely lightweight, preset voices only (no cloning), 9 languages |
| Dia2 (2B) | Apache 2.0 | Multi-speaker dialogue, streaming, great for conversational AI |
| MeloTTS | MIT | CPU-optimized, no voice cloning, 6 languages |
Models to avoid (restrictive licenses): XTTS/Coqui (non-commercial), Fish Speech (weights are CC-BY-NC), MARS5 (AGPL), Piper (GPL).
MIT