
# gpu-tts

GPU-accelerated Text-to-Speech and Speech-to-Text template for GPU CLI.

Run state-of-the-art TTS with voice cloning and STT transcription on cloud GPUs from your Mac.

## Models

| Service | Model | License | Why |
| --- | --- | --- | --- |
| TTS | Chatterbox by Resemble AI | MIT | Voice cloning from ~5 s of audio, emotion control, 23 languages |
| STT | faster-whisper | MIT | 4x faster than Whisper at comparable accuracy, with INT8/FP16 quantization |

Both models are MIT-licensed and fully permissive for commercial use.

## Quick Start

```shell
# Start the combined TTS + STT server on a cloud GPU
gpu run

# Or run just the TTS server
gpu run python tts_server.py

# Or run just the STT server
gpu run python stt_server.py
```

The server starts on port 8000, forwarded to your local machine by the gpu CLI.

Open http://localhost:8000 for the web UI, or use the CLI below.

## Web UI

The web frontend is served at `/` and provides:

- Text input with a synthesize button (TTS)
- Microphone recording or file upload (STT)
- Voice reference upload for cloning
- Exaggeration and CFG weight sliders
- Interaction history with audio playback

## CLI

```shell
# Interactive REPL (type text for TTS, /stt to record)
python cli.py

# One-shot TTS
python cli.py tts "Hello world" --voice samples/ref.wav -o output.wav

# One-shot STT (from file)
python cli.py stt recording.wav

# One-shot STT (record from mic, requires sox)
python cli.py stt --duration 5

# Health check
python cli.py health
```

## API

### Text-to-Speech

```shell
# Basic synthesis
curl -X POST http://localhost:8000/tts/tts \
  -F "text=Hello, this is a test." \
  -o output.wav

# With voice cloning (provide a 5-10 s reference WAV)
curl -X POST http://localhost:8000/tts/tts \
  -F "text=Hello, this is a test." \
  -F "voice=@samples/reference.wav" \
  -F "exaggeration=0.6" \
  -F "cfg_weight=0.5" \
  -o output.wav
```
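The same endpoint can be called from Python. A minimal client sketch using `requests` — the `BASE_URL`, `build_tts_request`, and `synthesize` names are illustrative, not part of the project's own code:

```python
# Minimal Python client for the /tts/tts endpoint (illustrative sketch).

BASE_URL = "http://localhost:8000"  # port forwarded by the gpu CLI

def build_tts_request(text, voice_path=None, exaggeration=0.5, cfg_weight=0.5):
    """Build the multipart form fields that /tts/tts accepts."""
    data = {
        "text": text,
        "exaggeration": str(exaggeration),
        "cfg_weight": str(cfg_weight),
    }
    files = {}
    if voice_path is not None:
        # A 5-10 s reference WAV switches on voice cloning
        files["voice"] = open(voice_path, "rb")
    return data, files

def synthesize(text, out_path="output.wav", **kwargs):
    """POST to /tts/tts and write the returned WAV bytes to out_path."""
    import requests  # imported lazily so the payload helper has no dependency

    data, files = build_tts_request(text, **kwargs)
    resp = requests.post(f"{BASE_URL}/tts/tts", data=data, files=files or None)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)
    return out_path

# Usage (server must be running):
# synthesize("Hello, this is a test.", voice_path="samples/reference.wav")
```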

### Speech-to-Text

```shell
curl -X POST http://localhost:8000/stt/stt \
  -F "audio=@recording.wav"
```

Returns:

```json
{
  "language": "en",
  "language_probability": 0.98,
  "text": "The transcribed text appears here.",
  "segments": [
    {"start": 0.0, "end": 2.5, "text": "The transcribed text"},
    {"start": 2.5, "end": 4.1, "text": "appears here."}
  ]
}
```
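The `segments` list makes it easy to post-process a transcript, e.g. into timestamped lines. A small sketch working over the response shown above:

```python
# Turn an STT response's segments into timestamped transcript lines.
def format_segments(response):
    """Render each segment as '[start-end] text' with times in seconds."""
    return [
        f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}"
        for seg in response["segments"]
    ]

response = {
    "language": "en",
    "language_probability": 0.98,
    "text": "The transcribed text appears here.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": "The transcribed text"},
        {"start": 2.5, "end": 4.1, "text": "appears here."},
    ],
}
lines = format_segments(response)
# lines[0] == "[0.0-2.5] The transcribed text"
```

The same pattern extends to SRT/VTT export if you need subtitle files.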

## Configuration

All settings are configurable via environment variables. Set them in your shell or pass them through the startup command in gpu.jsonc.

| Variable | Default | Description |
| --- | --- | --- |
| `TTS_MODEL` | `chatterbox` | `chatterbox` (standard) or `chatterbox-turbo` (faster, lower latency) |
| `TTS_EXAGGERATION` | `0.5` | Emotion intensity: 0 = monotone, 1 = very expressive |
| `TTS_CFG_WEIGHT` | `0.5` | Voice-cloning fidelity (higher = closer to the reference) |
| `STT_MODEL` | `large-v3-turbo` | Whisper model size: `tiny`, `base`, `small`, `medium`, `large-v3`, `large-v3-turbo` |
| `STT_COMPUTE_TYPE` | `float16` | `float16`, `int8`, or `float32` |
| `STT_BEAM_SIZE` | `5` | Beam size for decoding (higher = more accurate, slower) |
| `PORT` | `8000` | Server port |
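`config.py` resolves these variables against the defaults above. A hedged sketch of the pattern — the `DEFAULTS` dict and `load_settings` helper mirror the table, not the actual file:

```python
# Sketch of env-var resolution with the documented defaults.
import os

DEFAULTS = {
    "TTS_MODEL": "chatterbox",
    "TTS_EXAGGERATION": "0.5",
    "TTS_CFG_WEIGHT": "0.5",
    "STT_MODEL": "large-v3-turbo",
    "STT_COMPUTE_TYPE": "float16",
    "STT_BEAM_SIZE": "5",
    "PORT": "8000",
}

def load_settings(env=os.environ):
    """Resolve each setting from the environment, falling back to defaults."""
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}

# With no overrides set, load_settings({}) returns the documented defaults.
```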

### Example: Use Turbo TTS

```shell
gpu run TTS_MODEL=chatterbox-turbo python server.py
```

## Project Structure

```
gpu-tts/
  gpu.jsonc          # GPU CLI config (GPU selection, environment, ports)
  config.py          # All configurable settings (env vars)
  tts_server.py      # TTS API (Chatterbox)
  stt_server.py      # STT API (faster-whisper)
  server.py          # Combined TTS + STT server (serves web UI)
  cli.py             # CLI for testing TTS/STT from terminal
  index.html         # Web UI frontend
  requirements.txt   # Python dependencies
  samples/           # Drop voice reference WAVs here
```

## Voice Cloning

1. Place a 5-10 second WAV of the target voice in `samples/`
2. Pass it as the `voice` field when calling `/tts/tts`
3. Adjust `cfg_weight` (0-1) to control how closely the output matches the reference
4. Adjust `exaggeration` (0-1) to control emotional expressiveness

## GPU & Cost

- GPU: RTX 4090 (24 GB VRAM) at ~$0.44/hr
- Both Chatterbox (~8 GB) and faster-whisper large-v3-turbo (~6 GB) fit comfortably
- Auto-stops after 10 minutes idle

## Alternative TTS Models

If Chatterbox doesn't fit your needs, these also have permissive licenses:

| Model | License | Notes |
| --- | --- | --- |
| Kokoro (82M) | Apache 2.0 | Extremely lightweight, preset voices only (no cloning), 9 languages |
| Dia2 (2B) | Apache 2.0 | Multi-speaker dialogue, streaming, great for conversational AI |
| MeloTTS | MIT | CPU-optimized, no voice cloning, 6 languages |

Models to avoid (restrictive licenses): XTTS/Coqui (non-commercial), Fish Speech (weights are CC-BY-NC), MARS5 (AGPL), Piper (GPL).

## License

MIT
