GPU-accelerated Text-to-Speech and Speech-to-Text template for GPU CLI.
Run state-of-the-art TTS with voice cloning and STT transcription on cloud GPUs from your Mac.
| Service | Model | License | Why |
|---|---|---|---|
| TTS | Chatterbox by Resemble AI | MIT | Voice cloning from ~5s of audio, emotion control, 23 languages |
| STT | faster-whisper | MIT | 4x faster than Whisper, same accuracy, INT8/FP16 quantization |
Both are MIT-licensed, so commercial use is unrestricted.
```shell
# Start combined TTS + STT server on a cloud GPU
gpu run

# Or run just the TTS server
gpu run python tts_server.py

# Or run just the STT server
gpu run python stt_server.py
```

The server starts on port 8000 (forwarded to your machine via the gpu CLI). Open http://localhost:8000 for the web UI, or use the CLI below.
The web frontend is served at / and provides:
- Text input with synthesize button (TTS)
- Microphone recording or file upload (STT)
- Voice reference upload for cloning
- Exaggeration and CFG weight sliders
- Interaction history with audio playback
```shell
# Interactive REPL (type text for TTS, /stt to record)
python cli.py

# One-shot TTS
python cli.py tts "Hello world" --voice samples/ref.wav -o output.wav

# One-shot STT (from file)
python cli.py stt recording.wav

# One-shot STT (record from mic, needs sox)
python cli.py stt --duration 5

# Health check
python cli.py health
```

```shell
# Basic synthesis
curl -X POST http://localhost:8000/tts/tts \
  -F "text=Hello, this is a test." \
  -o output.wav

# With voice cloning (provide a 5-10s reference WAV)
curl -X POST http://localhost:8000/tts/tts \
  -F "text=Hello, this is a test." \
  -F "voice=@samples/reference.wav" \
  -F "exaggeration=0.6" \
  -F "cfg_weight=0.5" \
  -o output.wav
```

```shell
curl -X POST http://localhost:8000/stt/stt \
  -F "audio=@recording.wav"
```

Returns:
```json
{
  "language": "en",
  "language_probability": 0.98,
  "text": "The transcribed text appears here.",
  "segments": [
    {"start": 0.0, "end": 2.5, "text": "The transcribed text"},
    {"start": 2.5, "end": 4.1, "text": "appears here."}
  ]
}
```

All settings are configurable via environment variables. Set them in your shell or pass them through the startup command in gpu.jsonc.
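The STT response shown above is plain JSON, so it is easy to post-process; a quick sketch in Python using the sample payload verbatim (joining segment text and computing the clip duration):

```python
import json

# Sample /stt/stt response (copied from the example above)
payload = json.loads("""
{
  "language": "en",
  "language_probability": 0.98,
  "text": "The transcribed text appears here.",
  "segments": [
    {"start": 0.0, "end": 2.5, "text": "The transcribed text"},
    {"start": 2.5, "end": 4.1, "text": "appears here."}
  ]
}
""")

# Rebuild the transcript from segments and compute total duration
transcript = " ".join(s["text"] for s in payload["segments"])
duration = payload["segments"][-1]["end"] - payload["segments"][0]["start"]
print(transcript)  # The transcribed text appears here.
print(duration)    # 4.1
```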
| Variable | Default | Description |
|---|---|---|
| TTS_MODEL | chatterbox | chatterbox (standard) or chatterbox-turbo (faster, lower latency) |
| TTS_EXAGGERATION | 0.5 | Emotion intensity: 0 = monotone, 1 = very expressive |
| TTS_CFG_WEIGHT | 0.5 | Voice cloning fidelity (higher = closer to reference) |
| STT_MODEL | large-v3-turbo | Whisper model size: tiny, base, small, medium, large-v3, large-v3-turbo |
| STT_COMPUTE_TYPE | float16 | float16, int8, or float32 |
| STT_BEAM_SIZE | 5 | Beam size for decoding (higher = more accurate, slower) |
| PORT | 8000 | Server port |

```shell
gpu run TTS_MODEL=chatterbox-turbo python server.py
```

```
gpu-tts/
  gpu.jsonc         # GPU CLI config (GPU selection, environment, ports)
  config.py         # All configurable settings (env vars)
  tts_server.py     # TTS API (Chatterbox)
  stt_server.py     # STT API (faster-whisper)
  server.py         # Combined TTS + STT server (serves web UI)
  cli.py            # CLI for testing TTS/STT from terminal
  index.html        # Web UI frontend
  requirements.txt  # Python dependencies
  samples/          # Drop voice reference WAVs here
```
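config.py presumably follows the standard env-var-with-default pattern; a sketch of what that looks like (variable names and defaults are taken from the table above, but the real file may structure things differently):

```python
import os

# Read each setting from the environment, falling back to the
# documented default when the variable is unset.
TTS_MODEL = os.environ.get("TTS_MODEL", "chatterbox")
TTS_EXAGGERATION = float(os.environ.get("TTS_EXAGGERATION", "0.5"))
TTS_CFG_WEIGHT = float(os.environ.get("TTS_CFG_WEIGHT", "0.5"))
STT_MODEL = os.environ.get("STT_MODEL", "large-v3-turbo")
STT_COMPUTE_TYPE = os.environ.get("STT_COMPUTE_TYPE", "float16")
STT_BEAM_SIZE = int(os.environ.get("STT_BEAM_SIZE", "5"))
PORT = int(os.environ.get("PORT", "8000"))
```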
- Place a 5-10 second WAV of the target voice in `samples/`
- Pass it as the `voice` field when calling `/tts/tts`
- Adjust `cfg_weight` (0-1) to control how closely the output matches the reference
- Adjust `exaggeration` (0-1) to control emotional expressiveness
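Before uploading, it can be worth confirming a reference clip actually falls in the 5-10 second window; a stdlib-only sketch (the silent demo clip and the `ref_check.wav` filename are placeholders for illustration, not part of the template):

```python
import wave

def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Demo: write a 7-second silent mono 16 kHz clip, then check its length
with wave.open("ref_check.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000 * 7)

d = wav_duration("ref_check.wav")
print(round(d, 2))   # 7.0
print(5 <= d <= 10)  # True
```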
- GPU: RTX 4090 (24GB VRAM) at ~$0.44/hr
- Both Chatterbox (~8GB) and faster-whisper large-v3-turbo (~6GB) fit comfortably
- Auto-stops after 10 minutes idle
If Chatterbox doesn't fit your needs, these also have permissive licenses:
| Model | License | Notes |
|---|---|---|
| Kokoro (82M) | Apache 2.0 | Extremely lightweight, preset voices only (no cloning), 9 languages |
| Dia2 (2B) | Apache 2.0 | Multi-speaker dialogue, streaming, great for conversational AI |
| MeloTTS | MIT | CPU-optimized, no voice cloning, 6 languages |
Models to avoid (restrictive licenses): XTTS/Coqui (non-commercial), Fish Speech (weights are CC-BY-NC), MARS5 (AGPL), Piper (GPL).
MIT