
Voice Synthesis API (Phase 2 - Future)

Status: This is Phase 2 of the talk.dev product roadmap. Development will begin after Phase 1 (Desktop Dictation App) is complete.

Overview

The Voice Synthesis API positions talk.dev as a direct competitor to ElevenLabs in the text-to-speech (TTS) voice AI market. This phase will deliver a developer-first API platform for voice synthesis with faster synthesis, lower per-character pricing, and a simpler developer experience.

Product Direction

Text → Speech (TTS)

  • Transform text into natural speech
  • Voice cloning from short audio samples
  • Real-time streaming synthesis
  • 25+ languages with emotional expression
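
To make the intended developer experience concrete, below is a rough, hypothetical sketch of a synthesis request over plain HTTPS. The endpoint URL, parameter names, and authentication scheme are placeholders, since the Phase 2 API surface has not been designed yet.

```python
# Hypothetical sketch only: the endpoint, parameters, and auth are placeholders,
# not a designed API. Shown to illustrate the intended "simple REST" feel.
import requests

API_KEY = "sk-..."  # placeholder credential

resp = requests.post(
    "https://api.talk.dev/v1/tts",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "text": "Hello from talk.dev.",
        "voice_id": "default",      # hypothetical parameter names
        "language": "en",
        "format": "mp3",
    },
    timeout=30,
)
resp.raise_for_status()

with open("hello.mp3", "wb") as f:
    f.write(resp.content)           # audio bytes returned in the response body
```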

Key Differentiators vs ElevenLabs

| Feature         | ElevenLabs      | talk.dev Target     |
|-----------------|-----------------|---------------------|
| Synthesis Speed | 200ms+          | <150ms              |
| Pricing         | $0.004/1K chars | $0.002/1K chars     |
| Voice Cloning   | 5-15 minutes    | 2-5 minutes         |
| API Design      | Complex         | RESTful + WebSocket |
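
To make the pricing target concrete: at $0.002 per 1,000 characters, synthesizing a 60,000-character document would cost about $0.12, versus about $0.24 at the $0.004 rate listed for ElevenLabs.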

Documentation in This Folder

Prerequisites

Phase 2 development should begin after:

  1. Phase 1 Desktop Dictation App is feature-complete
  2. Convex backend is proven and stable
  3. A user base has been established through the dictation app
  4. A recurring revenue stream exists from dictation subscriptions

Tech Stack (Planned)

  • Voice Engine: Transformer-based neural vocoder
  • Inference: ONNX Runtime on GPU clusters
  • API: FastAPI (Python) with async/await
  • Streaming: WebSocket for real-time synthesis
  • Infrastructure: Kubernetes with auto-scaling GPU nodes
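
For illustration, here is a minimal sketch of the planned FastAPI + WebSocket streaming shape. The route, message format, and the synthesize_chunks() generator are placeholders; the real inference path (ONNX Runtime on GPU) is not yet designed.

```python
# Minimal sketch of the planned FastAPI + WebSocket streaming shape.
# The route, message format, and synthesize_chunks() are placeholders;
# the actual GPU inference pipeline is out of scope here.
import asyncio
from typing import AsyncIterator

from fastapi import FastAPI, WebSocket

app = FastAPI()


async def synthesize_chunks(text: str) -> AsyncIterator[bytes]:
    """Placeholder for the neural vocoder: yields audio chunks as they are ready."""
    for chunk in (text[i:i + 32] for i in range(0, len(text), 32)):
        await asyncio.sleep(0)          # stand-in for GPU inference latency
        yield chunk.encode()            # real implementation would yield PCM/MP3 bytes


@app.websocket("/v1/tts/stream")        # hypothetical route
async def tts_stream(ws: WebSocket) -> None:
    await ws.accept()
    request = await ws.receive_json()   # e.g. {"text": "...", "voice_id": "..."}
    async for audio_chunk in synthesize_chunks(request["text"]):
        await ws.send_bytes(audio_chunk)
    await ws.close()
```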

Timeline

The development timeline is TBD pending Phase 1 completion; the MVP is estimated at 6-12 months after the Phase 1 launch.


Current Priority: See talk-dev-bootstrap-PRD.md for the Phase 1 Desktop Dictation App.
