transcribe.dev - Voice Dictation Platform

Project Overview

transcribe.dev is a system-wide voice-to-text dictation tool (similar to Wispr Flow) that transcribes speech in real time, then uses AI to clean, format, and adapt tone based on context. It works in any application.

Tech Stack

  • Desktop App: Electron + Next.js + TypeScript
  • Backend: Convex (real-time database, actions, file storage)
  • Voice Capture: Web Audio API
  • STT Engine: Deepgram (primary cloud) / whisper.cpp (local/privacy mode)
  • AI Cleanup: Claude Haiku API (via Convex actions)
  • Text Injection: robotjs or nut.js (cross-platform keystroke simulation)
  • Auth: WorkOS AuthKit
  • Mobile (future): React Native

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        ELECTRON APP                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │ Voice Input │→ │ Deepgram WS │→ │ Convex Action (cleanup) │  │
│  │ (mic)       │  │ (streaming) │  │ Claude Haiku            │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
│         │                                      │                 │
│         ▼                                      ▼                 │
│  ┌─────────────┐                    ┌─────────────────────────┐  │
│  │ System Tray │                    │ Text Injection (nut.js) │  │
│  │ + Hotkeys   │                    │ → Active Application    │  │
│  └─────────────┘                    └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│                         CONVEX BACKEND                           │
│  Tables: users, dictionaries, snippets, appTones, history        │
│  Actions: transcribeAudio, cleanupTranscript                     │
│  Real-time sync across all user devices                          │
└─────────────────────────────────────────────────────────────────┘

Project Structure

transcribe-dev/
├── apps/
│   └── desktop/                 # Electron + Next.js app
│       ├── main/                # Electron main process
│       │   ├── index.ts         # Main entry, window management
│       │   ├── tray.ts          # System tray
│       │   ├── hotkeys.ts       # Global hotkey registration
│       │   ├── audio.ts         # Audio capture
│       │   └── injection.ts     # Text injection via nut.js
│       ├── renderer/            # Next.js renderer
│       │   ├── app/
│       │   │   ├── page.tsx     # Main dictation UI
│       │   │   ├── settings/    # Settings pages
│       │   │   └── layout.tsx
│       │   ├── components/
│       │   │   ├── Waveform.tsx
│       │   │   ├── DictationOverlay.tsx
│       │   │   └── SettingsPanel.tsx
│       │   └── hooks/
│       │       ├── useVoiceCapture.ts
│       │       ├── useDeepgram.ts
│       │       └── useTextInjection.ts
│       ├── electron-builder.json
│       └── package.json
├── convex/                      # Convex backend
│   ├── schema.ts                # Database schema
│   ├── users.ts                 # User queries/mutations
│   ├── dictionaries.ts          # Personal dictionary
│   ├── snippets.ts              # Voice shortcuts
│   ├── appTones.ts              # Per-app tone settings
│   ├── history.ts               # Dictation history
│   └── voice.ts                 # Actions: transcribe, cleanup
├── packages/
│   └── shared/                  # Shared types and utilities
│       ├── types.ts
│       └── constants.ts
├── package.json                 # Monorepo root
├── turbo.json                   # Turborepo config
└── README.md
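The `packages/shared` types can mirror the Convex schema so the Electron main process and renderer agree on shapes. A minimal sketch of `packages/shared/types.ts` (the exact names here are placeholders, not a settled API):

```typescript
// packages/shared/types.ts (sketch; names are assumptions)
export type Tone = "casual" | "professional" | "technical" | "friendly";

export const TONES: readonly Tone[] = [
  "casual",
  "professional",
  "technical",
  "friendly",
] as const;

// Narrow an arbitrary string (e.g. from settings storage) to a valid Tone.
export function isTone(value: string): value is Tone {
  return (TONES as readonly string[]).includes(value);
}

// Shape of one completed dictation, matching the history table fields.
export interface DictationResult {
  rawTranscript: string;
  cleanedText: string;
  targetApp?: string;
  durationMs: number;
}
```

Keeping the `Tone` union in one place means the schema's `v.union(v.literal(...))` and any UI dropdowns can be checked against the same list.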

Convex Schema

// convex/schema.ts
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  users: defineTable({
    authId: v.string(),
    email: v.string(),
    plan: v.union(v.literal("free"), v.literal("pro"), v.literal("team")),
    settings: v.object({
      defaultLanguage: v.string(),
      localModeEnabled: v.boolean(),
      autoCapitalize: v.boolean(),
      autoPunctuation: v.boolean(),
      holdToTalk: v.boolean(),
      hotkey: v.string(),
    }),
    createdAt: v.number(),
  }).index("by_auth", ["authId"]),

  dictionaries: defineTable({
    userId: v.id("users"),
    word: v.string(),
    pronunciation: v.optional(v.string()),
    category: v.optional(v.string()),
  }).index("by_user", ["userId"]),

  snippets: defineTable({
    userId: v.id("users"),
    trigger: v.string(),
    expansion: v.string(),
    isEnabled: v.boolean(),
  }).index("by_user", ["userId"]),

  appTones: defineTable({
    userId: v.id("users"),
    appIdentifier: v.string(),
    tone: v.union(
      v.literal("casual"),
      v.literal("professional"),
      v.literal("technical"),
      v.literal("friendly")
    ),
    customInstructions: v.optional(v.string()),
  }).index("by_user_app", ["userId", "appIdentifier"]),

  history: defineTable({
    userId: v.id("users"),
    rawTranscript: v.string(),
    cleanedText: v.string(),
    targetApp: v.optional(v.string()),
    durationMs: v.number(),
    createdAt: v.number(),
  }).index("by_user_time", ["userId", "createdAt"]),
});
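Rows from the `snippets` table need to be applied to the cleaned text before injection. A rough sketch of the matching logic — whole-word, case-insensitive trigger replacement is an assumption here, not a settled design:

```typescript
// Sketch: expand enabled voice snippets in cleaned text.
interface Snippet {
  trigger: string;
  expansion: string;
  isEnabled: boolean;
}

export function expandSnippets(text: string, snippets: Snippet[]): string {
  return snippets
    .filter((s) => s.isEnabled)
    .reduce((acc, s) => {
      // Escape regex metacharacters in the trigger, then match whole words only.
      const escaped = s.trigger.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
      return acc.replace(new RegExp(`\\b${escaped}\\b`, "gi"), s.expansion);
    }, text);
}
```

For example, with a snippet `{ trigger: "sig", expansion: "Best, Alex" }`, dictating "add my sig" would inject "add my Best, Alex".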

Key Convex Actions

// convex/voice.ts
import { action } from "./_generated/server";
import { v } from "convex/values";
import Anthropic from "@anthropic-ai/sdk";

export const cleanupTranscript = action({
  args: {
    rawTranscript: v.string(),
    tone: v.string(),
    customInstructions: v.optional(v.string()),
    userDictionary: v.array(v.string()),
  },
  handler: async (ctx, args) => {
    const anthropic = new Anthropic();

    const systemPrompt = `You are a dictation cleanup assistant. Transform spoken text into clean, written text.
    
Rules:
- Remove filler words (um, uh, like, you know)
- Fix grammar and punctuation
- Maintain the speaker's voice and intent
- Tone: ${args.tone}
- Known words/names to preserve exactly: ${args.userDictionary.join(", ")}
${args.customInstructions ? `- Additional instructions: ${args.customInstructions}` : ""}

Return ONLY the cleaned text, nothing else.`;

    const response = await anthropic.messages.create({
      model: "claude-3-haiku-20240307",
      max_tokens: 1024,
      system: systemPrompt,
      messages: [{ role: "user", content: args.rawTranscript }],
    });

    return response.content[0].type === "text" 
      ? response.content[0].text 
      : args.rawTranscript;
  },
});

export const transcribeAudio = action({
  args: {
    audioBase64: v.string(),
    language: v.optional(v.string()),
  },
  handler: async (ctx, args) => {
    const response = await fetch(
      "https://api.deepgram.com/v1/listen?model=nova-2&smart_format=true",
      {
        method: "POST",
        headers: {
          Authorization: `Token ${process.env.DEEPGRAM_API_KEY}`,
          "Content-Type": "audio/webm",
        },
        body: Buffer.from(args.audioBase64, "base64"),
      }
    );

    const result = await response.json();
    return result.results?.channels?.[0]?.alternatives?.[0]?.transcript ?? "";
  },
});

Electron Main Process

// apps/desktop/main/index.ts
import path from "path";
import { app, BrowserWindow, ipcMain } from "electron";
import { createTray } from "./tray";
import { registerHotkeys } from "./hotkeys";
import { AudioCapture } from "./audio";
import { TextInjector } from "./injection";

let mainWindow: BrowserWindow | null = null;
let audioCapture: AudioCapture;
let textInjector: TextInjector;

async function createWindow() {
  mainWindow = new BrowserWindow({
    width: 400,
    height: 600,
    frame: false,
    transparent: true,
    alwaysOnTop: true,
    webPreferences: {
      nodeIntegration: false,
      contextIsolation: true,
      preload: path.join(__dirname, "preload.js"),
    },
  });

  // Load Next.js app
  if (process.env.NODE_ENV === "development") {
    mainWindow.loadURL("http://localhost:3000");
  } else {
    mainWindow.loadFile("renderer/out/index.html");
  }
}

app.whenReady().then(async () => {
  await createWindow();
  createTray(mainWindow);
  
  audioCapture = new AudioCapture();
  textInjector = new TextInjector();
  
  registerHotkeys({
    onStartRecording: () => audioCapture.start(),
    onStopRecording: () => audioCapture.stop(),
  });

  // IPC handlers
  ipcMain.handle("inject-text", async (_, text: string) => {
    await textInjector.inject(text);
  });

  ipcMain.handle("get-active-app", async () => {
    return textInjector.getActiveApp();
  });
});
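With `get-active-app` returning an app identifier, the renderer can resolve which tone to pass to `cleanupTranscript`. A hypothetical resolver over rows from the `appTones` table — the case-insensitive match and the `"professional"` fallback are assumptions:

```typescript
// Sketch: pick tone settings for the active app, with an assumed default.
interface AppTone {
  appIdentifier: string;
  tone: "casual" | "professional" | "technical" | "friendly";
  customInstructions?: string;
}

export function resolveTone(
  activeApp: string | null,
  appTones: AppTone[]
): { tone: AppTone["tone"]; customInstructions?: string } {
  const match =
    activeApp === null
      ? undefined
      : appTones.find(
          (t) => t.appIdentifier.toLowerCase() === activeApp.toLowerCase()
        );
  return match
    ? { tone: match.tone, customInstructions: match.customInstructions }
    : { tone: "professional" }; // assumed default when no per-app rule exists
}
```

This keeps the tone decision on the client, so `cleanupTranscript` stays a pure "text in, text out" action.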

Deepgram Streaming Hook

// apps/desktop/renderer/hooks/useDeepgram.ts
import { useCallback, useRef, useState } from "react";

interface UseDeepgramOptions {
  onTranscript: (text: string, isFinal: boolean) => void;
  onError: (error: Error) => void;
}

export function useDeepgram({ onTranscript, onError }: UseDeepgramOptions) {
  const wsRef = useRef<WebSocket | null>(null);
  const [isConnected, setIsConnected] = useState(false);

  const connect = useCallback(async () => {
    // NOTE: subprotocol auth exposes the Deepgram key to the client;
    // consider minting short-lived keys server-side for production.
    const ws = new WebSocket(
      "wss://api.deepgram.com/v1/listen?model=nova-2&smart_format=true&interim_results=true",
      ["token", process.env.NEXT_PUBLIC_DEEPGRAM_API_KEY!]
    );

    ws.onopen = () => setIsConnected(true);
    ws.onclose = () => setIsConnected(false);
    ws.onerror = () => onError(new Error("WebSocket error"));

    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      const transcript = data.channel?.alternatives?.[0]?.transcript;
      const isFinal = data.is_final;
      
      if (transcript) {
        onTranscript(transcript, isFinal);
      }
    };

    wsRef.current = ws;
  }, [onTranscript, onError]);

  const sendAudio = useCallback((audioData: ArrayBuffer) => {
    if (wsRef.current?.readyState === WebSocket.OPEN) {
      wsRef.current.send(audioData);
    }
  }, []);

  const disconnect = useCallback(() => {
    wsRef.current?.close();
    wsRef.current = null;
  }, []);

  return { connect, sendAudio, disconnect, isConnected };
}
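The `onTranscript(text, isFinal)` callback interleaves interim and final results, so something has to assemble the full utterance. One policy (a sketch, not the only option) is to commit only final results and overlay the latest interim hypothesis for live display:

```typescript
// Sketch: accumulate Deepgram results into committed + pending text.
export class TranscriptBuffer {
  private finals: string[] = [];
  private interim = "";

  push(text: string, isFinal: boolean): void {
    if (isFinal) {
      // Commit the final segment and clear the interim overlay.
      if (text.trim()) this.finals.push(text.trim());
      this.interim = "";
    } else {
      // Interim results replace each other; only the latest matters.
      this.interim = text;
    }
  }

  /** Committed text plus the current interim hypothesis, for live display. */
  current(): string {
    return [...this.finals, this.interim].filter(Boolean).join(" ");
  }

  /** Final text only, suitable for sending to cleanupTranscript. */
  finalText(): string {
    return this.finals.join(" ");
  }
}
```

The overlay UI renders `current()` while recording; on hotkey release, `finalText()` goes to the cleanup action.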

Environment Variables

# .env.local (Convex)
CONVEX_DEPLOYMENT=your-deployment-name
WORKOS_CLIENT_ID=your-workos-client-id
DEEPGRAM_API_KEY=your-deepgram-key
ANTHROPIC_API_KEY=your-anthropic-key

# .env.local (Web App)
NEXT_PUBLIC_CONVEX_URL=https://your-deployment.convex.cloud
WORKOS_CLIENT_ID=your-workos-client-id
WORKOS_API_KEY=your-workos-api-key
WORKOS_COOKIE_PASSWORD=your-32-char-random-string
NEXT_PUBLIC_WORKOS_REDIRECT_URI=http://localhost:3012/callback
NEXT_PUBLIC_DEEPGRAM_API_KEY=your-deepgram-key

Getting Started Commands

# 1. Create the monorepo
mkdir transcribe-dev && cd transcribe-dev
pnpm init

# 2. Set up Turborepo
pnpm add -D turbo
echo '{"$schema": "https://turbo.build/schema.json", "tasks": {"build": {}, "dev": {"cache": false}}}' > turbo.json

# 3. Initialize Convex (in the existing monorepo)
pnpm add convex
pnpm exec convex dev  # provisions a deployment and scaffolds convex/

# 4. Create Electron app with Next.js
mkdir -p apps/desktop
cd apps/desktop
pnpm create next-app@latest renderer --typescript --tailwind --eslint --app --no-src-dir
pnpm add electron electron-builder
pnpm add -D concurrently wait-on

# 5. Install key dependencies
pnpm add convex @workos-inc/authkit-nextjs @anthropic-ai/sdk
pnpm add @nut-tree/nut-js  # For text injection

# 6. Run development
pnpm dev
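`pnpm dev` assumes the desktop app wires Next.js and Electron together. One possible `apps/desktop/package.json` script setup, using the `concurrently` and `wait-on` dev dependencies from step 4 (the port and paths are assumptions):

```json
{
  "scripts": {
    "dev": "concurrently \"next dev renderer -p 3000\" \"wait-on http://localhost:3000 && electron .\"",
    "build": "next build renderer && electron-builder"
  }
}
```

`wait-on` holds Electron back until the Next.js dev server is answering, so the window never loads a dead URL.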

MVP Feature Checklist

Phase 1: Core Dictation

  • Electron app with system tray
  • Global hotkey (Cmd+Shift+Space) to start/stop
  • Deepgram streaming transcription
  • Basic text injection into active app
  • Minimal floating UI showing waveform
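If the Web Audio capture path streams raw samples (rather than MediaRecorder's webm container), Deepgram's `encoding=linear16` mode expects 16-bit PCM. A common conversion helper — this assumes the WebSocket URL requests `linear16` and passes the capture `sample_rate` along:

```typescript
// Sketch: convert Web Audio Float32 samples in [-1, 1] to 16-bit PCM
// for streaming to Deepgram with encoding=linear16.
export function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp to valid range
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The resulting `Int16Array`'s `.buffer` can be handed straight to `sendAudio` from the Deepgram hook.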

Phase 2: AI Cleanup

  • Claude Haiku cleanup via Convex action
  • Basic filler word removal
  • Grammar correction
  • Punctuation

Phase 3: Personalization

  • Personal dictionary (sync via Convex)
  • Snippets/voice shortcuts
  • Per-app tone settings
  • Settings UI

Phase 4: Polish

  • Onboarding flow
  • WorkOS AuthKit integration
  • Usage history
  • Stripe billing integration
  • Auto-updates via electron-builder

API Keys Needed

  1. Deepgram - https://console.deepgram.com (free tier: $200 credit)
  2. Anthropic - https://console.anthropic.com (for Claude Haiku cleanup)
  3. WorkOS - https://dashboard.workos.com (auth)
  4. Convex - https://dashboard.convex.dev (free tier generous)

Notes

  • Start with cloud-only (Deepgram) for MVP, add local Whisper later for privacy mode
  • Prefer nut.js over robotjs: it's better maintained and has first-class TypeScript support
  • Use Electron's globalShortcut for hotkeys; renderer-level shortcuts won't fire while the app is unfocused
  • Consider electron-store for local preferences that don't need sync
