Cloud-Powered Dictation: Fast STT for Old Linux Hardware with Copilot & Claude

Context

I work daily with GitHub Copilot and Claude Code, and like many developers, I often think faster than I type. Speech‑to‑Text (STT) is an obvious productivity multiplier — but there’s a catch:

Traditional local STT struggles on weak hardware, and my constraints made that painfully clear:

  • My main Linux laptop is old (Dell Vostro 14‑3468)
  • CPU is weak, no GPU acceleration
  • I want speed and accuracy, not fancy UI
  • I want it to work everywhere, not inside a specific app

This post documents a minimal, fast, and reliable STT setup that works perfectly on old hardware and integrates seamlessly into a modern coding workflow.


Design Goals

Before touching any tools, I defined strict constraints:

  • ⚡ Low latency (sub‑second for short dictations)
  • 🧠 High accuracy (English & Spanish)
  • 🖥️ Works on old CPUs
  • ⌨️ Keyboard‑driven (no mouse, no UI)
  • 🌍 Global (works in Copilot, Claude Code, editors, terminals)
  • 🧩 Wayland‑compatible (no X11 hacks)

Offline STT was nice to have, but not required.


Why Local Whisper GUIs Didn’t Work

I tested several local Whisper-based tools (GUI and CLI wrappers). They worked, but:

  • ❌ Slow on CPU‑only machines
  • ❌ Heavy UIs
  • ❌ Poor integration with keyboard workflows

On old hardware, local inference is the bottleneck, not disk or RAM.

So instead of fighting physics, I changed the architecture.


Architecture: Thin Client + API STT

The winning approach:

Mic → WAV → OpenAI STT API → Clipboard → Paste Anywhere

Key idea:

Let the laptop do only audio capture. Let the cloud do inference.

This gives:

  • Near‑instant transcription for short prompts
  • Consistent accuracy
  • Minimal local resource usage

Core Tools

  • Ubuntu Linux (Wayland)
  • sox (rec/play) for audio capture & beeps
  • OpenAI STT API (gpt-4o-mini-transcribe)
  • wl-copy for clipboard integration
  • Keyboard shortcuts (system‑level)

No window focus tricks. No UI automation. No hacks.
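Before wiring up shortcuts, it's worth verifying that the CLI pieces are actually on PATH. A minimal sanity check — the tool list comes from this setup, but the helper itself is illustrative, not part of the original scripts:

```python
import shutil

# CLI tools this pipeline depends on (from the setup above).
REQUIRED = ["rec", "play", "wl-copy"]

def missing_tools(tools, which=shutil.which):
    """Return the subset of tools not found on PATH.

    `which` is injectable so the check can be tested without
    touching the real environment.
    """
    return [t for t in tools if which(t) is None]

if __name__ == "__main__":
    missing = missing_tools(REQUIRED)
    if missing:
        print("Install first:", ", ".join(missing))
    else:
        print("All required tools found")
```

On Ubuntu, `rec` and `play` come from the `sox` package and `wl-copy` from `wl-clipboard`.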


The Dictation Script

This Python script does exactly one thing:

  • Record audio until I stop it
  • Transcribe it
  • Put the text in the clipboard
  • Signal readiness with a sound

import subprocess
import tempfile
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

def beep(freq, dur=0.12):
    subprocess.run(
        ["play", "-nq", "synth", str(dur), "sine", str(freq)],
        check=False,
    )

with tempfile.NamedTemporaryFile(suffix=".wav") as f:
    beep(1200)  # start recording

    try:
        subprocess.run(
            ["rec", "-q", f.name, "rate", "16000", "channels", "1"],
            check=True,
        )
    except (KeyboardInterrupt, subprocess.CalledProcessError):
        # Ctrl+C in a terminal raises KeyboardInterrupt; when stopped
        # via `pkill -INT rec`, rec may exit non-zero instead — both
        # are the normal "stop recording" path, not errors
        pass

    beep(600)  # stop recording

    with open(f.name, "rb") as audio:
        r = client.audio.transcriptions.create(
            model="gpt-4o-mini-transcribe",
            file=audio,
        )
        text = r.text.strip()

subprocess.run(
    ["wl-copy"],
    input=text,
    text=True,
    check=True,
)

beep(1500)  # clipboard ready

Push‑to‑Talk With Two Shortcuts

To make this frictionless, I use two global keyboard shortcuts:

▶ Start Dictation

#!/usr/bin/env bash
source ~/tools/stt-dictation/.venv/bin/activate
python ~/tools/stt-dictation/dictate.py

Shortcut example:

  • Super + D

⏹ Stop Dictation

#!/usr/bin/env bash
pkill -INT rec

Shortcut example:

  • Super + S

This cleanly stops recording without killing Python, allowing transcription to finish.
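The reason this works: SIGINT is delivered only to the rec process, so Python's subprocess.run returns normally and the script carries on to transcription. A tiny stand-in demo, with sleep playing the role of rec (illustrative only):

```python
import signal
import subprocess
import time

# Stand-in for the real flow: `sleep` plays the role of `rec`.
# Sending SIGINT to the child only (what `pkill -INT rec` does)
# ends the child, while this parent process keeps running.
proc = subprocess.Popen(["sleep", "30"])
time.sleep(0.2)                    # child is running
proc.send_signal(signal.SIGINT)    # targeted interrupt; parent untouched
proc.wait()                        # child exits; parent continues
print("child stopped, parent still alive")
```

Because the signal targets the child by name, the Python process never sees it and can finish transcribing and filling the clipboard.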


Audio Feedback (Why It Matters)

I intentionally avoided notifications or UI.

Instead, sound cues communicate state instantly:

  • 🔊 High beep → recording started
  • 🔊 Low beep → recording stopped
  • 🔔 Single high beep → text ready to paste

This avoids a classic failure mode:

Pasting before transcription finishes.

With the final beep, I know exactly when Ctrl+V is safe.

Sound cues work particularly well for coding workflows because they don’t require visual attention or interrupt your focus on the screen.
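The three cues amount to a small state-to-tone table. The frequencies below match the beep() calls in the script; the table itself is just an illustrative restatement:

```python
# State → tone mapping; frequencies (Hz) match the script's beep() calls.
CUES = {
    "recording_started": 1200,  # high beep
    "recording_stopped": 600,   # low beep
    "clipboard_ready":   1500,  # final beep → Ctrl+V is now safe
}

def tone_for(state: str) -> int:
    """Look up the sine frequency (Hz) used to signal a given state."""
    return CUES[state]
```

Keeping the "stop" tone noticeably lower than the other two makes the three states distinguishable without looking at anything.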


Performance & Cost

For short prompts (< 5 seconds):

  • ⏱️ End‑to‑end latency: ~1 second
  • 💲 Cost per dictation: ~$0.0005 (at ~$0.006 per minute of audio for gpt-4o-mini-transcribe)

Even with heavy daily usage, monthly cost is negligible.

At this scale, latency matters far more than cost.
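The arithmetic behind those numbers, using the per-minute rate quoted above (the 200-dictations-per-day figure is just an illustrative "heavy usage" assumption):

```python
# Back-of-envelope cost at ~$0.006/min of audio (gpt-4o-mini-transcribe).
PRICE_PER_MINUTE = 0.006   # USD per minute of audio
seconds = 5                # a typical short dictation

cost_per_dictation = PRICE_PER_MINUTE * seconds / 60
print(f"per dictation: ${cost_per_dictation:.4f}")

# Even heavy usage stays in pocket-change territory:
daily = 200 * cost_per_dictation   # assume 200 dictations/day
print(f"200/day: ${daily:.2f}, ~${daily * 30:.2f}/month")
```

Two hundred dictations a day comes to about ten cents, which is why latency, not cost, is the number worth optimizing.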


Why This Works Well for Copilot & Claude Code

  • Global clipboard → works everywhere
  • No dependency on editor plugins
  • Perfect for:
    • Long prompts
    • Refactoring instructions
    • Natural language problem descriptions

It feels like talking to the IDE, not typing into it.


Lessons Learned

  1. Old hardware is fine if you keep it thin
  2. Cloud inference beats local CPU every time
  3. Keyboard‑first workflows matter
  4. Audio feedback > visual feedback
  5. Simple pipelines beat complex tools

Final Thoughts

This setup turned an aging laptop into a high‑quality dictation workstation for modern AI‑assisted development.

No heavyweight apps. No vendor lock‑in. No UI friction.

Just:

Think → Speak → Paste → Code

Key Takeaways:

  • Cloud processing beats local CPU for speed on old hardware
  • Keyboard-driven workflows integrate seamlessly with coding
  • Audio feedback provides instant state awareness
  • Minimalist design ensures reliability and low maintenance

If you’re working on Linux with limited hardware and heavy AI usage, this approach is worth copying.
