---
title: "Transcription Tools"
description: "Transcribe audio and video with AI — record, transcribe, summarize"
tags: ["media", "transcription", "pipeline"]
difficulty: beginner
time_estimate: "90 min"
project_directions: ["content-pipeline"]
---
# Transcription Tools
Turning spoken words into text is one of the most practical AI capabilities. Lectures, interviews, meetings, podcasts — anything someone said can become text you can search, quote, summarize, and feed to another AI.
This quest walks you through building a transcription pipeline from the terminal with Claude Code, then stress-testing Whisper to learn what it's actually good at.
## Prerequisites
- Completed Module 2.1 (The Terminal) — you'll use `yt-dlp` and `ffmpeg`
- Completed Module 2.2 (Claude Code)
- An OpenAI API key (the same one from Unit 1 works)
---
## Read the Docs First
Before writing any code, read OpenAI's guide on speech-to-text:
**https://developers.openai.com/api/docs/guides/speech-to-text**
Pay attention to:
- What models are available and what they can do (transcription vs. translation)
- What audio formats and file sizes are supported
- What parameters you can pass (language, prompt, temperature, response format)
- What the `prompt` parameter is for and how it influences output
You'll use what you learn here to design your own transcription script and experiments. Don't skim — the parameters section is especially important for Part 2.
---
## Setup
Open your terminal and create a project folder:
```bash
mkdir -p ~/transcription-quest && cd ~/transcription-quest
```
Start Claude Code and have it set up the structure. You want:
- an `audio/` folder for source files
- an `output/` folder for transcripts
- a `scripts/` folder
- a `CLAUDE.md` describing the project
### Get an audio file
**Option A — Record yourself.** Use Voice Memos on your Mac or phone. 1–2 minutes is enough. Save it into `audio/`.
**Option B — Download from YouTube.** From Module 2.1:
```bash
yt-dlp -o "audio/source-video.mp4" "https://www.youtube.com/watch?v=YOUR_VIDEO_ID"
```
Pick something short — 2–5 minutes.
### Extract audio (if you downloaded video)
```bash
ffmpeg -i audio/source-video.mp4 audio/source.mp3
```
### Set your API key
```bash
export OPENAI_API_KEY="sk-your-key-here"
```
> Don't paste your key into any file that might get committed to Git. Environment variables are the safe way.
---
## Part 1 — Transcribe Your First File
Ask Claude Code to write a Python script at `scripts/transcribe.py`. Based on what you read in the docs, tell it what the script should do:
- Take an audio file path as an argument
- Send it to the Whisper API
- Save both the raw JSON response and a text-only version to `output/`
You know from the docs what model to use, what response format gives you timestamps, and what package to import. Write the prompt yourself — include those details so Claude Code gets it right on the first try. Make sure you ask for the response format that includes timestamped segments (not just plain text) — you'll need that structure in Part 3 when you clean up the transcript.
Before running the script, read what Claude Code wrote. You should be able to point to the part that calls the API, the part that reads the file, and the part that saves the output.
### Run it
```bash
pip install openai # if you don't have it yet
python scripts/transcribe.py audio/source.mp3
```
Open the transcript and read through it while listening to the original. Notice what it got right, what it missed — names, technical terms, numbers, trailing-off sentences.
> **If it errors:** paste the error into Claude Code. Common issues: wrong file path, missing API key, audio file too large (the docs told you the limit — use `ffmpeg` to compress if needed).
---
## Part 2 — Stress-Test Whisper
You've seen it work on one file. Now find out where it breaks.
### Build test clips
```bash
mkdir -p audio/tests
```
Make **at least 3 short clips** (10–30 seconds each). Ideas:
- **Another language.** Record yourself or find a clip. Does Whisper detect the language? Does it transcribe or translate?
- **Strong accent.** Exaggerated or regional. How does accuracy change?
- **Background noise.** Music playing, whispering, speaking quickly.
- **Technical jargon.** A chemistry lecture, legal proceeding, medical explanation. Does it get the terms right?
- **Overlapping speakers.** Two people talking at once.
- **Proper nouns.** Names of people, places, obscure words — where transcription models tend to hallucinate.
To cut clips from longer files:
```bash
ffmpeg -i audio/some-long-file.mp3 -ss 00:01:00 -t 30 audio/tests/accent-test.mp3
```
### Run them
The script currently overwrites the same output files. Ask Claude Code to fix that so the output filenames are based on the input filename.
Then transcribe each test clip.
### Try the API parameters
You read about `language`, `prompt`, and `temperature` in the docs. Ask Claude Code to update `transcribe.py` so you can pass these as optional command-line flags.
Then re-run some of your test clips with different settings:
- Force a language on a clip that was misdetected
- Use the `prompt` parameter to hint at proper nouns or jargon that Whisper got wrong on the first pass — did it help?
- Try different temperature values on a clip with unclear audio
The `prompt` parameter is context engineering for speech-to-text. Same idea as a `_context/` folder — you're giving the model background it needs to do a better job.
### Track your results
Ask Claude Code to read all the transcript files in `output/` and generate a markdown table at `output/test-results.md` with columns for filename, language, duration, and the first 100 characters of transcript. Leave a notes column blank.
Fill in the notes yourself — which ones were accurate, which had errors, what kind of errors (wrong words, missed words, hallucinated words).
---
## Part 3 — Post-Processing
Raw transcripts have no paragraph breaks and no structure. There are two kinds of post-processing, and they're good for different things.
### Mechanical formatting (regex, string operations)
Some cleanup is purely structural — inserting paragraph breaks, adding timestamps, reformatting speaker labels. These are predictable transformations where the same rule applies every time. A script with regex or basic string operations is the right tool: it's fast, free, deterministic, and won't accidentally change the words.
**Regex** (short for "regular expression") is a way to describe text patterns. For example, the regex `\[Speaker \d+\]` matches any text like `[Speaker 1]` or `[Speaker 23]` — the `\d+` means "one or more digits." You don't need to learn regex yourself — Claude Code can write it for you — but it helps to know that it exists and that it's the standard way programs find and replace patterns in text.
Ask Claude Code to write a script at `scripts/clean-transcript.py` that uses regex and string operations (not an API call) to:
- Read a raw transcript JSON file (path as argument)
- Group segments into paragraphs based on pauses between them
- Add timestamps at the start of each paragraph
- Save the result as a `.md` file in `output/`
Run it and compare the raw and clean versions side by side. Iterate on it — maybe the pause threshold is wrong and the paragraphs are too long or too short.
### Semantic processing (LLM API call)
Other cleanup requires understanding what was said. If your transcript has two speakers labeled "Speaker 1" and "Speaker 2," figuring out which one is the interviewer and which is the interviewee isn't a pattern-matching problem — it requires reading the conversation and noticing who asks questions and who answers them. Same for things like adding section headings, identifying when the topic shifts, or summarizing key points.
Regex can't do any of this. You need to send the transcript to an LLM that can read it and make judgments.
Ask Claude Code to write a script at `scripts/summarize.py` that:
- Reads a cleaned transcript `.md` file (path as argument)
- Sends it to an LLM API (OpenAI or Anthropic — whichever key you have)
- Asks for a summary that extracts the key points
- Saves the result to `output/` as a `.md` file
Run it on one of your cleaned transcripts. Look at what it produces — does it capture the important parts? Does it miss anything you'd consider essential?
Now try adjusting what you ask for. Instead of a general summary, ask for something more specific to the content: action items from a meeting, the main arguments from a lecture, a list of questions the interviewer asked. Change the prompt in the script (or ask Claude Code to) and run it again.
The tradeoff between these two kinds of processing: regex is instant, free, and deterministic — run it twice, get the same result. LLM calls are slower, cost money, and can introduce errors (it might change wording or misattribute a quote). A good pipeline uses each appropriately for what they're good at — for example, regex first for structure, then an LLM pass for anything that requires judgment.
---
## Part 4 — Make It a Pipeline
You've built three scripts that each do one step: `transcribe.py` calls the Whisper API, `clean-transcript.py` formats the output with regex, and `summarize.py` sends it through an LLM. Right now you run them manually, one at a time, passing the right file paths yourself. That's fine for three files — not for thirty.
Ask Claude Code to write `scripts/pipeline.py` that imports the key functions from your other scripts and chains them together. It should take an audio or video file as input and run all the steps in sequence: extract audio with `ffmpeg` if the input is video, transcribe with Whisper, clean up with regex, and summarize with an LLM. If any step fails, it should stop and print what went wrong.
```bash
python scripts/pipeline.py audio/source.mp3
```
Run it on a couple of your test clips too. Then try pointing it at a folder of files and processing them all — ask Claude Code to add a batch mode that loops over every audio file in a directory. This is the scale payoff: the same pipeline that handles one file handles fifty.
Because this is Python and not a shell script, the functions you wrote are importable — which means this pipeline can eventually be called from a web server, an API route, or another script. You're not locked into running it from the terminal.
---
## Where This Goes Next
You now have a pipeline that turns audio into structured markdown. That markdown is raw material — it's what you *build with*, not what you ship. Here are some concrete ways transcripts become web content for a final project:
- **Lecture notes site.** Run the pipeline on a series of lectures. Use the LLM summaries as page content and the timestamped transcripts as a "full transcript" toggle underneath. Add a search bar so readers can find where a concept was discussed across all the lectures — something a stack of audio files could never do.
- **Interview explorer.** Transcribe several interviews on the same topic. Use an LLM pass to extract quotes and tag them by theme. Build a page where readers browse by theme and see quotes from different speakers side by side, with links back to the full transcript and timestamp.
- **Podcast episode guide.** Transcribe episodes, generate per-episode summaries and topic lists, and render them as navigable pages. Add an embedded chatbot (the other project direction) that can answer questions about what was said across episodes.
- **Oral history archive.** Record and transcribe people telling stories — family members, community figures, classmates. The cleaned transcripts become the content; the web layer adds structure, context, and searchability that raw audio doesn't have.
- **Feed it into another pipeline.** The transcript doesn't have to *be* the content. Use it as input to a content generation agent — a transcript of a physics lecture becomes the source material for an interactive textbook page with equations (see the LaTeX quest) and explorable diagrams.
The pattern is the same in every case: the pipeline produces `.md` files, and those files become the `_content` that your Next.js app renders. The transcript is the bridge between something someone *said* and something someone can *explore on the web*.
---
## Key Concepts
| Concept | What It Means |
|---------|--------------|
| Transcription | Converting spoken audio into written text — here done by the Whisper API |
| Whisper | OpenAI's speech-to-text model — send it audio, get back text with timestamps |
| `prompt` parameter | A text hint you pass to Whisper to help with spelling, names, and jargon — context engineering for speech-to-text |
| `language` parameter | Forces Whisper to treat audio as a specific language instead of auto-detecting |
| Post-processing | Cleaning up raw output — adding paragraph breaks, timestamps, formatting |
| Pipeline | A sequence of processing steps where each step's output feeds into the next |