From Air Pressure to Speech: Exploring the Classical TTS Pipeline

## Introduction

This is my first long-form technical blog post, so apologies in advance for any rough edges. I hope it helps you learn something or gets you interested in the audio side of LLMs.

I wrote this post to walk through the TTS stack from scratch: how air pressure becomes waveform data, how waveforms become spectrogram representations, and how neural models use those to generate speech. If you already know the basics of signal capture and representation, feel free to skip ahead to the acoustic-model and vocoder sections.

This blog is part 1 of a 4-part series:

1. **Blog 1:** From audio to mel spectrogram - the classical way of representing speech for TTS
2. **Blog 2:** Discrete audio tokens
3. **Blog 3:** Codecs
4. **Blog 4:** Nano-TTS

Blogs 3 and 4 will be pretty cool. If you've read Sebastian Raschka's post [The Big LLM Architecture Comparison](https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison), you know he keeps an awesome reference comparing lots of LLM architectures. I wanted to do something similar for Blog 3 - a go-to reference for speech codecs. There are a bunch of different codecs out there in research and production, but there isn't really one place that explains them all well. That's what I'm aiming for.

Blog 4 is inspired by Andrej Karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT). I'm planning to release a small educational repo called **Nano-TTS** - basically a minimal but flexible pipeline:

```
dataset → model → speech output
```

It'll support multiple model and codec configs so people can easily play around with TTS research. Hope this series helps spark some interest in the audio side of ML!

This first post sticks to the **classical end-to-end stack**:

```
waveform → spectrogram → mel → acoustic model → vocoder
```

The codec/tokenizer stuff comes later, where I'll explain why speech tokenizers are becoming the go-to architecture for modern TTS.

---

## Sound is a pressure wave

Sound is just air pressure moving back and forth. When someone speaks, the air near the speaker is repeatedly squeezed and released. At any moment, that pressure has one value, and over time it becomes a curve. The microphone is just a sensor: it captures those pressure changes and converts them into an analog electrical signal. A microphone contains a thin membrane called a diaphragm.

<figure style="text-align:center; margin:2rem 0; background:#ffffff; padding:18px; border-radius:10px; border:1px solid rgba(0,0,0,0.1);"> <img src="https://raw.githubusercontent.com/PranavGovindu/blogggg/main/blog/classicaudio/sound_physical_manifestation.svg" style="max-width:720px; width:100%;"> <figcaption style="margin-top:10px; font-size:1.3rem; color:#666;"> Sound as a physical pressure phenomenon. Source: Wikimedia Commons. </figcaption> </figure>

Speech isn't stored as separate tracks like "words", "style", and "speaker". All of that gets mixed together in one pressure signal. At this point, nobody's decoded phonemes, intonation, or identity yet - it's still just a raw physical waveform carrying everything at once.

---

## Microphone internals

A microphone is a **transducer**: it converts mechanical air-pressure movement into an electrical signal. In a dynamic mic, air pressure moves a diaphragm and coil, which induces a voltage in a magnetic field. In a condenser mic, air-pressure motion changes the spacing in a capacitor, which changes voltage under a fixed polarization. In both designs the output is continuous, not discrete.

That output is an analog voltage waveform whose shape follows the original pressure waveform. The mic does not parse phonemes, timing, prosody, or identity; it only converts physics into an electrical representation that still contains all those factors mixed together.

The point of converting sound to electricity isn't analysis - it's translation. Air pressure is tough for computers to handle, but voltage?
That's something we can amplify, measure, and digitize pretty easily.

<figure style="text-align:center; margin:2rem 0; background:#ffffff; padding:18px; border-radius:10px; border:1px solid rgba(0,0,0,0.1);"> <img src="https://raw.githubusercontent.com/PranavGovindu/blogggg/main/blog/classicaudio/condenser_microphone.svg" style="max-width:720px; width:100%;"> <figcaption style="margin-top:10px; font-size:1.3rem; color:#666;"> A condenser microphone converts sound pressure into a changing electrical signal. Source: Wikimedia Commons. </figcaption> </figure>

<figure style="text-align:center; margin:2rem 0; background:#ffffff; padding:18px; border-radius:10px; border:1px solid rgba(0,0,0,0.1);"> <img src="https://raw.githubusercontent.com/PranavGovindu/blogggg/main/blog/classicaudio/analog_signal.svg" style="max-width:720px; width:100%;"> <figcaption style="margin-top:10px; font-size:1.3rem; color:#666;"> An analog electrical signal is a continuous waveform that varies over time. Source: Wikimedia Commons. </figcaption> </figure>

The raw mic signal is usually too small to feed directly into converters. A preamplifier applies **gain**, which increases signal voltage to a range where the ADC can use it reliably.

Two practical failure modes appear early: **noise floor** and **clipping**. Noise floor is the baseline hiss from electronics; any part of the signal at or below it is effectively buried and hard to recover. Clipping appears if you push gain too far; waveform peaks flatten, which destroys waveform shape and creates harsh distortion before digitization. Good front-end gain staging is the baseline requirement for data quality.

---

## ADC: sampling and quantization

The **ADC** (Analog-to-Digital Converter) is the exact point where sound becomes numbers. It performs two operations: **sampling** and **quantization**. By this point, sound is already a changing electrical voltage coming from the microphone.
Imagine that voltage changing smoothly over time: at one moment it might be +0.12 V, a tiny moment later +0.10 V, then -0.05 V. The signal is continuous and always moving.

I am writing these as ordinary voltage values just to explain what the ADC is measuring. Later, when we look at digital waveforms, samples are usually shown relative to a center line, so they can appear as positive or negative values. That does not contradict the ADC idea; it is just a different way of representing the same changing signal.

**Sampling** is the step where the ADC checks that voltage at fixed time intervals. If the sample rate is `24 kHz`, it checks the signal `24,000` times every second. So if the ADC looks at four different moments, it might record:

- sample 1 → +0.12 V
- sample 2 → +0.10 V
- sample 3 → -0.08 V
- sample 4 → -0.15 V

It does not average them together. It simply stores the value at each sampling moment. That is sampling: choosing **when** to measure the signal.

<figure style="text-align:center; margin:2rem 0; background:#ffffff; padding:18px; border-radius:10px; border:1px solid rgba(0,0,0,0.1);"> <img src="https://raw.githubusercontent.com/PranavGovindu/blogggg/main/blog/classicaudio/signal_sampling_wikimedia.svg" style="max-width:720px; width:100%;"> <figcaption style="margin-top:10px; font-size:1.3rem; color:#666;"> Sampling: the continuous signal is measured at fixed moments in time. Source: Wikimedia Commons. </figcaption> </figure>

**Quantization** happens next. The computer cannot store every possible voltage value exactly, so each measured sample is rounded to the nearest allowed stored level. For example, if the measured value is +0.123 V, it might be stored as +0.12 V. That rounding step is quantization.
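
To make the two steps concrete, here is a minimal sketch in plain Python. It assumes a made-up 440 Hz test voltage, a full-scale range of ±1.0 V, and a simple symmetric rounding scheme; real converters differ in the details, and the function names are only illustrative.

```python
import math

def sample_signal(signal, sample_rate_hz, duration_s):
    """Sampling: measure a continuous signal (a function of time) at fixed intervals."""
    n_samples = int(round(sample_rate_hz * duration_s))
    return [signal(n / sample_rate_hz) for n in range(n_samples)]

def quantize(value, bit_depth, full_scale=1.0):
    """Quantization: round a value to the nearest allowed signed integer level."""
    max_level = 2 ** (bit_depth - 1) - 1           # 32767 for 16-bit (assumed symmetric range)
    clipped = max(-full_scale, min(full_scale, value))  # anything beyond full scale clips
    return int(round(clipped / full_scale * max_level))

def voltage(t):
    """Hypothetical microphone voltage: a 440 Hz tone with 0.15 V amplitude."""
    return 0.15 * math.sin(2.0 * math.pi * 440.0 * t)

# One millisecond of audio at 24 kHz gives 24 measurements.
samples = sample_signal(voltage, sample_rate_hz=24_000, duration_s=0.001)
pcm = [quantize(v, bit_depth=16) for v in samples]
```

The list `pcm` is the kind of integer sequence the rest of this post calls PCM: one rounded value per sampling moment.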
<figure style="text-align:center; margin:2rem 0; background:#ffffff; padding:18px; border-radius:10px; border:1px solid rgba(0,0,0,0.1);"> <img src="https://raw.githubusercontent.com/PranavGovindu/blogggg/main/blog/classicaudio/pcm_wikimedia.svg" style="max-width:720px; width:100%;"> <figcaption style="margin-top:10px; font-size:1.3rem; color:#666;"> Quantization and PCM: once a sample has been measured, its value is rounded to one of the allowed stored levels. Source: Wikimedia Commons. </figcaption> </figure>

**Digital audio is discrete in two ways**:

- **discrete in time** (sampling)
- **discrete in amplitude** (quantization)

That final stored format is **PCM** (Pulse Code Modulation): a list of sample values over time. If you sample too rarely, you miss fast changes in the signal. If you quantize too coarsely, you keep the timing but lose amplitude detail and introduce rounding error, which is called quantization noise.

<figure style="width:120%; margin-left:-10%; text-align:center; margin:2rem 0; background:#ffffff; padding:18px; border-radius:10px; border:1px solid rgba(0,0,0,0.1);"> <img src="https://raw.githubusercontent.com/PranavGovindu/blogggg/main/blog/classicaudio/Generated%20Image%20March%2012,%202026%20-%207_42PM.png" style="max-width:900px; width:100%;"> <figcaption style="margin-top:10px; font-size:1.3rem; color:#666;"> Sampling decides when we look at the signal. Quantization decides how each measured value gets rounded before storage. </figcaption> </figure>

Once sound has been sampled and quantized, it is no longer just air pressure or voltage; it is now a list of numbers. The next question is what those numbers actually mean, and that brings us to sample rate, bit depth, and dB.

---

## Sample rate, bit depth, kHz, dB

Now we can name the four terms that show up everywhere in digital audio: sample rate, bit depth, kHz, and dB.

**Sample rate** tells us how often measurements are taken.
If the sample rate is `24 kHz`, the ADC takes `24,000` samples every second. Here `kHz` just means "thousands per second". So sample rate is a statement about **time resolution**.

This directly limits the highest frequency we can represent. According to the **Nyquist-Shannon sampling theorem**, to capture a frequency f without aliasing, the sample rate must be greater than 2f. So the highest frequency you can represent is just under half the sample rate. That is why `16 kHz` audio reaches about `8 kHz`, `24 kHz` reaches about `12 kHz`, and `44.1 kHz` reaches about `22.05 kHz`.

**Bit depth** answers a different question. It does not say when we measure the signal; it says how many possible stored levels each sample can use after quantization. So bit depth is about **amplitude precision**. With `3-bit`, each sample can use `2^3 = 8` levels. With `16-bit`, it can use `2^16 = 65,536` levels. With `24-bit`, it can use `2^24 = 16,777,216` levels. Each additional bit increases dynamic range by roughly 6 dB. For example, 16-bit provides approximately 96 dB of dynamic range, while 24-bit provides approximately 144 dB.

Suppose the ADC gives us sample values like `120, 127, 118, 130`. That sequence is the **PCM** data itself: raw sampled values over time. If we put those same values into a `.wav` file, the sound is still the same **PCM** audio; `WAV` typically stores PCM, but it can also store other encodings. If we save the same recording as `FLAC`, those **PCM** values are compressed losslessly. If we save it as `MP3` or `AAC`, those are lossy formats, so they make the file smaller by discarding some information.

<figure style="text-align:center; margin:2rem 0; background:#ffffff; padding:18px; border-radius:10px; border:1px solid rgba(0,0,0,0.1);"> <img src="https://raw.githubusercontent.com/PranavGovindu/blogggg/main/blog/classicaudio/image-1.png" style="max-width:500px; width:100%;"> <figcaption style="margin-top:10px; font-size:1.3rem; color:#666;"> Higher bit depth gives each sample more possible stored levels.
</figcaption> </figure>

The last term is `dB`, or decibel. A decibel (dB) is a logarithmic ratio used to compare signal levels, not a new kind of audio signal. For amplitudes, the level difference is $20 \log_{10}(A_1 / A_2)$, so halving the amplitude lowers the level by about `6 dB`. Instead of saying "this waveform has a bigger raw amplitude number", people usually say it is `3 dB` lower, `6 dB` higher, or close to clipping. In digital audio, the version that matters most is **dBFS** (decibels Full Scale), which means the level is measured relative to the digital maximum.

The easiest thing to remember is this: `0 dBFS` is the top. It is the highest level a digital system can store. If the waveform tries to go above that limit, the peaks get chopped off, and that is clipping. So when people say a recording is peaking at `-6 dBFS` or `-3 dBFS`, they mean it is still below the maximum. When it reaches `0 dBFS`, there is no more headroom left.

<figure style="text-align:center; margin:2rem 0; background:#ffffff; padding:18px; border-radius:10px; border:1px solid rgba(0,0,0,0.1);"> <img src="https://raw.githubusercontent.com/PranavGovindu/blogggg/main/blog/classicaudio/decibel.png" style="max-width:720px; width:100%;"> <figcaption style="margin-top:10px; font-size:1.3rem; color:#666;"> Decibels are a logarithmic way of measuring level. For digital audio, dBFS measures signal level relative to the maximum possible digital value. </figcaption> </figure>

Once we understand sample rate, bit depth, and level, a digital audio file stops looking mysterious. It is just a long sequence of measurements, each taken at a fixed moment and stored with limited precision. The next question is: what does that sequence look like when we actually plot it?

---

## Waveforms

A **waveform** is the simplest way to look at digital audio. You take the stored sample values and plot them against time. The horizontal axis is time. The vertical axis is amplitude. That picture is the waveform.
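
The two axes can be computed directly from PCM data. The sketch below assumes signed 16-bit samples and a hypothetical eight-sample clip; it builds the time axis from the sample rate and reads the peak of the amplitude axis as a level in dBFS, tying the plot back to the previous section. The helper names are mine, not from any library.

```python
import math

def sample_times(n_samples, sample_rate_hz):
    """Time in seconds of each sample: the horizontal axis of a waveform plot."""
    return [n / sample_rate_hz for n in range(n_samples)]

def peak_dbfs(pcm_samples, bit_depth=16):
    """Peak level relative to digital full scale; 0 dBFS is the maximum."""
    full_scale = 2 ** (bit_depth - 1)   # 32768 for signed 16-bit PCM
    peak = max(abs(s) for s in pcm_samples)
    if peak == 0:
        return float("-inf")            # digital silence has no finite level
    return 20.0 * math.log10(peak / full_scale)

# Hypothetical 16-bit samples whose peak is half of full scale.
pcm = [0, 8192, 16384, 8192, 0, -8192, -16384, -8192]
times = sample_times(len(pcm), sample_rate_hz=24_000)
level = peak_dbfs(pcm)   # roughly -6 dBFS, since the peak is half of full scale
```

Plotting `pcm` against `times` with any plotting tool gives the waveform picture described above.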
<figure style="text-align:center; margin:2rem 0; background:#ffffff; padding:18px; border-radius:10px; border:1px solid rgba(0,0,0,0.1);"> <img src="https://raw.githubusercontent.com/PranavGovindu/blogggg/main/blog/classicaudio/Screenshot%20from%202026-03-12%2022-46-20.png" style="max-width:720px; width:100%;"> <figcaption style="margin-top:10px; font-size:1.3rem; color:#666;"> A waveform shows how the signal changes over time. In analog form it is continuous; after digitization it becomes a sequence of discrete samples. </figcaption> </figure>

So if the audio file contains a sequence of samples, the waveform is just the graph of those numbers. For example, imagine a tiny made-up signal like this:

```
time   sample value
t0      0
t1      2
t2      5
t3      3
t4     -1
t5     -4
t6     -2
t7      1
```

If you plot those values from left to right, the line rises, peaks, drops below zero, and comes back up again. That shape is the waveform. A real audio recording just does the same thing with thousands of samples every second instead of eight toy values.

This view is useful because some things are easy to see immediately. If the line goes flat for a while, that is silence or near-silence. If the peaks get taller, the signal is getting stronger. If you see a repeating pattern, that often means there is some regular vibration in the sound, like a voiced vowel. If you see a sudden sharp spike, that may be a transient such as a click, a consonant burst, or a recording artifact.

But a waveform is also limited. It tells you what the signal is doing over time, but it does not clearly tell you which frequencies are present. So if you stare at a waveform, you usually cannot just read off the phoneme, the vowel quality, or the spectral shape of the sound. The information is there, but it is not organized in the most human-readable way.

That is why waveforms are a great first view of audio, but not the final one. They show timing very well, but they hide the frequency structure that matters for speech.
To see that structure more clearly, we need to stop asking "what is the amplitude at each moment?" and start asking "what frequencies are present?"

---

## Frequency and the Fourier view of sound

Up to now, we have looked at audio as a waveform: a signal changing over time. That is useful, but it hides a different question that matters a lot in sound: how fast is the signal repeating?

That is what **frequency** means. If a pattern repeats $f$ times every second, then its frequency is $f$ hertz (`Hz`). If one full cycle takes $T$ seconds, then frequency and period are related by

$$ f = \frac{1}{T} $$

So if a waveform completes one full cycle every `0.01` seconds, then

$$ f = \frac{1}{0.01} = 100 \text{ Hz} $$

If the same kind of cycle happens every `0.005` seconds, the frequency becomes `200 Hz`. This is why faster repetition usually sounds like a higher pitch.

For a perfectly clean tone, the waveform can be written as a sinusoid:

$$ x(t) = A \sin(2\pi f t + \phi) $$

where $A$ is the amplitude, $f$ is the frequency, $t$ is time, and $\phi$ is the phase, which tells us where in the cycle we start. That formula is worth reading slowly. It says that a pure tone is just a repeating pattern whose speed is controlled by $f$, whose size is controlled by $A$, and whose horizontal shift is controlled by $\phi$.

Real speech is not one clean sinusoid. It is a mixture of many periodic and non-periodic parts at once. That is why the waveform of speech looks complicated even when it sounds structured to us.

The **Fourier transform** starts with a powerful claim: even a complicated signal can be understood as a combination of simple waves. Instead of asking only what the signal is doing at each moment, we also ask which frequencies are inside the signal and how strong each one is. That shift in viewpoint is the whole point of the Fourier transform.
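
As a quick sanity check on $f = 1/T$ and the sinusoid formula, here is a small sketch that samples a pure tone and recovers its frequency from the measured period. The tone parameters and the zero-crossing estimator are illustrative choices of mine; the estimator only works for clean single tones, which is exactly why the Fourier view is needed for real speech.

```python
import math

def sinusoid(amplitude, freq_hz, phase_rad, sample_rate_hz, n_samples):
    """Sample x(t) = A * sin(2*pi*f*t + phi) at t = n / sample_rate."""
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz + phase_rad)
        for n in range(n_samples)
    ]

def estimate_frequency(samples, sample_rate_hz):
    """Estimate f = 1 / T by measuring the spacing of upward zero crossings."""
    rising = [n for n in range(1, len(samples)) if samples[n - 1] < 0.0 <= samples[n]]
    if len(rising) < 2:
        return None  # not enough cycles to measure a period
    period_in_samples = (rising[-1] - rising[0]) / (len(rising) - 1)
    period_s = period_in_samples / sample_rate_hz
    return 1.0 / period_s

# One second of a 100 Hz tone, sampled at 8 kHz (assumed toy values).
tone = sinusoid(amplitude=0.5, freq_hz=100.0, phase_rad=0.3,
                sample_rate_hz=8_000, n_samples=8_000)
f_est = estimate_frequency(tone, sample_rate_hz=8_000)  # close to 100 Hz
```

Changing `phase_rad` shifts where the crossings fall but not their spacing, so the estimated frequency stays the same.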
For a continuous-time signal, the Fourier transform is written as

$$ X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-i 2\pi f t}\, dt $$

For each possible frequency $f$, the transform asks: "how much of this frequency is present in the signal?" The output $X(f)$ is complex-valued and tells us the strength and phase of that frequency component.

The inverse transform rebuilds the signal from all those frequency components:

$$ x(t) = \int_{-\infty}^{\infty} X(f)\, e^{i 2\pi f t}\, df $$

So the Fourier transform is not throwing information away. It is changing coordinates. **Time domain** asks for amplitude as a function of time. **Frequency domain** asks for strength as a function of frequency.

Here is a toy example. Suppose a signal is made from two tones:

$$ x(t) = \sin(2\pi \cdot 100\, t) + 0.5 \sin(2\pi \cdot 300\, t) $$

In the time domain, that waveform looks more complicated than either tone alone. But in the frequency domain, you would see two main components:

- one strong peak near `100 Hz`
- one weaker peak near `300 Hz`

That is why the Fourier view is so useful. It can separate what looks mixed together in the waveform into identifiable frequency components.

This matters directly for speech. Voiced speech has a fundamental vibration rate, usually written as $F_0$, which strongly affects perceived pitch. Vowels have bands of concentrated energy, called formants, because the vocal tract boosts some frequencies more than others. Consonants often create very different frequency patterns from vowels. So a large part of speech analysis is really about understanding how energy is distributed across frequency.

In practice, though, we are not working with perfect continuous-time mathematics.
We have sampled audio, which means we have a finite list of numbers:

$$ x[0], x[1], x[2], \dots, x[N-1] $$

For that setting, we use the **Discrete Fourier Transform (DFT)**:

$$ X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-i 2\pi kn / N} $$

and the inverse DFT:

$$ x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] \, e^{i 2\pi kn / N} $$

Again, the meaning matters more than the symbols. $x[n]$ is the sampled waveform. $X[k]$ tells us how much of the signal falls into discrete frequency bin $k$, which sits at $k \cdot f_s / N$ Hz for a sample rate $f_s$. The **FFT** (Fast Fourier Transform) is just an efficient algorithm for computing the DFT quickly. It reduces computational complexity from $O(N^2)$ to $O(N \log N)$, making Fourier analysis practical for real-world digital audio.

There is one more important catch. A single Fourier transform gives you the frequency content of one chunk of signal, but it does not tell you how that content changes over time. Speech keeps changing from moment to moment: pitch moves, vowels change, consonants appear, pauses come and go. So the next step is to compute frequency content on many short windows over time. That gives us the **spectrogram**.

---

### Interactive diagrams

If you want to play with these ideas instead of only reading them, try out the three interactives below. Each one is self-contained: you can tweak the signal, inspect the DFT mechanics, and step through the FFT butterfly.

<div style="margin:2rem 0;"> <div style="font-weight:600; margin-bottom:0.5rem;">1. Fourier transform: build a signal and watch its spectrum change</div> <iframe src="https://raw.githack.com/PranavGovindu/blogggg/main/blog/classicaudio/diagram1_fourier_signal_decomposition.html" loading="lazy" style="width:100%; min-height:650px; border:1px solid rgba(255,255,255,0.08); border-radius:12px; background:#0a0b10; box-shadow:0 16px 34px rgba(0,0,0,0.24);"></iframe> </div>
<div style="margin:2rem 0;"> <div style="font-weight:600; margin-bottom:0.5rem;">2.
DFT math: inspect twiddle factors, per-bin contributions, and the full magnitude spectrum</div> <iframe src="https://raw.githack.com/PranavGovindu/blogggg/main/blog/classicaudio/diagram2_dft_math_interactive.html" loading="lazy" style="width:100%; min-height:650px; border:1px solid rgba(255,255,255,0.08); border-radius:12px; background:#0a0b10; box-shadow:0 16px 34px rgba(0,0,0,0.24);"></iframe> </div>
<div style="margin:2rem 0;"> <div style="font-weight:600; margin-bottom:0.5rem;">3. FFT butterfly: step through the stages and compare DFT vs FFT complexity</div> <iframe src="https://raw.githack.com/PranavGovindu/blogggg/main/blog/classicaudio/diagram3_fft_butterfly_animated.html" loading="lazy" style="width:100%; min-height:650px; border:1px solid rgba(255,255,255,0.08); border-radius:12px; background:#0a0b10; box-shadow:0 16px 34px rgba(0,0,0,0.24);"></iframe> </div>

---

## Spectrograms and mel representations

A waveform tells us how the signal changes over time, but it does not give a very readable picture of which frequencies are present at each moment. Speech keeps changing too quickly for one global Fourier transform to be enough. A vowel, a fricative, and a pause can all occur within a fraction of a second, so we need a representation that keeps both **time** and **frequency** visible at once. That representation is the **spectrogram**.

The standard way to build a spectrogram is the **Short-Time Fourier Transform (STFT)**. Instead of taking one Fourier transform over the entire recording, we take a short window of audio, compute its frequency content, slide the window forward a little, and repeat. Each window is usually multiplied by a window function (such as a Hann window) to reduce spectral leakage. Each window gives us one vertical slice: frequency on the y-axis, energy on the color scale, and that slice's position in time on the x-axis. Put all of those slices next to each other and you get a spectrogram.
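
The windowing loop described above can be sketched in a few lines of standard-library Python. This is a toy under stated assumptions: a naive $O(N^2)$ DFT instead of an FFT, a made-up 1 kHz sample rate, and a test signal that switches from 100 Hz to 300 Hz halfway through. In practice you would reach for an FFT-based library routine instead.

```python
import cmath
import math

def hann_window(length):
    """Periodic Hann window: tapers each frame toward zero at its edges."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]

def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) version, fine for a toy)."""
    n_points = len(frame)
    return [
        abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                for n in range(n_points)))
        for k in range(n_points // 2 + 1)
    ]

def stft_magnitudes(samples, win_length, hop_length):
    """Slide a window over the signal and take one DFT per position."""
    window = hann_window(win_length)
    slices = []
    for start in range(0, len(samples) - win_length + 1, hop_length):
        frame = [samples[start + n] * window[n] for n in range(win_length)]
        slices.append(dft_magnitudes(frame))
    return slices  # slices[t][k]: time frame t, frequency bin k

SAMPLE_RATE = 1_000  # assumed toy rate, so each bin is 1000 / 100 = 10 Hz wide
signal = [
    math.sin(2.0 * math.pi * (100.0 if n < 500 else 300.0) * n / SAMPLE_RATE)
    for n in range(1_000)
]
spec = stft_magnitudes(signal, win_length=100, hop_length=50)
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spec]
peak_hz = [k * SAMPLE_RATE / 100 for k in peak_bins]
```

Here `peak_hz` starts at 100 Hz and ends at 300 Hz, which is the change over time that a single Fourier transform of the whole clip would blur together.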
If you want to see why one global Fourier transform is not enough, try out the interactive below. It uses a toy `"aaa" -> pause -> "sss"` signal; drag the analysis window and compare the waveform, the full-clip FFT, the local window FFT, and the spectrogram.

<div style="margin:2rem 0;"> <iframe src="https://raw.githack.com/PranavGovindu/blogggg/main/blog/classicaudio/spectrogram-explainer.html" loading="lazy" style="width:100%; min-height:560px; border:1px solid rgba(92,78,59,0.14); border-radius:12px; background:#faf6ee; box-shadow:0 10px 22px rgba(62,44,20,0.08);"></iframe> </div>

But a regular spectrogram still uses a **linear frequency axis**, and that is not how we hear. Human hearing resolves frequency differences much more finely at low frequencies than at very high ones. The **mel scale** is a frequency scale designed to reflect that. It keeps more resolution where our ears care more and compresses the higher-frequency region where equal linear spacing matters less perceptually. The mel spectrogram is computed by applying a bank of triangular filters spaced on the mel scale to the STFT magnitude spectrum, usually followed by a log to compress the dynamic range.

This turned out to be a very practical representation for TTS. A mel spectrogram has far fewer time steps than the waveform it came from, and it already organizes the sound into a structured time-by-frequency view. Instead of modeling every sample directly, a system can model this more compact acoustic summary first.

That is why mel spectrograms became such a central intermediate representation in classical neural TTS. The mel representation discards phase information and compresses frequency resolution at higher frequencies. This information loss is one reason a separate vocoder is required to reconstruct natural-sounding waveform audio. But to see why they worked so well, we have to follow the rest of the stack: how text is prepared, how a model predicts mel frames, and how those frames finally become audible waveform again.
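
To show what "triangular filters spaced on the mel scale" means in practice, here is a minimal sketch. It assumes the HTK-style mel formula $m = 2595 \log_{10}(1 + f/700)$, which is one common convention among several, and picks 10 filters, a 512-point FFT, and 16 kHz audio purely for illustration. Libraries differ in normalization and in the exact mel formula, so treat the numbers as a toy.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style mel formula (one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate_hz):
    """Triangular filters whose edges are evenly spaced on the mel axis."""
    mel_max = hz_to_mel(sample_rate_hz / 2.0)
    edges_hz = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate_hz / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for i in range(n_mels):
        left, center, right = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        # Weight rises linearly from left to center, then falls to zero at right.
        bank.append([
            max(0.0, min((f - left) / (center - left), (right - f) / (right - center)))
            for f in bin_hz
        ])
    return bank

def log_mel_frame(magnitudes, bank, eps=1e-10):
    """Turn one STFT magnitude slice into one log-mel frame."""
    return [math.log(sum(w * m for w, m in zip(row, magnitudes)) + eps) for row in bank]

bank = mel_filterbank(n_mels=10, n_fft=512, sample_rate_hz=16_000)
# Number of FFT bins each filter covers: narrow at low frequencies, wide at high ones.
widths = [sum(1 for w in row if w > 0.0) for row in bank]
```

The growing values in `widths` are the compression described above: many linear-frequency bins at the top of the spectrum get pooled into a single mel band, while low frequencies keep finer detail.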

---

## From text to mel: normalization, phonemes, and acoustic decoding

Classical TTS systems did not usually go straight from raw text to raw waveform. The front of the pipeline was still symbolic. First the system cleaned up the text into something closer to spoken language. If the input is `"Dr. Smith has 3 apples."`, a normalization step might turn that into `"Doctor Smith has three apples."` before the model ever predicts any sound.

After that, many systems converted the text into a pronunciation-oriented representation such as phonemes or other linguistic units. That step matters because spelling and pronunciation do not always line up cleanly. `"read"` can be pronounced in more than one way, and `"cat"` is easier to model as a sound sequence like `/k ae t/` than as three arbitrary characters. This is the last stage where the system is still working with symbols rather than acoustics.

The easiest way to think about the next stage is as a move from a **script** to a **performance plan**. The normalized text and phonemes tell the system what should be said and how it should be pronounced, but they still do not say how that sentence should occupy time. A short stop consonant, a long vowel, and a pause all need very different amounts of acoustic space. So before the model can produce sound, it needs a timeline: where each unit starts, how long it lasts, and how the utterance flows from one moment to the next.

That is what the encoder and alignment logic are really doing. The encoder builds a contextual representation of the sentence, so each unit is understood in context rather than isolation. Then the system expands that symbolic sequence into a time-aware internal plan. In more intuitive terms, it is deciding things like: this vowel should stretch, this consonant should be brief, this word should carry more emphasis, this phrase should slow down slightly at the end. Some systems learn that timing implicitly; others predict durations more explicitly.
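
The symbolic front end and the explicit-duration idea can be sketched with toy lookup tables. Everything below is hypothetical: the abbreviation table, the one-word lexicon, and the frame durations are made-up values, whereas real systems use large rule sets, pronunciation dictionaries or learned grapheme-to-phoneme models, and durations predicted by the model.

```python
# All tables below are hypothetical toy data, not a real lexicon or model output.
ABBREVIATIONS = {"Dr.": "Doctor"}
NUMBER_WORDS = {"3": "three"}
LEXICON = {"cat": ["k", "ae", "t"]}

def normalize(text):
    """Rewrite written-form tokens into spoken-form words."""
    words = []
    for token in text.split():
        token = ABBREVIATIONS.get(token, token)
        token = NUMBER_WORDS.get(token, token)
        words.append(token)
    return " ".join(words)

def to_phonemes(word):
    """Look up a pronunciation; a real system needs a fallback for unknown words."""
    return LEXICON[word.lower()]

def expand_by_duration(units, durations_in_frames):
    """Repeat each unit for its duration, giving one symbol per mel frame."""
    if len(units) != len(durations_in_frames):
        raise ValueError("need exactly one duration per unit")
    timeline = []
    for unit, n_frames in zip(units, durations_in_frames):
        timeline.extend([unit] * n_frames)
    return timeline

spoken = normalize("Dr. Smith has 3 apples.")
phonemes = to_phonemes("cat")
# Assumed durations: a brief /k/, a stretched vowel, a short /t/.
timeline = expand_by_duration(phonemes, [3, 9, 4])
```

The resulting `timeline` has 16 entries, one per output frame, which is the kind of time-aware plan that the next stage turns into acoustics. Systems that learn timing implicitly, for example through attention, never build this list explicitly.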
That time-laid-out representation is what the **acoustic decoder** turns into a mel spectrogram. A good mental model is to think of the acoustic decoder as painting the utterance one thin time-slice at a time, often frame by frame. Each mel frame is a small vertical slice of the future sound: how much low-, mid-, and high-frequency energy should exist in that moment. The decoder does not yet care about the exact waveform wiggles. It is building an acoustic storyboard. Over many frames, those slices become a full mel spectrogram that describes how the sentence should sound as it unfolds.

This is the critical point in the classical stack: the model is not predicting waveform samples directly. It is predicting a mel spectrogram, frame by frame, as an acoustic plan for the utterance. That is why mel was such a useful target. It keeps the time structure of speech, it keeps the broad spectral structure of speech, and it is compact enough that the model can focus on the shape of the utterance instead of every microscopic oscillation.

This target is way shorter than waveform audio. A one-second waveform at `24 kHz` has `24,000` samples. A one-second mel representation? Only around `80` to `100` frames, depending on hop size, with each frame holding a fixed number of mel values (80 is a common choice). That's the whole point - the model doesn't need to learn every tiny wiggle in the waveform. It just predicts a compact acoustic representation that captures the stuff that matters.

---

## Vocoder

A mel spectrogram is useful, but it is not playable audio. It is more like a structured description of the sound than the sound itself. If you stop at the mel spectrogram, you have something a model can reason about, but not something a speaker can play. A speaker needs a waveform: a dense stream of samples that eventually become electrical motion, membrane motion, and finally pressure changes in air again. That missing stage is the **vocoder**. Examples include WaveNet, WaveGlow, and HiFi-GAN.
The cleanest way to think about it is this: if the acoustic decoder writes the **plan**, the vocoder performs the **rendering**. If the mel spectrogram is an acoustic storyboard, the vocoder is the stage that turns that storyboard into the actual moving signal. It looks at the time-frequency pattern in the mel and asks: what waveform would produce this pattern if we analyzed it again?

That question is much harder than it looks. The mel spectrogram says where energy should appear over time and frequency, but it does not directly contain the exact waveform wiggles or the fine phase structure of the signal. The vocoder has to fill that detail back in. It has to generate a waveform whose local frequency content, loudness structure, and timing match the mel spectrogram closely enough that the result sounds like natural speech.

This also clarifies the division of labor inside the pipeline. The acoustic decoder answers the question, *what should the utterance look like in mel space?* The vocoder answers the question, *what waveform would realize that plan in actual audio?* The first stage works in a compact acoustic representation; the second stage converts that representation into something physical speakers can reproduce.

The scale jump here is pretty big. A mel representation might run at roughly 80-100 frames per second, while the final waveform could be 16,000 or 24,000 samples per second. So the vocoder takes a relatively low-rate acoustic sketch and blows it up into a very high-rate signal. That's why the classical stack worked so well: the main model handled the structured prediction job, and the vocoder handled the expensive work of rebuilding detailed waveform audio.
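
The 80-100 frames-per-second figure falls straight out of the hop size. A minimal sketch of that arithmetic, assuming a 24 kHz sample rate, three plausible hop lengths, and 80 mel bins per frame (all illustrative values, not settings from a specific model):

```python
def frames_per_second(sample_rate_hz, hop_length):
    """One mel frame is produced every hop_length samples."""
    return sample_rate_hz / hop_length

def samples_to_generate(n_frames, hop_length):
    """Waveform samples the vocoder must produce for a given number of frames."""
    return n_frames * hop_length

def mel_values_per_second(sample_rate_hz, hop_length, n_mels):
    """Total numbers per second in the mel representation (frames x bins)."""
    return frames_per_second(sample_rate_hz, hop_length) * n_mels

# Assumed hop lengths at 24 kHz: 240, 256, and 300 samples per frame.
rates = {hop: frames_per_second(24_000, hop) for hop in (240, 256, 300)}
upsampled = samples_to_generate(n_frames=94, hop_length=256)
```

So each mel frame stands in for a few hundred waveform samples, and the vocoder's upsampling factor is simply the hop length. Note that the saving is mostly in sequence length: with 80 bins per frame, `mel_values_per_second(24_000, 300, 80)` is still several thousand numbers per second.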

---

## TTS workflow

Once those pieces are separated, the full classical pipeline becomes much easier to read:

<div style="margin:2rem 0;"> <div style="font-weight:600; margin-bottom:0.5rem;"></div> <iframe src="https://raw.githack.com/PranavGovindu/blogggg/main/blog/classicaudio/tts-scenes.html" loading="lazy" style="width:100%; min-height:640px; border:1px solid rgba(255,255,255,0.08); border-radius:12px; background:#0a0b10; box-shadow:0 16px 34px rgba(0,0,0,0.24);"></iframe> </div>

1. Start with raw text.
2. Normalize it into spoken form.
3. Convert it into phonemes or other pronunciation-oriented units.
4. Use an encoder and alignment logic to build a time-aware internal representation.
5. Use an acoustic decoder to predict a mel spectrogram.
6. Use a vocoder to turn that mel spectrogram into waveform audio.
7. Save or stream the waveform as an audio file.

So the central object in the pipeline is the mel spectrogram. It sits between the symbolic front half of the system and the waveform-rendering back half. The front half decides *what should be said* and *how it should unfold over time*. The mel spectrogram captures that as a compact acoustic representation. The vocoder then turns that representation into audible speech.

For years, that split was a very effective design. It made the problem cleaner, gave researchers a stable intermediate representation, and avoided direct high-rate waveform prediction in the main model. A large amount of neural TTS progress was built on exactly this `text -> mel -> vocoder -> waveform` division of labor.

---

## Current trend

The classical mel-spectrogram pipeline still matters, and many speech systems still use it in some form. But it also has an important limitation: mel spectrograms are still **continuous**, **dense**, and **entangled**. They are much easier to model than raw waveform, but they are still not the kind of clean discrete sequence that language models naturally operate on.
That is why a lot of recent speech work has shifted toward **discrete audio tokens** instead. Instead of predicting a dense continuous mel representation and then handing it to a vocoder, newer systems increasingly turn speech into token sequences produced by neural codecs or learned speech tokenizers. In that setting, speech starts to look less like a giant continuous array and more like something a sequence model can generate directly.

That's the direction where a lot of the excitement is right now. Systems built around codec tokens (like EnCodec, SoundStream, and Mimi-style stacks) plus token-based TTS models like VALL-E and Qwen3-TTS are all part of this shift.

This post stuck to the classical workflow on purpose - it covered how sound becomes signals, spectra, mel representations, and finally waveform again through a vocoder. The next post will pick up from there and cover the newer direction: how speech becomes discrete tokens, and why that shift matters so much for modern TTS.

## Footnotes

I took a lot of inspiration from [Kyutai's blog](https://kyutai.org/codec-explainer). They have even better animations and explain the tokens and codec side, which I will cover in later posts. I have tried to keep this post a bit more formal. Please do look out for the rest of my 4-part series!

If you have read this far, thank you so much! I hope you enjoyed learning about a new field ;)

![thanks](https://i.pinimg.com/736x/d5/ec/32/d5ec3272866abbc9d0c3c13bca073f79.jpg)

If you want to read more:

- [Nyquist-Shannon Sampling Theorem](https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem)
- [Short-Time Fourier Transform (STFT)](https://en.wikipedia.org/wiki/Short-time_Fourier_transform)
- [Mel Scale and Mel Spectrograms](https://en.wikipedia.org/wiki/Mel_scale)
- Neural vocoders:
  - [WaveNet](https://arxiv.org/abs/1609.03499)
  - [WaveGlow](https://arxiv.org/abs/1811.00002)
  - [HiFi-GAN](https://arxiv.org/abs/2010.05646)