How Streaming Audio Codecs Decide What to Throw Away

The Music Is Already Missing Something

Hit play on a song right now. It sounds complete. Full. Maybe even great. But somewhere between the studio master and your earbuds, a codec made thousands of small decisions about what you'd never notice was gone, then threw that information away. Permanently.

This isn't a flaw. It's the whole design. The question worth understanding is how software decides which parts of a sound are expendable.

The answer lives in psychoacoustics, the study of how people perceive sound, including where that perception falls short. Codecs don't just compress audio. They model your ears, find the gaps in your perception, and exploit them.

Your Ears Have Blind Spots, and Codecs Know All of Them

Human hearing isn't a flat, even recorder. It's a lumpy, context-dependent system that evolved to detect predators and voices, not to archive music at perfect fidelity. That evolutionary sloppiness is exactly what lossy compression exploits.

Two mechanisms matter most.

The first is the absolute threshold of hearing. Below a certain sound level, your auditory system simply doesn't register that a sound exists. The threshold isn't flat across frequencies either. You're most sensitive around 1 to 4 kHz, roughly where consonants in speech live, and much less sensitive at very low and very high frequencies. A codec can quietly discard low-level content in those insensitive ranges and you'll never know.
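One widely cited approximation of this curve is Terhardt's formula for the threshold in quiet. The sketch below simply evaluates it at a few frequencies. The function name is made up for illustration, and real encoders may use tabulated or tuned values instead.

```python
import math

def threshold_in_quiet_db(freq_hz):
    """Approximate absolute threshold of hearing in dB SPL (Terhardt's formula)."""
    khz = freq_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

# The curve dips to its lowest point a little above 3 kHz
# and climbs steeply toward both ends of the spectrum.
for f in (100, 1000, 3300, 10000, 16000):
    print(f"{f:>6} Hz: {threshold_in_quiet_db(f):6.1f} dB SPL")
```

Anything quieter than the printed level at a given frequency is, under this approximation, a candidate for being dropped outright.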

The second mechanism is masking. This is the interesting one. When a loud sound plays, it temporarily drowns out quieter sounds near it in frequency and time. A cymbal crash with strong energy around 8 kHz can mask a soft violin harmonic at 7.5 kHz playing at the same moment, which is known as simultaneous masking. A kick drum hit also masks quieter sounds for a brief window after impact, typically 50 to 200 milliseconds depending on the signal. That window is called temporal masking, and it's a gift to codec engineers.

The codec calculates the masking threshold for each short slice of audio as it encodes: the line below which sounds are inaudible given what else is happening at that moment. Anything below that line gets discarded or represented with far fewer bits.
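Here is a deliberately crude sketch of that calculation for a single masking tone. The Hz-to-Bark conversion is Zwicker's standard approximation. Everything else is an assumption chosen for illustration: the triangular spreading shape, the 25 and 10 dB-per-Bark slopes, the 10 dB offset, and the function names. Production psychoacoustic models are considerably more elaborate.

```python
import math

def hz_to_bark(freq_hz):
    """Zwicker's approximation of the Bark (critical band) scale."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)

def masked_threshold_db(masker_hz, masker_db, probe_hz,
                        offset_db=10.0, lower_slope=25.0, upper_slope=10.0):
    """Level below which a probe tone is assumed inaudible next to the masker.

    Assumed triangular spreading: the threshold falls off steeply for probes
    below the masker and more gently for probes above it.
    """
    distance = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
    slope = lower_slope if distance < 0 else upper_slope
    return masker_db - offset_db - slope * abs(distance)

def is_masked(masker_hz, masker_db, probe_hz, probe_db):
    return probe_db < masked_threshold_db(masker_hz, masker_db, probe_hz)

# The cymbal-and-violin case from the text, with made-up levels.
print(is_masked(8000, 80, 7500, 40))  # quiet harmonic next to a loud crash
print(is_masked(8000, 80, 7500, 75))  # harmonic nearly as loud as the crash
```

A real encoder combines thresholds like this from every strong component in the frame, then takes the higher of that combined curve and the threshold in quiet.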

The Bit Budget: How the Math Actually Works

Here's a worked example that makes this concrete.

Imagine a three-second clip of a piano chord followed by silence. The encoder splits the audio into short frames, usually 20 to 50 milliseconds each. For each frame, it runs a Fourier-type transform (an FFT for analysis, and in most codecs the closely related MDCT for the coding itself) to see which frequencies are present and at what levels. Then it applies a psychoacoustic model to calculate the masking threshold for that particular snapshot of sound.
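The framing and analysis step can be sketched with nothing but the standard library. This is a toy under stated assumptions: an 8 kHz sample rate, 256-sample (32 ms) frames with no overlap, a Hann window, and a slow textbook DFT standing in for the FFT or MDCT a real encoder would use. The half-second 440 Hz tone stands in for the piano chord.

```python
import cmath
import math

SAMPLE_RATE = 8000   # assumed, kept low so the naive DFT stays fast
FRAME_SIZE = 256     # 32 ms per frame at this sample rate

def split_frames(samples, frame_size=FRAME_SIZE):
    """Cut the signal into consecutive, non-overlapping frames."""
    last_start = len(samples) - frame_size
    return [samples[i:i + frame_size] for i in range(0, last_start + 1, frame_size)]

def magnitude_spectrum(frame):
    """Hann-windowed DFT magnitudes for bins 0 .. n/2 - 1."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2)]

def dominant_frequency(frame, sample_rate=SAMPLE_RATE):
    spectrum = magnitude_spectrum(frame)
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * sample_rate / len(frame)

def frame_energy(frame):
    return sum(x * x for x in frame)

# Half a second of a 440 Hz tone followed by half a second of silence.
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE // 2)]
signal = tone + [0.0] * (SAMPLE_RATE // 2)
frames = split_frames(signal)

print(len(frames), "frames")
print("first frame peak:", dominant_frequency(frames[0]), "Hz")
print("last frame energy:", frame_energy(frames[-1]))
```

The peak lands on the bin nearest 440 Hz rather than on 440 Hz exactly, because a 256-sample frame at this rate can only resolve frequencies in steps of 31.25 Hz.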

The chord frame is loud and complex. Lots of frequencies are active, lots of masking is happening, and the encoder can get away with coarser quantization (fewer bits per frequency band) without the errors becoming audible. The silence frame is the opposite. Almost nothing is happening, the masking threshold is very low, and the encoder needs to be careful with what little signal remains, or the quantization noise becomes audible as a faint hiss.
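To see what coarser quantization costs, here is a minimal sketch that rounds samples to a grid and measures the error it introduces. The uniform quantizer and the random test signal are illustrative assumptions, not any particular codec's scheme.

```python
import math
import random

def quantize(x, step):
    """Round a value to the nearest multiple of `step`."""
    return step * round(x / step)

def noise_db(samples, step):
    """Mean squared quantization error, expressed in dB."""
    errors = [(quantize(x, step) - x) ** 2 for x in samples]
    return 10 * math.log10(sum(errors) / len(errors))

random.seed(0)  # fixed seed so repeated runs match
samples = [random.uniform(-1.0, 1.0) for _ in range(10000)]

fine = noise_db(samples, 0.01)
coarse = noise_db(samples, 0.02)
print(f"step 0.01: {fine:.1f} dB, step 0.02: {coarse:.1f} dB")
```

Doubling the step size saves roughly one bit per value and raises the noise floor by about 6 dB. In the chord frame that extra noise hides under the masking threshold. In the near-silent frame it has nowhere to hide.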

The encoder has a total bit budget, say 128 kbps or 320 kbps, and it allocates bits across frequency bands dynamically, frame by frame. Loud, busy moments get more bits, because there is more audible content to describe even when each piece of it can be quantized coarsely. Quiet, sparse moments get fewer, not because quality drops, but because fewer bits are needed to stay below the threshold of perception.
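A greedy loop is one simple way to picture this allocation: keep handing the next bit to whichever band currently has the most audible noise, and stop once every band's noise sits under its masking threshold or the budget runs out. The sketch below assumes each extra bit lowers a band's noise by about 6 dB. The signal-to-mask ratios are invented numbers, and the function names are hypothetical. Real encoders use more involved rate control.

```python
def bits_per_frame(bitrate_bps, frame_samples, sample_rate):
    """Average number of bits available to one frame at a constant bitrate."""
    return bitrate_bps * frame_samples // sample_rate

def allocate_bits(signal_to_mask_db, total_bits, db_per_bit=6.02, max_bits=16):
    """Greedy allocation: each bit goes to the band with the worst noise-to-mask ratio."""
    bits = [0] * len(signal_to_mask_db)
    for _ in range(total_bits):
        # How far each band's quantization noise currently sits above its mask.
        nmr = [smr - db_per_bit * b for smr, b in zip(signal_to_mask_db, bits)]
        candidates = [i for i in range(len(bits)) if bits[i] < max_bits]
        if not candidates:
            break
        worst = max(candidates, key=lambda i: nmr[i])
        if nmr[worst] <= 0:
            break  # all remaining noise is already below the masking threshold
        bits[worst] += 1
    return bits

# 128 kbps, 1024-sample frames, 44.1 kHz audio.
print(bits_per_frame(128_000, 1024, 44_100))

# Four bands with made-up signal-to-mask ratios in dB. A negative ratio means
# the band's content is already inaudible and needs no bits at all.
bands = [30.0, 12.0, -5.0, 20.0]
print(allocate_bits(bands, total_bits=20))  # generous budget
print(allocate_bits(bands, total_bits=6))   # tight budget
```

With a generous budget the loop stops early and leaves bits unspent. With a tight one, the loudest offenders are served first and the remaining bands stay noisier than the model would like.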

MP3 (technically MPEG-1 Audio Layer III) brought this approach to the mainstream. AAC, which is what services like Apple Music and YouTube use as a baseline, refined the psychoacoustic model further and improved how it handles transients. Opus, the codec behind much of today's voice and video calling, leans on variable bitrate by default, so the bit budget flexes moment to moment rather than staying fixed.

What People Get Wrong About Bitrate and Quality

Here's the part most guides skip, or get backwards.

Higher bitrate doesn't mean better encoding. It means more room to work with. A well-tuned 128 kbps AAC encoder will beat a carelessly implemented 192 kbps MP3 encoder on most real-world material, because the psychoacoustic model matters more than the raw number. The codec's intelligence is in the model, not the budget.

The folk wisdom that says "always use 320 kbps MP3" needs to die. Past a certain point, you're not getting better audio. You're spending bits on detail that was already below the threshold of hearing.

There's also a widespread belief that streaming services destroy audio quality in ways listeners can reliably detect. The research is messier than audio forums suggest. In controlled blind tests, trained listeners struggle to reliably distinguish high-quality 256 kbps AAC from lossless on typical consumer headphones. The gap exists. It's just smaller than the subjective experience of knowing you're listening to a compressed file would suggest, and expectation bias is doing heavy lifting the codec gets blamed for.

Still, the gap isn't zero. Certain material exposes lossy encoding more than others: sparse acoustic recordings, music with sharp transients like brushed snare or plucked strings, and content with lots of high-frequency air. A solo guitar recorded in a quiet room will reveal compression artifacts that a dense rock mix buries completely.

The Moment It Falls Apart

Consider two people who download the same playlist before a flight. Maya grabs it at the service's default setting, 128 kbps AAC. Priya downloads it at the maximum, 320 kbps. On the plane, through the same pair of over-ear headphones, both listen to the same jazz trio recording. During a quiet passage, Maya hears the piano's upper-register notes arrive slightly smeared, each attack preceded by a faint puff of noise, as if the hammer brushed the string a moment early. Priya doesn't. Neither of them hears a difference during the louder ensemble sections.

That smear is pre-echo, one of the most recognizable codec artifacts. It happens when a sharp transient lands partway through a frame: the quantization noise gets spread across the whole frame, including the quiet stretch before the attack, where nothing is loud enough to hide it. The psychoacoustic model predicted masking that didn't actually exist. Codecs counter this by switching to shorter frames near transients, and newer designs like AAC and Opus generally do it better than early MP3 encoders, but no codec eliminates it entirely at low bitrates.
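Pre-echo and the short-frame fix are easy to reproduce in miniature. The sketch below is a toy, not a codec: it uses a plain DFT in place of the MDCT, a crude uniform quantizer, and a made-up decaying burst as the transient. It codes the same 64 samples once as a single long block and once as four 16-sample blocks, then measures how much error lands in the silence before the attack.

```python
import cmath
import math

def dft(block):
    n = len(block)
    return [sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(coeffs):
    n = len(coeffs)
    return [(sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def quantize_complex(c, step):
    """Round the real and imaginary parts to the nearest multiple of `step`."""
    return complex(step * round(c.real / step), step * round(c.imag / step))

def code_in_blocks(signal, block_size, step):
    """Transform, quantize, and reconstruct the signal one block at a time."""
    decoded = []
    for start in range(0, len(signal), block_size):
        coeffs = dft(signal[start:start + block_size])
        decoded.extend(idft([quantize_complex(c, step) for c in coeffs]))
    return decoded

def pre_transient_noise(signal, decoded, onset):
    """Error energy in the samples that come before the transient."""
    return sum((d - s) ** 2 for s, d in zip(signal[:onset], decoded[:onset]))

ONSET = 48  # 48 samples of silence, then a decaying burst
burst = [math.exp(-k / 5.0) * math.sin(2.1 * k + 0.3) for k in range(16)]
signal = [0.0] * ONSET + burst

long_noise = pre_transient_noise(signal, code_in_blocks(signal, 64, 0.5), ONSET)
short_noise = pre_transient_noise(signal, code_in_blocks(signal, 16, 0.5), ONSET)
print("noise before the attack, one long block:", long_noise)
print("noise before the attack, short blocks:  ", short_noise)
```

With one long block, the quantization error leaks into the silent lead-in. With short blocks, the three silent blocks reconstruct perfectly and the error stays inside the block that contains the burst, where the burst itself can mask it.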

The failure mode is instructive. The codec isn't randomly degrading the signal. It's making a prediction about your perception, and occasionally the prediction is wrong.

The Codec Is Modeling You, Not Just the Music

So here's the real question worth sitting with: when you hear compressed audio and think it sounds fine, is that because the codec did its job, or because you can't tell the difference?

Both, mostly. When a streaming service encodes audio, it isn't just asking how small it can make the file. It's asking what a typical listener is unlikely to notice. The psychoacoustic model is a theory of human hearing baked into software, updated and refined over decades of research.

That model is genuinely impressive. It's also a guess, applied to everyone equally, calibrated against average hearing in controlled conditions. Your ears, your headphones, your listening environment, and your musical training all shift where the perceptual thresholds actually sit for you.

The codec is making a bet about your blind spots. Most of the time, it wins. That streaming audio sounds as good as it does at the bitrates it uses isn't a coincidence. It's forty years of engineers learning, with remarkable precision, exactly how much you won't miss.
