The Codec Doing the Heavy Lifting
You send a voice message. Thirty seconds, maybe less. Your friend sends back a music file in reply, and when you play them back to back through the same cheap earbuds, yours sounds like it was recorded inside a biscuit tin while the song comes through clean, wide, and full. Same app, same connection, same phone. So what exactly is going on?
The app is using completely different compression strategies for each type of audio, and it's doing so on purpose. Understanding why means looking at what compression actually does, not just what it is.
Compression is not magic. It's controlled loss. An algorithm studies the audio, decides which parts of the signal a human ear is least likely to notice, and throws those parts away, keeping a smaller file that sounds, under most conditions, close enough to the original. The key phrase is "close enough," and that threshold is defined very differently depending on whether you're sending a voice memo or a song.
Built for Speech, Not Symphonies
Voice messages in most messaging apps are encoded with codecs built or tuned for speech at low bitrates. Opus is the most common; WhatsApp uses it, Telegram uses it, and it's baked into WebRTC, the technology stack underneath most browser-based calling. (Opus can also handle music at higher bitrates, but voice notes use its low-bitrate speech settings.) Older apps still lean on AMR (Adaptive Multi-Rate), built for mobile networks in the late 1990s and optimised for exactly one thing: intelligible human speech at very low bitrates.
These codecs work by modelling the human vocal tract. They know that speech sits mostly between 300 Hz and 3,400 Hz, that consonants like "s" and "f" carry more meaning than the spaces between words, and that a listener's brain does enormous reconstructive work to fill gaps. So they compress aggressively. Opus can keep speech intelligible at 6 kbps. Six. A CD-quality stereo track runs at 1,411 kbps, so that is a compression ratio of roughly 235 to 1, one that would turn a music file into something that sounds like it's being broadcast from the bottom of a swimming pool.
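To put numbers on that comparison, here is a minimal Python sketch of the arithmetic. The helper name pcm_bitrate_kbps is made up for illustration; the CD parameters (44,100 samples per second, 16 bits per sample, two channels) are the standard ones.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


cd_kbps = pcm_bitrate_kbps(44_100, 16, 2)  # 1411.2
voice_kbps = 6                             # the low-end Opus figure quoted above
ratio = cd_kbps / voice_kbps               # about 235

print(f"CD audio: {cd_kbps:.1f} kbps, ratio vs 6 kbps voice: {ratio:.0f}:1")
```

The ratio is only a statement about data rates. It says nothing about how the result sounds, which depends on what the codec chooses to keep.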
When you record a voice note, the app grabs the microphone signal, strips everything above roughly 8 kHz (sometimes 12 kHz in better implementations), crushes the dynamic range, and encodes at somewhere between 8 and 32 kbps. Your voice is recognisable. The emotional tone survives. The guitar chord in the background does not.
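The 8 kHz and 12 kHz cutoffs line up with two of the audio bandwidths the Opus specification defines, which it calls wideband and super-wideband. The sketch below lists all five and checks which ones would keep a given frequency. Which mode any particular app selects is an assumption here, the example frequencies are hypothetical, and a hard cutoff is a simplification of how a real filter rolls off.

```python
# Audio bandwidths of the five Opus modes, in Hz (as defined in RFC 6716).
OPUS_BANDWIDTH_HZ = {
    "narrowband": 4_000,
    "mediumband": 6_000,
    "wideband": 8_000,
    "super-wideband": 12_000,
    "fullband": 20_000,
}


def survives(freq_hz: float, mode: str) -> bool:
    """True if a component at freq_hz falls inside the mode's passband."""
    return freq_hz <= OPUS_BANDWIDTH_HZ[mode]


# Hypothetical content: a vowel formant and the upper shimmer of a guitar chord.
for name, freq in [("vowel formant", 2_500), ("guitar shimmer", 10_000)]:
    kept = [mode for mode in OPUS_BANDWIDTH_HZ if survives(freq, mode)]
    print(f"{name} at {freq} Hz survives in: {', '.join(kept)}")
```

Under these assumptions the vowel makes it through every mode, while the 10 kHz shimmer only survives in super-wideband and fullband. That is the guitar chord disappearing.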
The Music File Takes a Different Path
When you send a music file, say an MP3 or an AAC, the app usually treats it as a document or attachment, not as audio to be re-encoded. This is the critical distinction. The file passes through as-is, or gets lightly transcoded at a much higher bitrate using a perceptual audio codec like AAC or MP3, both designed to preserve the full 20 Hz to 20,000 Hz range of human hearing.
Perceptual audio codecs use psychoacoustic models. They know that a loud sound can mask a quieter one at a similar frequency (auditory masking), so they spend few or no bits on the masked signal and save them for the prominent one. They preserve stereo width, transient detail, high-frequency shimmer. A music file encoded at 128 kbps AAC sounds, to most listeners in most conditions, indistinguishable from the original. At 256 kbps, almost nobody can reliably tell the difference in a blind test.
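A toy version of the masking idea, in Python, might look like the sketch below. It is not how AAC or MP3 actually work: real encoders use critical bands, spreading functions, and variable quantisation rather than simply dropping tones. The Tone class, the audible_tones function, and both thresholds (max_ratio and margin_db) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tone:
    freq_hz: float
    level_db: float


def audible_tones(tones, max_ratio=1.2, margin_db=20.0):
    """Keep tones that no nearby, much louder tone masks (toy rule).

    A tone counts as masked when another tone lies within a frequency
    ratio of max_ratio and is at least margin_db louder.
    """
    kept = []
    for t in tones:
        masked = any(
            m is not t
            and max(m.freq_hz, t.freq_hz) / min(m.freq_hz, t.freq_hz) <= max_ratio
            and m.level_db - t.level_db >= margin_db
            for m in tones
        )
        if not masked:
            kept.append(t)
    return kept


tones = [Tone(1000, 70), Tone(1100, 40), Tone(5000, 40)]
kept = audible_tones(tones)
print([t.freq_hz for t in kept])  # [1000, 5000]
```

The quiet 1,100 Hz tone sits right next to a much louder 1,000 Hz tone, so the toy rule drops it. The equally quiet 5,000 Hz tone has no loud neighbour and stays.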
The voice codec and the music codec are solving genuinely different problems. One asks: can you understand what this person said? The other asks: can you hear this the way the artist intended? Those are not the same question. They don't deserve the same answer, and it's a small frustration that most people never get told this.
A Worked Example Worth Sitting Down With
Take Priya and James. Priya sends James two clips through a messaging app: a 30-second voice note of her humming a melody, and a 30-second MP3 of the same melody played on a piano.
Priya's voice note arrives encoded at 16 kbps Opus. The file is about 60 kilobytes. Her voice is clear, but the room's ambient tone is gone, the upper harmonics have been trimmed, and through good headphones you'd catch a faint, slightly hollow quality in the vowels. Totally usable for a message. Not suitable for a demo reel.
The piano MP3 arrives at 192 kbps, roughly 720 kilobytes. The attack of each key, the sustain, the gentle decay into silence: all present. You can hear the room. James plays it back through the same cheap earbuds and thinks it sounds great.
Same app. Same 30 seconds. Twelve times the data, and a qualitatively different listening experience. The app didn't make a mistake. It made a choice.
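The file sizes above follow directly from bitrate times duration. A minimal Python sketch, assuming 1 kilobyte means 1,000 bytes and ignoring container overhead (the function name file_size_kb is made up for illustration):

```python
def file_size_kb(bitrate_kbps: float, seconds: float) -> float:
    """Approximate payload size in kilobytes: kilobits per second x seconds / 8."""
    return bitrate_kbps * seconds / 8


voice_kb = file_size_kb(16, 30)   # 60.0
piano_kb = file_size_kb(192, 30)  # 720.0

print(voice_kb, piano_kb, piano_kb / voice_kb)  # 60.0 720.0 12.0
```

Real files will come out slightly larger once headers and metadata are included, but the twelve-to-one ratio holds.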
What People Misread About This
The common assumption is that low voice-note quality is a bug, a cost-cutting move, or a sign that the app is throttling audio. It isn't. It's the appropriate tool for the job, applied correctly.
Where people genuinely do get shortchanged is when apps re-encode incoming audio attachments. Some platforms transcode every audio file that passes through their servers, regardless of original format, to reduce storage costs. If an app takes your 256 kbps AAC file, runs it through a voice-optimised codec at 24 kbps, and delivers that to your friend, that is a different story. That's lossy-on-lossy compression, and each generation of re-encoding stacks the losses. Think of it like photocopying a photocopy: the degradation compounds, and at some point the thing you started with is unrecoverable.
Ask yourself: how many times has a song sounded weirdly muffled after being forwarded a few times through a chat thread? That's not your imagination.
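The photocopy effect can be sketched with a toy model in Python. Here "encoding" is nothing more than rounding integer samples to a grid, and each generation uses a different grid. Real codecs quantise frequency-domain data with far more care, and the sample values and step sizes below are hand-picked for illustration, so treat this as a picture of the mechanism, not a measurement.

```python
def quantize(samples, step):
    """Round each non-negative integer sample to the nearest multiple of step."""
    return [((s + step // 2) // step) * step for s in samples]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sample lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


original = [1000, 2300, 4100, 5200]

gen1 = quantize(original, 300)  # first lossy encode
gen2 = quantize(gen1, 400)      # re-encoded by a platform with a coarser grid
gen3 = quantize(gen2, 300)      # re-encoded again at the original quality

for label, gen in [("gen1", gen1), ("gen2", gen2), ("gen3", gen3)]:
    print(label, gen, mean_abs_error(original, gen))
# gen1 [900, 2400, 4200, 5100] 100.0
# gen2 [800, 2400, 4400, 5200] 150.0
# gen3 [900, 2400, 4500, 5100] 175.0
```

The third generation uses the same grid as the first, yet it ends up further from the original. Going back to a higher-quality setting does not undo what the middle generation threw away.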
If you're sending something where quality matters, the safest path is a file-sharing service that delivers the original bytes untouched, not a messaging app that may or may not transcode on arrival.
The Codec Is Not Your Enemy
Voice codecs are genuinely impressive engineering. The fact that Opus can carry a comprehensible, emotionally legible human voice at 8 kbps, well under one percent of the data rate of uncompressed CD audio, is not a failure of ambition. It's a deliberate, well-understood trade-off that keeps messaging apps fast on slow connections and storage cheap at scale.
The next time a voice note sounds a little thin, you're not hearing a flaw. You're hearing the exact boundary of what the algorithm decided you needed to hear.
It kept the meaning and dropped the texture. That's a reasonable deal for a thirty-second voice memo. It's a terrible deal for anything you actually care about sounding good, and knowing the difference is most of the work.