The Mouth Is the Hard Part

You're watching a Portuguese thriller, dubbed into English, and for the first thirty seconds your brain makes peace with it. Then something slips. The actress finishes a word half a beat before her lips do, and suddenly you can't un-see it. You're not watching a character anymore. You're watching a mouth problem.

That gap is the entire problem AI dubbing is trying to solve. Solving it turns out to require rebuilding a human face, one frame at a time.

The audio swap is the easy bit. Modern neural translation pipelines can take a Spanish script, generate an English voice that even clones the original speaker's timbre, and drop it onto a timeline in minutes. What they can't do for free is change what the mouth is doing. A Spanish speaker saying febrero finishes on a rounded "o". The English word "February" finishes on a spread "ee", often with an extra syllable along the way. Same meaning, different face.

So AI dubbing tools don't just overdub. They repaint.

Three Layers Stacked on Top of Each Other

The reconstruction process has three distinct stages, and understanding them separately makes the whole thing click.

Stage one: landmark detection. Before any synthesis happens, the system maps the face. It identifies somewhere between 68 and 468 anchor points (the exact number depends on the model; Google's MediaPipe Face Mesh uses 468) that mark the corners of the lips, their inner and outer contours, the line of the jaw. Every frame of the source video gets its own map. The software now has a working estimate of where the mouth sits in three-dimensional space relative to the camera, even when the actor turns their head.
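To make that concrete, here is a minimal sketch of one thing a landmark map lets you compute: how open the mouth is in a given frame. The landmark names and pixel coordinates are invented for illustration. A real detector returns hundreds of indexed points per frame, not a tidy labeled dictionary like this one.

```python
import math


def mouth_open_ratio(landmarks):
    """Return the lip gap divided by the mouth width.

    `landmarks` maps hypothetical landmark names to (x, y) pixel positions.
    Dividing by the width makes the measure independent of how large the
    face appears in the frame.
    """
    width = math.dist(landmarks["left_corner"], landmarks["right_corner"])
    gap = math.dist(landmarks["upper_lip"], landmarks["lower_lip"])
    if width == 0:
        raise ValueError("mouth corners coincide; cannot normalize")
    return gap / width


# Two made-up frames: lips nearly together, then a wide-open vowel.
closed_frame = {
    "left_corner": (100, 200), "right_corner": (160, 200),
    "upper_lip": (130, 198), "lower_lip": (130, 204),
}
open_frame = {
    "left_corner": (100, 200), "right_corner": (160, 200),
    "upper_lip": (130, 190), "lower_lip": (130, 220),
}

closed_ratio = mouth_open_ratio(closed_frame)
open_ratio = mouth_open_ratio(open_frame)
```

A single ratio like this throws away most of what the map knows, but it shows the principle: once the points exist, mouth shape becomes a number the later stages can compare and change.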

Stage two: phoneme-to-viseme matching. A phoneme is a unit of sound. A viseme is its visual equivalent: the mouth shape a face makes when producing that sound. English has roughly 44 phonemes and about 14 distinct visemes (the exact count depends on whose classification you use), because several phonemes share the same mouth position, which is why lip-reading is genuinely hard. The translated audio track gets broken into its phoneme sequence, each phoneme gets matched to its viseme, and the system now has a target blueprint. At timestamp 2.34 seconds, the mouth should look like this.
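The many-to-one mapping is easy to sketch. The table below is deliberately tiny and the viseme labels are made up for this example; the phoneme symbols loosely follow ARPAbet-style spelling. Production systems use fuller tables and take their timestamps from a forced aligner, which this sketch simply assumes has already run.

```python
# Illustrative, incomplete table: several phonemes collapse onto one viseme.
# Labels such as "lips_closed" are hypothetical, not from any standard.
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_to_teeth", "v": "lip_to_teeth",
    "uw": "rounded", "ow": "rounded",
    "iy": "spread",
    "aa": "open",
}


def viseme_timeline(timed_phonemes):
    """Convert (phoneme, start_seconds) pairs into viseme targets.

    Consecutive phonemes that share a viseme are merged, because the
    mouth does not need to move between them. Unknown phonemes fall
    back to a "neutral" shape.
    """
    timeline = []
    for phoneme, start in timed_phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        if timeline and timeline[-1][0] == viseme:
            continue  # same mouth shape as the previous target
        timeline.append((viseme, start))
    return timeline


# "m" and "b" look identical on the lips, so they merge into one target.
targets = viseme_timeline([("m", 2.34), ("b", 2.40), ("aa", 2.46), ("f", 2.60)])
```

Here `targets` holds three entries rather than four: closed lips at 2.34 seconds, an open mouth at 2.46, and lip against teeth at 2.60. That list is the blueprint the synthesis stage works from.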

Stage three: neural face synthesis. This is where the repainting actually happens. A generative model, usually a GAN variant or a diffusion-based renderer, takes the original video frame, the original face texture, the lighting conditions, and the target viseme shape, then outputs a new frame where the mouth region has been replaced. The replacement has to match the actor's skin tone, the shadows under their chin, the slight asymmetry of their real smile.
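The generative model itself is far beyond a short example, but the step of putting a new mouth back into an old frame can be sketched. This assumes the generated patch is composited through a soft-edged mask, a common approach that the tools described here may or may not use. The pixel values and mask are invented, and the example works on a single row of grayscale pixels to stay readable.

```python
def feather_blend(original_row, synthesized_row, mask_row):
    """Blend one row of pixels through a soft mask.

    A mask value of 0.0 keeps the original pixel, 1.0 takes the
    synthesized pixel, and values in between mix the two. The gradual
    ramp at the mask edge is what hides the boundary of the patch.
    """
    if not (len(original_row) == len(synthesized_row) == len(mask_row)):
        raise ValueError("rows must have the same length")
    return [
        (1.0 - m) * o + m * s
        for o, s, m in zip(original_row, synthesized_row, mask_row)
    ]


original = [100.0, 100.0, 100.0, 100.0, 100.0]     # untouched frame
synthesized = [140.0, 140.0, 140.0, 140.0, 140.0]  # generated mouth patch
mask = [0.0, 0.5, 1.0, 0.5, 0.0]                   # soft edge on both sides

blended = feather_blend(original, synthesized, mask)
```

With a hard-edged mask the jump from 100 to 140 would land between two neighboring pixels. With the feathered one it is spread across the ramp, which is the difference between a seam and no seam.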

Done well, you never see the seam. Done badly, you get the uncanny valley in a very specific location: a perfectly normal face with a mouth that moves like a sock puppet.

A Worked Example: Two Minutes of Dialogue

Picture a two-minute scene from that Portuguese thriller. The lead actress delivers a monologue. The English dub runs about eight seconds longer. Translation rarely preserves length, and in this scene the English script simply needs more syllables to carry the same information.

Eight extra seconds of audio, same amount of screen time. The system can't slow the video down, so it does several things at once. It shortens some pauses in the dubbed audio by a fraction. It identifies moments where the actress's mouth is nearly closed, natural breath points, and lets the English line run on through a few of those frames. For the remaining mismatch, it synthesizes new mouth movements that fit the English phoneme sequence into the actual time available, warping the jaw and lip positions frame by frame.
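The timing side of that squeeze is budget arithmetic. Here is a minimal sketch, assuming the slack is recovered from the dubbed audio in two steps: shorten the pauses first, then speed the speech up slightly. The function name, the limits, and the split of the 128-second dub into speech and pauses are all hypothetical.

```python
def fit_dub(speech_s, pause_s, screen_s, min_pause_keep=0.75, max_speedup=1.08):
    """Plan how to fit a dubbed track into a fixed amount of screen time.

    speech_s / pause_s: seconds of speech and of silence in the dub.
    screen_s: seconds of picture available.
    min_pause_keep: fraction of pause time that must survive trimming.
    max_speedup: largest speech speed-up assumed to sound natural.
    """
    overflow = speech_s + pause_s - screen_s
    if overflow <= 0:
        return {"pause_kept_s": pause_s, "speedup": 1.0, "leftover_s": 0.0}

    # Step 1: trim pauses, but never below the floor.
    trimmed = min(overflow, pause_s * (1.0 - min_pause_keep))
    pause_kept = pause_s - trimmed

    # Step 2: speed speech up just enough, capped at max_speedup.
    target_speech = screen_s - pause_kept
    if target_speech > 0:
        speedup = min(speech_s / target_speech, max_speedup)
    else:
        speedup = max_speedup

    # Whatever still does not fit has to be absorbed some other way.
    leftover = max(0.0, speech_s / speedup + pause_kept - screen_s)
    return {"pause_kept_s": pause_kept, "speedup": speedup, "leftover_s": leftover}


# The scene above: 120 s of picture, a 128 s dub (assumed 104 s speech, 24 s pauses).
plan = fit_dub(speech_s=104.0, pause_s=24.0, screen_s=120.0)
```

Under these made-up numbers, pause trimming recovers six of the eight seconds and a speed-up of about two percent covers the rest. Tighten either limit and `leftover_s` stops being zero, which is the point at which someone has to rewrite the line.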

Not perfect. A careful viewer watching a side-by-side will spot the softened jaw on a hard consonant around the ninety-second mark. But on a first watch, at normal speed, on a television across the room, it holds.

That's the real benchmark these tools are chasing: not forensic accuracy, but perceptual plausibility. Those are not the same thing, and the industry would do well to stop pretending otherwise.

What People Get Wrong About This Technology

The common assumption is that AI dubbing is basically autocomplete for faces. Point it at a video, feed it a translation, press go.

It isn't.

Profile shots break the pipeline badly. Landmark detection depends on having enough of the face visible to anchor the 3D model. A character filmed at 45 degrees, looking away, gives the system too little to work with, and synthesis artifacts multiply fast. Most current tools quietly flag these shots and hand them back to a human compositor, which tells you something about where the actual confidence level sits.
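The flagging step can be as plain as a threshold. This sketch assumes a head-pose estimator has already produced a yaw angle for each frame of a shot (0 degrees meaning the actor faces the camera). The function name and both cutoffs are hypothetical, not values taken from any real tool.

```python
def needs_human_review(yaw_degrees, max_yaw=35.0, max_bad_fraction=0.2):
    """Decide whether a shot should go to a human compositor.

    yaw_degrees: one head-yaw estimate per frame, in degrees.
    max_yaw: largest turn (either direction) the pipeline is trusted with.
    max_bad_fraction: share of frames allowed to exceed max_yaw.
    """
    if not yaw_degrees:
        return True  # no pose data at all: do not trust automation
    bad_frames = sum(1 for yaw in yaw_degrees if abs(yaw) > max_yaw)
    return bad_frames / len(yaw_degrees) > max_bad_fraction


frontal_shot = [0.0, 5.0, -10.0, 12.0, 8.0]
profile_shot = [40.0, 45.0, 50.0, 44.0, 47.0]

frontal_flagged = needs_human_review(frontal_shot)
profile_flagged = needs_human_review(profile_shot)
```

The frontal shot passes and the 45-degree shot gets flagged. The interesting design choice is the fraction: a single frame of head turn shouldn't pull a whole shot out of the automated path, but a sustained one should.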

Speaker identity is another real constraint. The viseme synthesis is trained on faces in general, not this specific face. An actor with an unusually wide mouth, a pronounced overbite, or heavy facial hair covering the lip line produces outputs that drift from their actual appearance. Studios using these tools at scale run a fine-tuning pass on each lead actor's face before production starts, sometimes feeding the model hundreds of close-up reference frames just to tighten that drift. It's less like running software and more like fitting a bespoke suit.

Then there's emotion. A viseme for the "m" sound looks the same whether the character is whispering an endearment or spitting a threat. Geometrically identical. The surrounding face, the eyes, the brow, the tension in the cheek muscles, those carry the emotional register. AI dubbing tools that only retouch the mouth region can produce a technically synced lip movement that feels completely wrong because the rest of the face is still performing the original language's emotional cadence.

The better systems now synthesize a wider facial region. Some extend the synthesis mask to include the chin, the lower cheeks, and the throat, because swallowing and jaw tension are visible, and they tell the story too.

The Seam You're Actually Looking For

If you want to spot AI dubbing in the wild, don't watch the lips. Watch the throat.

The larynx moves when people speak, and it moves differently for different vowels. Current synthesis tools rarely touch that region. When the audio says one thing and the throat is doing something slightly different, your brain logs it as wrongness without quite knowing why. Have you ever felt vaguely unsettled by a dub you couldn't objectively fault? That's probably where the problem was.

When you're watching a dubbed production and you never consciously notice it, the tool did its job. That's not a low bar. Getting the human perceptual system to stand down is genuinely hard, and the distance between "close enough" and "invisible" is where most of the serious engineering effort actually lives.

The mouth is the hard part. The throat is the part nobody's fully solved yet.