How AI Spell-Check Tells Typos From Style Choices

The Cursor Blinks. The Squiggle Appears.

You type gonna and wait. Nothing. You type recieve and the red line arrives before your finger leaves the E key. Somewhere inside your word processor, a model just made two separate judgment calls, almost instantly, and it got both right. That is not a dictionary lookup. It is something worth understanding.

Modern AI spell-checkers distinguish intentional stylistic choices from genuine errors by reading context, not just characters. Short answer, done. But the mechanism behind it changes how much you should trust these tools, and more importantly, when you should argue back.

It Stopped Being a Dictionary a Long Time Ago

Old-school spell-check was exactly what it sounds like: your word against a list. Colour wasn't in the American list, flagged. Teh was never in any list, flagged. Simple, brittle, and the source of approximately one million unnecessary corrections to proper nouns.
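To make the old approach concrete, here is a minimal Python sketch of a list-based checker. The word list and the function name dictionary_flags are invented for illustration; a real checker would ship a far larger lexicon and smarter tokenization.

```python
# Hypothetical American-English word list, kept tiny on purpose.
AMERICAN_WORDS = {"the", "color", "was", "bright"}


def dictionary_flags(text, lexicon=AMERICAN_WORDS):
    """Return every word that is missing from the lexicon, in order."""
    flagged = []
    for raw in text.split():
        # Strip surrounding punctuation and ignore case before the lookup.
        word = raw.strip(".,!?;:\"'").lower()
        if word and word not in lexicon:
            flagged.append(word)
    return flagged
```

Run on Teh colour was bright., this flags teh and colour. It would flag Kahneman just as readily, because list membership is the only thing it checks.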

The shift happened when spell-checkers started training on massive corpora of real human writing: published fiction, journalism, forum posts, transcribed speech. The model didn't just learn which strings of letters form valid words. It learned which strings appear in which situations.

Take gonna. It shows up millions of times in casual dialogue, song lyrics, and direct speech. It almost never appears in legal briefs or academic abstracts. So when you type gonna in a document that already contains contractions, first-person voice, and short punchy sentences, the model assigns high probability to intentional. The same string in a formal memo sits in a very different probability distribution, and the model knows it.

This is the core mechanism: contextual probability, not character matching. The model is always asking, given everything else in this document, how likely is this string to be a mistake?
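A toy version of that question fits in a few lines of Python. Everything here is an assumption made for illustration: the counts are invented, the two register labels are hypothetical, and a production model would condition on far richer context than a single label.

```python
# Made-up counts: how often a string turned out to be a deliberate choice
# versus a genuine error in each register.
HYPOTHETICAL_COUNTS = {
    ("gonna", "casual"): {"intentional": 980, "error": 20},
    ("gonna", "formal"): {"intentional": 5, "error": 95},
}


def p_error(word, register, counts=HYPOTHETICAL_COUNTS, smoothing=1.0):
    """Estimate P(error | word, register) with additive smoothing.

    With no evidence at all, the estimate falls back to 0.5.
    """
    seen = counts.get((word, register), {"intentional": 0, "error": 0})
    errors = seen["error"] + smoothing
    total = seen["error"] + seen["intentional"] + 2 * smoothing
    return errors / total
```

With these invented numbers, p_error("gonna", "casual") comes out near 0.02 and p_error("gonna", "formal") near 0.94. Same string, very different estimates.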

A Tale of Two Writers

Picture two people using the same app. Maya is drafting a literary short story. Her protagonist speaks in clipped, fragmented sentences. She writes: He left. Didn't say why. Just left. The second sentence is grammatically incomplete. The tool flags it lightly, if at all, because the surrounding prose is clearly literary fiction with an established narrative voice and several prior fragments. The model has built a style profile of the document, and Didn't say why fits it.

Then there's David, writing a technical report on supply chain logistics. He types the same fragment accidentally, mid-paragraph. Same three words. The tool flags it harder, because nothing in the surrounding 800 words of formal, third-person technical prose predicts that sentence structure.

Same string. Different verdict.

That's not magic. It's a language model, typically transformer-based, computing the probability of each token given its neighbors and surfacing a flag only when that probability drops below a learned threshold for the document's register.
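The thresholding step can be sketched in Python as follows. The register names and threshold values are hypothetical placeholders; a real system would learn them, and the token probability would come from the language model, not from a constant.

```python
# Hypothetical per-register thresholds. A lower value means more tolerance
# for unlikely tokens before a flag is shown.
REGISTER_THRESHOLDS = {"literary": 0.01, "technical": 0.10}


def should_flag(token_probability, register,
                thresholds=REGISTER_THRESHOLDS, default_threshold=0.05):
    """Flag a token when its contextual probability is below the threshold
    for the document's register."""
    return token_probability < thresholds.get(register, default_threshold)
```

Feed it the same probability of 0.04 for Maya's story and for David's report and the answers differ: no flag under "literary", a flag under "technical". In practice the probability itself would also shift with the surrounding text, which widens the gap further.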

What People Assume (And Why They're Wrong)

The most common misconception is that these tools operate with a binary correct/incorrect judgment. They don't. They operate with confidence scores, and the flag you see (or don't) is the visible tip of a probabilistic iceberg, which is a less satisfying mental model but a far more accurate one.

A second misconception: adding a word to your personal dictionary is the same as training the tool to understand your style. It isn't. You're whitelisting a string, not teaching context. The tool will stop flagging Kahneman after you add it, but it still won't know whether your use of Kahneman in a sentence about celebrity gossip is deliberate or a paste error.

The third is subtler. People assume that if the tool doesn't flag something, it approved it. What it actually means is that the model's confidence in an error fell below the display threshold. Tools such as Grammarly and the grammar layer in recent versions of Microsoft Word appear to be tuned to keep false positives low, not to guarantee correctness. There is a meaningful difference, and conflating the two is how writers end up trusting silence they shouldn't.

Found an unflagged mistake in your own draft? That's not the tool failing. It's the tool making a probabilistic bet that happened to lose.
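All three misconceptions show up in one small Python sketch. The Suggestion type, the visible_flags function, and the 0.8 display threshold are assumptions made for illustration, not any product's actual design.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    text: str
    error_confidence: float  # hypothetical model score between 0 and 1


def visible_flags(suggestions, personal_dictionary=frozenset(),
                  display_threshold=0.8):
    """Return only the suggestions a user would actually see."""
    shown = []
    for suggestion in suggestions:
        if suggestion.text in personal_dictionary:
            # A whitelisted string is suppressed everywhere; context is
            # never consulted.
            continue
        if suggestion.error_confidence >= display_threshold:
            shown.append(suggestion)
    return shown
```

Given recieve at 0.99, alot at 0.6, and a whitelisted Kahneman at 0.9, only recieve is shown. The 0.6 score for alot never reaches the screen, which is silence, not approval.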

The Features That Actually Do the Work

So concretely, what signals does the model use?

Register consistency. Formal vocabulary density, average sentence length, passive voice frequency. A document sitting in the 80th percentile for formal register gets treated differently than one in the 20th.

Repetition patterns. If you've written alright (one word) four times already, the fifth instance gets flagged less aggressively. The model weights repeated intentional choices higher than one-offs. It is, in this specific way, more generous than most editors.

Genre signals. Dialogue tags, code blocks, headers, bullet points, and even font metadata in some tools all feed into the genre estimate. A document with section headers and citations gets a different stylistic baseline than a plain-text journal entry.

User feedback loops. Every time you dismiss a suggestion, that signal feeds back. After you've ignored fifteen comma-splice flags, the model starts treating your comma splices as lower-confidence errors. It's building a style model of you, not just your document.

This is why a writer who's been using the same tool for two years gets noticeably fewer false positives than someone who just installed it. By that point the model has accumulated roughly 50 to 200 dismissed suggestions, a realistic count for any regular writer, and quietly recalibrated around them.
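Two of these signals are easy to sketch in Python. The feature names, the contraction pattern, and the decay factor of 0.85 are all assumptions chosen for illustration; real tools would use many more features and learned weights.

```python
import re

# Rough pattern for common English contractions such as didn't or it's.
CONTRACTION = re.compile(r"\b\w+'(?:t|s|re|ve|ll|d|m)\b", re.IGNORECASE)


def register_features(text):
    """Compute two crude register signals for a passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {"avg_sentence_length": 0.0, "contraction_rate": 0.0}
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "contraction_rate": len(CONTRACTION.findall(text)) / len(words),
    }


def dampened_confidence(base_confidence, prior_uses=0, dismissals=0,
                        decay=0.85):
    """Lower an error confidence for each earlier use of the same form in
    the document and for each time the user dismissed that kind of flag."""
    return base_confidence * decay ** (prior_uses + dismissals)
```

Short sentences and a high contraction rate push a passage toward the informal end. Under this decay factor, four earlier uses of alright roughly halve a 0.9 confidence, and fifteen dismissals take it below 0.1.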

The Honest Limit

None of this means the tool understands your intent. It infers intent, probabilistically, from pattern. That distinction is real and it matters.

A genuinely novel stylistic choice, one that sits outside the training distribution, will get flagged regardless of how deliberate it is. Invent a new punctuation convention and the model has no basis for recognizing it as intentional. It will see low probability and surface a flag. The model is, at its core, a very sophisticated measure of how much your writing resembles other writing it has already seen. Think of it as a brilliant but deeply conservative copy editor who has read everything and is made uneasy by anything they haven't.
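A tiny bigram model shows the effect in Python. It stands in for the much larger models these tools use, and the three-sentence corpus and the invented ~~ mark are purely illustrative.

```python
import math
from collections import Counter


def train_bigrams(sentences):
    """Count unigrams and bigrams over whitespace-split sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def avg_log_prob(sentence, unigrams, bigrams):
    """Average add-one-smoothed bigram log probability per token."""
    tokens = ["<s>"] + sentence.split()
    if len(tokens) < 2:
        return 0.0
    vocab_size = len(unigrams) + 1  # one extra slot for unseen tokens
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1)
                          / (unigrams[prev] + vocab_size))
    return total / (len(tokens) - 1)


corpus = ["he left early", "she left early", "he said nothing"]
unigrams, bigrams = train_bigrams(corpus)
familiar = avg_log_prob("he left early", unigrams, bigrams)
novel = avg_log_prob("he left ~~ early", unigrams, bigrams)
```

The sentence with the invented mark scores lower than the familiar one, not because it is wrong but because nothing like it was in the corpus. The model cannot tell those two cases apart.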

So here's the question that should genuinely unsettle any writer who cares about originality: if something new, by definition, looks like error from that vantage point, what does it mean that most writers just accept the flag and move on?

These tools are calibrated for the center of the distribution, not the edges. They will catch recieve every single time. But the writer who genuinely pushes language somewhere new will always be fighting the squiggle. That tension is not a flaw in the tool. It is a faithful description of what the tool actually is, and what it was never designed to protect.
