
Perplexity and Burstiness Score: What They Mean in AI Detection

7 min read · NotGPT Team

A perplexity and burstiness score is a two-part measurement that most AI detectors use to estimate whether a piece of text was written by a human or generated by a machine. Perplexity captures how predictable each word choice is given the words that came before it; burstiness captures how much sentence length varies across the passage. Together, these two numbers form the statistical backbone of AI text detection — but they carry real limitations that anyone who writes, teaches, or edits professionally should understand before acting on a result.

What Is a Perplexity Score?

Perplexity is a concept borrowed from information theory and adapted for natural language processing. When a language model reads a sentence, it tries to predict each next word based on everything it has seen so far. If every prediction comes easily, meaning the model could have written the sentence itself, perplexity stays low. If words arrive in unexpected combinations or unusual registers, perplexity rises.

AI detectors use this property because large language models generate text by selecting statistically likely sequences. The output naturally clusters near high-probability word choices, which means it tends to score consistently low on perplexity across a passage. Human writing, by contrast, makes more idiosyncratic choices: different vocabulary registers within the same paragraph, unexpected comparisons, incomplete trains of thought that resolve later, or subject-matter-specific jargon that a general-purpose model would not default to. These features produce higher perplexity on average.

The practical problem is that clear, formal writing deliberately avoids surprises. Academic essays, legal briefs, technical documentation, and standardized test responses all use controlled vocabulary and structured argumentation, patterns that push perplexity toward AI-typical ranges even when every sentence was written by hand. This overlap between clean human writing and AI output is the root cause of most perplexity-based false positives, and it is why perplexity alone is not sufficient to make a reliable determination of authorship.

Perplexity does not measure quality or intelligence. It measures predictability — how closely the text follows the paths a language model would have taken through that sentence.
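
The arithmetic behind the score is compact: perplexity is the exponential of the average negative log-probability the model assigns to each token. Here is a minimal sketch in Python, assuming per-token probabilities have already been obtained from a language model (the values below are invented for illustration):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token.
    Lower values mean the text was more predictable to the model."""
    neg_log_likelihood = -sum(math.log(p) for p in token_probs)
    return math.exp(neg_log_likelihood / len(token_probs))

# Hypothetical per-token probabilities a language model might assign.
# Formulaic sentence: every word is a high-probability continuation.
predictable = [0.42, 0.51, 0.38, 0.47, 0.55, 0.40]
# Idiosyncratic sentence: several words the model did not expect.
surprising = [0.42, 0.08, 0.31, 0.02, 0.25, 0.05]

print(f"formulaic text:     {perplexity(predictable):.1f}")  # ≈ 2.2
print(f"idiosyncratic text: {perplexity(surprising):.1f}")   # ≈ 8.5
```

Real detectors compute these probabilities with a neural language model over thousands of tokens, but the aggregation is the same: the more often the model's expected word actually appears, the lower the final score.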

What Is a Burstiness Score?

Burstiness measures variation in sentence length across a passage. A high burstiness score means the text alternates unpredictably between short and long sentences: a quick declarative after an extended subordinate clause, a fragment for emphasis, a run-on that carries momentum before breaking into a shorter follow-up. This is the natural rhythm of human writing. Most people mix sentence lengths without thinking about it; the variation emerges from changes in thought complexity, pacing decisions, and personal style developed over years of reading and writing.

AI-generated text tends to cluster sentences near a consistent length, even when individual sentences look normal on their own. The model is not making conscious pacing decisions; it is completing one sequence and starting another, and the underlying statistics pull each sentence toward a similar shape. A passage of AI text often reads as smooth but also metronomic, with each sentence landing at similar weight and rhythm. Detectors score this evenness: uniformly structured text raises the probability of AI authorship, while varied sentence length is treated as a human signal.

Burstiness is considered the more reliable half of the pair precisely because the variation humans produce has no consistent underlying pattern. When AI tools are prompted to vary sentence length explicitly, the result often reads as choppy rather than natural, and that unnaturalness can itself become detectable to a trained model.

Burstiness is the metric AI writing tools struggle most to mimic convincingly. Human sentence-length variation has no fixed formula, which makes it genuinely hard to fake at scale.
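
One common way to quantify this rhythm is the spread of sentence lengths relative to their mean. The sketch below uses the coefficient of variation of words per sentence; the sentence splitter is deliberately naive, and real detectors each define burstiness somewhat differently:

```python
import re
import statistics

def burstiness(text):
    """Coefficient of variation of sentence length (stdev / mean).
    Higher values mean more variation between short and long sentences."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.stdev(lengths) / statistics.mean(lengths)

human_like = ("The meeting ran long. Nobody minded, because the budget "
              "discussion finally surfaced the staffing problem everyone "
              "had been avoiding for months. Then it ended. Abruptly.")
machine_like = ("The meeting covered several important topics. The budget "
                "discussion raised questions about staffing levels. The team "
                "agreed to revisit the issue next quarter.")

print(f"varied rhythm: {burstiness(human_like):.2f}")    # ≈ 1.16
print(f"even rhythm:   {burstiness(machine_like):.2f}")  # ≈ 0.20
```

The second passage is grammatically fine sentence by sentence, but its lengths barely move; that flatness is exactly what a detector scores.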

How Perplexity and Burstiness Scores Are Combined Into a Single Result

Most AI detectors report a single AI-probability percentage rather than two separate numbers, because the perplexity and burstiness scores are combined inside the model before the result reaches the user. Text that scores low on perplexity and low on burstiness (predictable word choices and uniform sentence length) receives a high AI-probability output. Text that scores high on both tends to return as likely human. When the two metrics point in different directions, detectors rely on secondary signals to resolve the disagreement: vocabulary distribution (AI text favors certain mid-frequency words over rare or highly colloquial ones), transition-word density (AI writing overuses formal connectors like furthermore and moreover), paragraph-length uniformity, and the near-total absence of the small grammatical irregularities that appear in unedited human prose.

This combination approach is why newer detectors outperform older tools that relied on perplexity alone. A single metric is relatively easy to game: modifying prompts or adding certain instructions can raise perplexity on AI output without meaningfully changing how the text reads. A model that cross-checks multiple signals at once is significantly harder to fool consistently, though still not infallible.

Understanding which signals your detector uses beyond perplexity and burstiness helps explain why scores vary between tools. Two detectors analyzing the same text can return different probabilities because they weight secondary signals differently or were trained on different datasets. This inconsistency is one reason domain experts caution against using any single detector as a sole source of truth. The toy sketch after the list below shows the general shape of how two signals fold into one probability.

  1. Low perplexity + low burstiness = strong AI signal in most current detectors.
  2. High perplexity + high burstiness = strong human signal.
  3. Mixed results (one high, one low) trigger secondary analysis of vocabulary distribution and structural patterns.
  4. No single threshold is universal — each detector calibrates its own cutoff based on its training data.
  5. The final percentage is a probability estimate, not a binary determination of authorship.
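
To make the combination concrete, here is a toy logistic blend of the two signals. Every number in it (weights, reference points, the resulting percentages) is invented for illustration; a production detector learns its own calibration from training data and folds in the secondary signals described above:

```python
import math

def ai_probability(perplexity, burstiness):
    """Toy logistic combination of the two signals (illustrative only).
    Low perplexity and low burstiness both push the score toward 'AI'."""
    # Reference points (5.0, 0.6) and weights (0.8, 2.0) are made up.
    z = 0.8 * (5.0 - perplexity) + 2.0 * (0.6 - burstiness)
    return 1.0 / (1.0 + math.exp(-z))

print(f"{ai_probability(perplexity=2.2, burstiness=0.20):.0%}")  # ≈ 95%: strong AI signal
print(f"{ai_probability(perplexity=8.5, burstiness=1.16):.0%}")  # ≈ 2%: strong human signal
print(f"{ai_probability(perplexity=2.2, burstiness=1.16):.0%}")  # ≈ 75%: mixed signals
```

Note how the mixed case lands in the ambiguous middle; that is the region where real detectors fall back on vocabulary distribution and structural patterns.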

Why a Perplexity and Burstiness Score Can Wrongly Flag Human Writing

False positives, human text flagged as AI-written, are the most consequential limitation of perplexity and burstiness scoring. Non-native English speakers are disproportionately affected. When someone writes in a second language, they often choose safer, more predictable vocabulary and avoid complex syntax, compressing perplexity scores toward AI-typical ranges without any machine involvement. A 2023 study from Stanford found that AI detectors flagged non-native English writing as AI-generated at significantly higher rates than native-speaker writing, a direct consequence of how perplexity scoring handles limited vocabulary range.

Standardized academic formats compound the problem. Five-paragraph essays, technical reports, and standardized exam responses impose structure that flattens both metrics: defined paragraph order reduces perplexity, and deliberate editing for clarity smooths sentence-length variation. Heavily revised writing of any kind is at risk. Multiple editing passes strip out the irregularities that signal human authorship: the stray em dash, the sentence that runs slightly too long before a hard stop, the paragraph that breaks the expected structure. The text becomes cleaner and more uniform with each pass, and both metrics shift in the direction a detector associates with AI output.

Conversely, AI-generated text can evade detection when writers use system prompts specifically designed to introduce variation, or when AI output is edited extensively before submission. The scores are probabilistic estimates based on statistical patterns; they are not direct evidence of how a text was produced.

A high AI-probability score is a flag, not a verdict. Detection tools estimate the statistical likelihood that a model produced the text — they do not observe the act of writing.

How to Respond When a Score Flags Your Writing

When a score comes back higher than expected, start by looking at which passages the detector highlighted rather than fixating on the single percentage. Perplexity-driven flags cluster around technical sections, formulaic openings, and heavily edited conclusions: places where vocabulary naturally becomes controlled and predictable. Burstiness flags appear in sections where you deliberately trimmed sentences for clarity or where the subject matter imposed a consistent rhythm, such as step-by-step instructions or numbered lists.

To bring a score down on writing you produced yourself, vary sentence structure intentionally: let a short declarative stand alone after a longer explanation, use specific personal examples or cited details that a general-purpose AI model would not generate, and avoid long chains of similar-length sentences in any single paragraph (the short script after the list below shows one way to find those chains). Replacing generic transitions with more specific connectors, or with no connector at all, also helps loosen the uniformity a detector reads as suspicious.

If you are reviewing someone else's work and relying on these scores in an academic context, treat a high number as a reason to look more closely, not as final evidence. Combining the score with draft history, cited sources, and the specificity of the argument produces a more defensible assessment than a single perplexity and burstiness score in isolation.

  1. Read the highlighted passages in the report rather than fixating on the total score alone.
  2. Check whether flagged sections are technical, formulaic, or heavily edited — the most common drivers of false positives.
  3. Rewrite flagged passages by alternating short and long sentences deliberately.
  4. Replace generic transition words with specific connectors, examples, or no transition at all.
  5. If reviewing someone else's work, pair the score with draft history and in-class writing samples before drawing any conclusions.
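
As a practical aid for step 3, a short script can surface runs of similar-length sentences that flatten burstiness before a detector ever sees the draft. This is a hypothetical helper, not part of any detector; the window and tolerance values are arbitrary starting points:

```python
import re

def uniform_runs(text, window=3, tolerance=1):
    """Flag stretches of consecutive sentences with near-identical
    word counts, the even rhythm a detector reads as machine-like."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    flagged = []
    for i in range(len(lengths) - window + 1):
        chunk = lengths[i:i + window]
        if max(chunk) - min(chunk) <= tolerance:
            flagged.append((i + 1, i + window, chunk))
    return flagged

draft = ("The model processes the input tokens. The encoder produces a "
         "contextual representation. The decoder generates the output text. "
         "Then you ship it.")
for start, end, run in uniform_runs(draft):
    print(f"sentences {start}-{end} have near-uniform lengths {run}")
```

Rewriting even one sentence in a flagged run, either compressing it to a fragment or letting it sprawl, is usually enough to break the pattern.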

Check Your Own Text Before Someone Else Does

Running your draft through a detector before submitting lets you see where the perplexity and burstiness score lands and which specific sentences are driving the result, before an instructor, editor, or HR reviewer sees it. This kind of pre-submission check has become routine for students working on high-stakes assignments, professionals submitting reports to editorial teams, and writers who use AI assistance during drafting and need to understand how the final version reads to a detection algorithm.

It is also a useful exercise simply to understand your own writing patterns: you may find that certain sections of your work consistently score as more predictable, not because you used AI, but because of habits in how you structure arguments or choose vocabulary. The goal is not to game a system; it is to understand what the numbers reflect about your writing and fix misleading signals before they create a problem. NotGPT's AI Text Detection tool returns a probability score with sentence-level highlighting so you can see exactly which passages drive the flag. If sections read as machine-like even in writing you produced yourself, the Humanize feature can rewrite them at Light, Medium, or Strong intensity to restore variation while keeping your meaning intact.

Detect AI Content with NotGPT

Before (87%, AI Detected): “The implementation of artificial intelligence in modern educational environments presents numerous compelling advantages that merit careful consideration…”

After Humanize (12%, Looks Human): “AI in schools has real upsides worth thinking about — but the trade-offs are just as real and shouldn't be glossed over…”

Instantly detect AI-generated text and images. Humanize your content with one tap.