The Page-Level Signals That Matter: Scoring Content Quality at Scale

Murrough Foley

Most experienced SEOs have had a working mental model of how Google evaluates content for years. We've known the broad strokes: relevance matching first, then page-level quality signals, then authority and links. The question was always how granular the quality evaluation actually gets — and whether Google is doing something more sophisticated than counting words and checking heading tags.

The 2024 Google API documentation leak, along with the earlier Yandex source code leak, gave us confirmation. Not revelation — confirmation. The signals we found mapped directly to concepts the SEO community had been theorising about. But seeing them named in internal documentation, with specific attribute names and module structures, turned theory into something closer to fact.

So I started experimenting. Could I build scoring rubrics based on these signals and use LLMs to evaluate content against them at scale? And if so, would the scores actually tell me anything useful about why some pages rank and others don't?

How Google's Ranking Pipeline Works (The Short Version)

This isn't new to anyone who's been in SEO for a while, but it's worth laying out because each layer matters for what comes next.

Layer 1: Relevance Matching (BM25)

The first gate is cheap and fast. Google uses BM25 (or a variant of it) to match query terms to documents. It's a term-frequency ranking function from the 1990s, and despite all the advances in neural ranking, some form of BM25 is still the initial filter. If your content doesn't contain the terms and concepts that match the query, it never reaches the stage where quality signals are evaluated.

I think of it simply:

  1. BM25 gets you in the room — your content is relevant enough to be considered
  2. Quality signals determine your seat — where you rank among the relevant results

Most SEO advice focuses on stage 2 while assuming stage 1 is obvious. But I've seen plenty of genuinely excellent content that ranks poorly because it uses different terminology than the searcher, or buries its key points under tangential discussion.
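
To make stage 1 concrete, here's a minimal sketch of BM25 scoring using the open-source rank_bm25 package, a toy stand-in for whichever variant Google actually runs:

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

documents = [
    "how to migrate a legacy database to the cloud",
    "cloud migration checklist for legacy systems",
    "ten benefits of outsourcing software development",
]

# Naive whitespace tokenisation; real systems add stemming, stopwords, etc.
tokenised = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenised)

query = "legacy cloud migration".split()
scores = bm25.get_scores(query)

# Higher score = stronger lexical match. Off-topic documents score near zero.
for doc, score in sorted(zip(documents, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")
```

The third document never gets in the room, no matter how well-written it is. That's the whole point of the first gate.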

Layer 2: Page-Level Quality Signals

This is where it gets interesting — and where the leaks gave us the most to work with. Google isn't just checking whether your page is relevant. It's evaluating the content itself across multiple dimensions. Three signals from the leaked API documentation stood out:

contentEffort — How much genuine effort went into creating this content? Not word count. Effort in the sense of: how hard would this be to replicate?

originalContentScore — How much of the content represents original contribution versus aggregated or derivative information?

page2vecLq — This one uses page-level vector embeddings to identify topically unfocused or semantically poor pages. The "Lq" likely stands for "Low Quality" — it's believed to be a demotion flag for pages that don't make semantic sense, rather than a positive score for focused content. But the implication is the same: pages that stay tightly on topic avoid the flag.

These aren't the only content signals; the leak exposed over 14,000 attributes across 2,596 modules. But these three address a specific question: does this content have substance, or is it just well-formatted noise?

Layer 3: Authority Signals

Domain authority, page-level backlinks, referring domains, brand signals. This is the layer the SEO industry has understood longest, and it's still the most powerful. Koray Tugberk has done excellent work on topical maps, semantic structures, and what Google calls siteRadius — the idea that a site's topical authority radiates from its core subject matter. His research on topical authority is worth reading if you haven't.

The Question

Layers 1 and 3 are well-understood and well-tooled. We have keyword research tools for BM25 relevance. We have Ahrefs and Moz for authority metrics. Layer 2 has tools too — SurferSEO, Clearscope, MarketMuse all measure on-page signals like term frequency, heading structure, and content length. But I was interested in measuring something different: the qualitative signals the leaks pointed to. Not "does this page contain the right terms?" but "did someone actually put effort into this? Is there anything original here? Does it stay on topic?"

Those are harder to measure with traditional NLP. But LLMs are good at exactly this kind of qualitative judgment — if you give them the right rubrics.
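
Here's the shape of that in practice: a minimal sketch using the OpenAI Python client. The model name, the truncation limit, and the rubric wording are all placeholders; the real rubrics come later in this article.

```python
# pip install openai  (any chat-capable LLM client works; this one is illustrative)
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

EFFORT_RUBRIC = """You are scoring web content for effort on a 1-5 scale.
5 = original research, proprietary data, named expert interviews.
3 = solid research with some original perspective.
1 = generic content an AI could produce in minutes.
Reply with JSON only: {"score": <int>, "rationale": "<one sentence>"}"""

def score_effort(page_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: use whichever model you calibrate against
        messages=[
            {"role": "system", "content": EFFORT_RUBRIC},
            {"role": "user", "content": page_text[:12000]},  # truncate very long pages
        ],
    )
    return response.choices[0].message.content
```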

Five Dimensions I Chose to Investigate

There are plenty of content quality dimensions you could measure — and existing SEO tools already handle many of them well (keyword coverage, readability, heading structure, content length). I deliberately picked five that are more opaque and aren't well-served by the current toolkit. Three are based directly on leaked signals, one on Google's public quality guidelines, and one on structural best practices that I included as a baseline.

I built these as automated prompts for evaluating competitor content at scale, but they work just as well as manual checklists.

A note on these prompts: What I'm sharing below are simplified versions of the rubrics I use in production. They're enough to start scoring your own content and get a feel for where it sits. But if you want to use them at scale with an LLM, you need to know what you're getting into.

The calibration problem: LLMs aren't deterministic. Run the same prompt on the same content twice and you'll sometimes get different scores — particularly on borderline cases. A page that's genuinely between a 2 and a 3 on originality might score 2 one run and 3 the next. To get reliable results, you have to iterate: find the cases where the model is inconsistent, figure out why it's uncertain, and add specific rules to the prompt that resolve the ambiguity. "If the article names a framework but the advice underneath is standard, score 2 not 3" — that kind of thing.
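
One mitigation worth building in from the start: score each page several times and keep the median, treating the spread between runs as a signal in itself. A sketch, where `score_once` stands for any LLM call like the one above that returns an integer score:

```python
import statistics
from typing import Callable

def stable_score(page_text: str, score_once: Callable[[str], int], runs: int = 5) -> dict:
    """Run the same rubric several times; the spread flags borderline pages."""
    scores = [score_once(page_text) for _ in range(runs)]
    return {
        "median": statistics.median(scores),
        # A spread above 1 marks a borderline page, which is exactly
        # where the rubric needs a new tie-breaker rule.
        "spread": max(scores) - min(scores),
        "runs": scores,
    }
```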

The niche problem: What counts as "original" in software development content is different from what counts as "original" in health or finance. The score 2 vs 3 boundary shifts depending on what's standard in your industry. I've had to develop niche-specific extensions and examples for every client engagement. A rubric that works for B2B SaaS content needs different calibration than one for e-commerce product descriptions.

The honest time investment: Getting these prompts to produce consistent, reliable scores across a specific niche takes hours of iteration, not minutes. You score a batch, review the results manually, find the disagreements, add rules, rescore, repeat. It's tedious but necessary — and it's the difference between noisy scores that tell you nothing and calibrated scores you can actually make decisions from.

The base rubrics below give you the framework, the scoring tables, and the right questions to ask. Start here, try them on your own content, and expect to spend time refining them for your specific domain.

Each rubric scores content on a 1-5 scale. Most content scores 2-3. If you're consistently hitting 4-5 across all dimensions, you're producing content that's genuinely difficult to compete with.


1. Content Effort — The Replicability Test

Based on: Google's contentEffort signal

This is the dimension I find most interesting, because it cuts through so much of the noise around "content quality." The core question isn't "is this well-written?" — it's "how easily could this be replicated by a competitor or an AI?"

Think about that for a moment. A perfectly written summary of "10 Benefits of Outsourcing" can be produced by anyone with ChatGPT. It might be accurate, well-structured, and genuinely helpful. But it's infinitely replicable. There's no moat.

Compare that to an article that analyses 847 real outsourcing projects with proprietary cost data and named expert interviews. That content has a moat — it took months to create, requires access to data that competitors don't have, and contains insights that can't be generated from existing sources.

The Scoring Rubric

| Score | Label | What It Means |
|-------|-------|---------------|
| 5 | Exceptional | Original research, proprietary data, expert interviews. Would take months to replicate. |
| 4 | High | Comprehensive, clear expertise, original analysis. Would take days to replicate well. |
| 3 | Adequate | Solid research, some original perspective. Could be replicated in several hours. |
| 2 | Low | Mostly aggregated, template-based. Could be replicated in under an hour. |
| 1 | Minimal | Generic content an AI could produce in minutes. No original contribution. |

How to Use This

When evaluating your own content, ask:

  • Could someone reproduce this by summarising the top 10 Google results? If yes, you're at a 1-2.
  • Does this require genuine research, expertise, or access to create? If yes, you're at a 4-5.
  • Is there something in this content that doesn't exist anywhere else? Proprietary data, original screenshots, documented outcomes, named sources — these are the markers of effort that's hard to fake.

The most important calibration rule: length is not effort. A 3,000-word article of derivative content scores lower than 500 words of original research. Google appears to agree.

Full Content Effort Scoring Prompt

High-Effort Indicators (Score 4-5):

  • Proprietary data or original research
  • Expert interviews with named sources
  • First-hand experience clearly demonstrated
  • Original photography, screenshots, or multimedia
  • Custom-designed visual assets with specific logic
  • Analysis or insights not found in competing content

Low-Effort Indicators (Score 1-2):

  • Generic information available everywhere
  • No original insights or analysis
  • Template-based structure (intro, 5 points, conclusion)
  • Stock imagery only
  • Could be produced by prompting an AI with the title
  • Reads like a rewrite of existing content

The Critical Score 2 vs 3 Boundary: Could a generalist writer with good research skills produce this, or does it require someone who genuinely understands the domain? Generalist could do it → Score 2. Requires domain expert → Score 3.


2. Originality — New Knowledge vs New Labels

Based on: Google's originalContentScore signal

This is the dimension where I see the most self-deception. People genuinely believe their content is original because they wrote it themselves. But "I wrote it" and "it contains original ideas" are different things.

The test is simple: search for the main claims in your article. Would the top 10 results say roughly the same thing? If yes, your content is derivative — regardless of how well you wrote it.

This sounds harsh, but it's freeing once you accept it. Most content is a 2 on originality. That's fine — not every piece needs to be groundbreaking. But knowing where you actually stand lets you make intentional choices about where to invest effort.

The Scoring Rubric

| Score | Label | What It Means |
|-------|-------|---------------|
| 5 | First-to-publish | Breaking information or genuine discovery. Creates new knowledge. |
| 4 | Substantially original | Significant original analysis. Changes the conversation. |
| 3 | Mixed / Novel framing | Connects known concepts in a new way. Not just a summary. |
| 2 | Mostly derivative | Explains known concepts well. Competent but interchangeable. |
| 1 | Fully derivative | Rewrite, summary, or aggregation. No unique value. |

The False Positive Traps

These are the patterns I see most often that trick people (and scoring systems) into thinking content is more original than it is:

The Naming Fallacy. You group three standard tips and call it "The ABC Framework." Strip the name — is the advice standard? If yes, it's a 2, not a 3. Organisation is not creation.

The Metaphor Trap. You use a clever metaphor to explain a known concept. "Technical debt is like an iceberg." Does the metaphor change how we solve the problem, or just how we describe it? If just description, it's good writing, not originality.

The Expert Tone Trap. "In my experience, you should test your code." Authoritative language with generic advice is still generic advice. Compare: "In our testing, code coverage above 80% reduced production bugs by 34% (n=47 services)." That's evidence. That's original.

Full Originality Scoring Prompt

The Score 2 vs 3 Tie-Breaker:

| Feature | Score 2 (Derivative) | Score 3 (Mixed Originality) |
|---------|----------------------|-----------------------------|
| Frameworks | Categorises known things | Operationalises decision making |
| Synthesis | Combines Source A + Source B | Combines A + B to reveal Contradiction C |
| Perspective | "Here is what X is" | "Here is why the standard view of X is wrong/incomplete" |
| Utility | I could find this on Google/ChatGPT | I would need a specific expert for this insight |

The Score 4 vs 5 Boundary: Score 4 = Measurement. "We measured X at $37B." Score 5 = Discovery. "We expected X but found Y, contradicting Z." Being first to publish a data point isn't Score 5 — a weather report is first to publish today's temperature without discovering anything.


3. Topical Coherence — Does Your Content Stay On Topic?

Based on: Google's page2vecLq signal

A note on the approach: page2vecLq is believed to be a negative flag — it demotes pages that are semantically unfocused, rather than rewarding pages that are focused. My rubric inverts this into a positive scoring system (1-5, where 5 is highly focused). The reasoning is practical: if Google penalises incoherent pages, then scoring coherence positively gives us a proxy for how far a page is from triggering that penalty. It's not a perfect mirror of what Google computes, but it measures the same underlying property from the opposite direction.

This is the one that most SEOs underestimate. The instinct is to cover everything related to a topic — cast a wide net, be comprehensive. But the leaked signal suggests Google is measuring something more like semantic focus. How tightly does your content orbit a single topic?

The test is simple: can you summarise what this content is about in one sentence? If you struggle with that, your content has a coherence problem.

I see this most often in "ultimate guide" style content — articles that try to cover an entire domain in 5,000 words, touching 15 subtopics at surface level. Each section individually makes sense, but the whole thing doesn't have a centre of gravity. It's trying to rank for everything and ends up ranking for nothing.
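
You can build a rough proxy for this yourself with off-the-shelf embeddings. It isn't Google's computation, just the same underlying property measured cheaply: embed each section of the page, then check how tightly the sections cluster around their centroid.

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small and free; fine for a proxy

def coherence_proxy(sections: list[str]) -> float:
    """Mean cosine similarity of each section to the page centroid.
    Closer to 1.0 = tightly focused; a wandering FAQ drags it down."""
    embeddings = model.encode(sections, normalize_embeddings=True)
    centroid = embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    return float((embeddings @ centroid).mean())
```

The absolute number depends on the embedding model; what matters is the comparison across your own pages and your competitors'.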

The Scoring Rubric

| Score | Label | What It Means |
|-------|-------|---------------|
| 5 | Highly focused | Single clear topic with deep, comprehensive coverage. Zero filler. |
| 4 | Well focused | Clear central topic, consistent coverage. Minor tangents that relate. |
| 3 | Adequate focus | Identifiable main topic but uneven coverage. Some loose sections. |
| 2 | Unfocused | Too many loosely related topics. No clear throughline. |
| 1 | Incoherent | No clear central topic. Random collection of information. |

Red Flags

  • FAQ sections with unrelated questions. This is the most common coherence killer. A page about "React State Management" with an FAQ asking "What laptop should I buy for programming?" is trying to capture search traffic at the cost of topical focus.
  • Promotional sections disguised as content. A guide about cloud migration that ends with "Partner with CloudExperts for your transformation journey" breaks educational coherence.
  • The "kitchen sink" approach. Trying to cover languages, methodologies, cloud providers, databases, and career advice in one article about "software development."

Full Topical Coherence Scoring Prompt

The Tangential But Related Test: Could this section appear in an article about a different topic? If yes, it's a tangent, not core content.

Surface Coverage Threshold: Covering 8+ subtopics with surface-level treatment = Score 3 maximum. Deep coverage of one topic beats shallow coverage of many.

Pillar Content Exception: Comprehensive guides covering a broad topic can score 4-5 if there's clear organisational logic and each section contributes to a coherent whole. The key is whether it has structure or is just a list of loosely related sections.


4. E-E-A-T Signals — Who Wrote This and Why Should I Trust Them?

Based on: Google's Quality Rater Guidelines

I want to be upfront about this one, because E-E-A-T has a complicated history in the SEO world.

When Google introduced E-A-T (Expertise, Authoritativeness, Trustworthiness) in its Quality Rater Guidelines, the industry — led by practitioners like Marie Haynes — treated it as a direct ranking signal. The reasoning was logical: Google tells human raters to evaluate E-A-T, so Google's algorithm must be measuring E-A-T. Author bios were added to every page. "About Us" pages were expanded. Credentials were plastered everywhere.

Then in 2022, Google added the extra "E" for Experience, and the cycle repeated. E-E-A-T became the answer to every ranking question.

Here's the thing I've come to believe after looking at the leaked documentation and years of testing: E-E-A-T as a direct, measurable ranking signal is less straightforward than the industry assumes. There's evidence that Google applies these trust signals much more aggressively to YMYL (Your Money or Your Life) content — health, finance, legal, safety — than to general informational content. A medical article without author credentials may genuinely be suppressed. A blog post about JavaScript frameworks? Less clear.

I include E-E-A-T in my rubrics not because I'm certain it's a ranking signal for all niches, but because it's a useful quality framework regardless. Content with clear experience markers, demonstrable expertise, and transparent sourcing is better content — whether or not Google explicitly rewards it in your niche. And in YMYL verticals, the evidence for its impact is much stronger.

That said, I think experience is the component that matters more going forward — and the one that's undervalued. Why? Because experience is hard to fake with AI. An LLM can synthesise expert-sounding content from existing sources. It can mimic authoritative tone. But it can't produce the specific, granular details that come from actually doing something — the unexpected problems, the counterintuitive lessons, the specific numbers from a real project.

The Scoring Rubric

| Score | Label | What It Means |
|-------|-------|---------------|
| 5 | Exceptional | Clear evidence of all four components. Named expert, demonstrated experience, authoritative sources, transparent methodology. |
| 4 | Strong | Strong evidence of 3+ components. Author credibility established, expertise demonstrated. |
| 3 | Adequate | Moderate evidence. Some expertise shown, basic trust signals, but gaps exist. |
| 2 | Weak | Minimal signals. Generic authorship, unverified claims, little evidence of expertise. |
| 1 | None | Anonymous author, unsubstantiated claims, no trust signals. |

What to Look For

Experience indicators: Specific details only someone who's done the work would know. "We tried X and it failed because of Y" is stronger than "X is recommended for..." Original screenshots, photos, or artifacts from real projects.

Expertise indicators: Demonstrable knowledge that goes beyond what a generalist could research. Not "I'm an expert" — show it through the depth and accuracy of the content.

Authority indicators: Is this the expected place for this information? Are sources cited that are themselves authoritative? Does the broader web reference this author or site?

Trust indicators: Factual accuracy, properly sourced claims, transparency about methodology and limitations. Contact information. Disclosure of potential bias.

Full E-E-A-T Scoring Prompt

Important calibration note: Author attribution (byline, credentials, bio) is a separate consideration from content-level E-E-A-T. A well-written article with clear expertise can score well on content E-E-A-T even before an author byline is added. The author page adds an additional trust layer but shouldn't be the only signal.

Credentials must be relevant. A PhD in biology doesn't make someone an expert on software development.

Claims require evidence. "In my experience..." without specific details is not demonstrated experience.


5. Structural Quality — Does the Formatting Support the Content?

Based on: SEO best practices and web content guidelines

This is the most mechanical of the five dimensions and the easiest to get right — which is why it's frustrating how often otherwise excellent content gets undermined by poor structure. I'm keeping this section short because the advice is straightforward and you've probably heard it before. But from my benchmark work, I can tell you that plenty of pages in the top 30 still get this wrong.

The Scoring Rubric

| Score | Label | What It Means |
|-------|-------|---------------|
| 5 | Excellent | Perfect heading hierarchy, scannable format, strategic use of formatting. Professional editorial quality. |
| 4 | Good | Clear organisation, proper headings, good scannability. Minor improvements possible. |
| 3 | Adequate | Basic organisation, headings used, readable. Some structural issues but functional. |
| 2 | Poor | Disorganised, walls of text, inconsistent formatting. Hard to scan or navigate. |
| 1 | None | Stream of consciousness. No headings, no formatting. Appears unedited. |

The Basics That Matter

Heading hierarchy: H1 for the title, H2 for major sections, H3 for subsections. Never skip levels. Each heading should describe the content that follows — not be clever or vague.

Paragraph length: 3-5 sentences maximum. On the web, shorter paragraphs are almost always better. A wall of text signals "this wasn't written for online reading."

Formatting variety: Use bullet points, numbered lists, tables, bold text, and code blocks where they serve the content. But don't over-format — every formatting choice should make information easier to consume, not just break up text.

Internal and external links: Link to related content on your site and to authoritative external sources. These aren't just SEO signals — they're trust signals. Content that exists in isolation, with no references and no connections, feels less credible.

Full Structural Quality Scoring Prompt

Key calibration rules (the first two are mechanical enough to check in plain code; see the sketch after this list):

  • Walls of text (paragraphs exceeding 6-7 sentences) cap the score at 3
  • Skipping heading levels (H1 → H3) reduces the score by at least 1 point
  • Length must match depth — 3,000 words of surface coverage scores lower than 1,000 words of focused depth
  • Tables for comparative data, bullets for lists, code blocks for technical content — use the right format for the information type
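
A sketch of those first two checks with BeautifulSoup, assuming you're scoring rendered HTML:

```python
# pip install beautifulsoup4
from bs4 import BeautifulSoup

def structural_flags(html: str) -> list[str]:
    """Flag heading-level skips and walls of text. A rough proxy, not a full audit."""
    soup = BeautifulSoup(html, "html.parser")
    flags = []

    # Heading hierarchy: jumping from h1 straight to h3 is a skip.
    levels = [int(h.name[1]) for h in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])]
    for prev, curr in zip(levels, levels[1:]):
        if curr > prev + 1:
            flags.append(f"heading skip: h{prev} -> h{curr}")

    # Walls of text: paragraphs running past ~6 sentences (naive full-stop count).
    for p in soup.find_all("p"):
        text = p.get_text()
        if text.count(". ") + 1 > 6:
            flags.append(f"wall of text: {text[:60]}...")

    return flags
```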

Putting It All Together

These five dimensions aren't independent — they interact. High originality with poor structure means your insights are buried. Great structure with no effort means you've beautifully formatted a Wikipedia summary. Strong E-E-A-T with weak coherence means a credible author writing unfocused content.

The content that ranks well — consistently, across updates, in competitive niches — tends to score 3+ on all five dimensions and 4+ on at least two. That's a high bar. Most content on the web scores 2-3 on effort and originality, 3-4 on coherence and structure, and 2-3 on E-E-A-T.
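
That bar is concrete enough to encode directly. Here's the "three-plus on all five, four-plus on at least two" rule as a sketch:

```python
def clears_the_bar(scores: dict[str, int]) -> bool:
    """scores: dimension name -> 1-5 rating, one entry per dimension."""
    return min(scores.values()) >= 3 and sum(s >= 4 for s in scores.values()) >= 2

page = {"effort": 4, "originality": 3, "coherence": 4, "eeat": 3, "structure": 3}
print(clears_the_bar(page))  # True: nothing below 3, two dimensions at 4+
```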

How I use these rubrics in practice:

  1. Before writing: I check whether I can score at least a 3 on effort and originality. If not — if I'm planning to write content that could be produced by summarising existing sources — I either find an original angle or don't write it.
  2. During editing: I check coherence. Has the article stayed focused, or has it drifted? Are there sections that could be removed without affecting the core argument?
  3. Before publishing: I check structure and E-E-A-T signals. Are there specific, verifiable claims? Is the formatting helping or hindering? Would a reader trust this content based on what's on the page?

The rubrics aren't perfect. They're my interpretation of signals that Google has never officially confirmed using in the way I've described. But when I tested them against 44,000 SERP results, topical coherence showed a consistent, statistically significant correlation with rank — especially for low-authority sites targeting informational keywords. That's enough to make them useful, even if the underlying theory isn't exactly right.

What the Data Shows

I've tested these rubrics at scale — scoring 44,000 SERP results across 2,212 keywords and running 8 statistical methods to test whether quality predicts rank after controlling for domain authority.

Short answer: yes, but domain authority is 10x more important. Topical coherence showed the strongest signal. Content quality matters most for low-authority sites competing on informational keywords — which is exactly where you'd expect page-level signals to make a difference.
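
If you want to replicate the authority control on your own data, partial correlation is the simplest of those methods. A sketch with the pingouin package, where the CSV and column names are illustrative:

```python
# pip install pingouin pandas
import pandas as pd
import pingouin as pg

df = pd.read_csv("serp_scores.csv")  # one row per result: rank, coherence_score, domain_rating

# Does coherence still predict rank once domain authority is held constant?
result = pg.partial_corr(
    data=df, x="coherence_score", y="rank", covar="domain_rating", method="spearman"
)
print(result[["n", "r", "p-val"]])
```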

Full findings here.

The next thing I want to test is whether content effort correlates with page-level backlink acquisition — whether the ranking effect of effort is direct or laundered through links. If you're working on similar questions, I'd like to hear about it.

The 2024 Google API Documentation Leak

  • King, M. (2024). "Investigation of the Leaked Google Ranking Algorithm Data" iPullRank. ipullrank.com/google-algo-leak — First analysis identifying signals such as contentEffort, originalContentScore, and page2vecLq among the 2,596 leaked modules.
  • Anderson, S. (2024). "The contentEffort Attribute, The Helpful Content System and E-E-A-T" Hobo Web. hobo-web.co.uk — Detailed analysis of the connection between the contentEffort signal and the Helpful Content System.
  • Fishkin, R. (2024). "An Anonymous Source Shared Thousands of Leaked Google Search API Documents With Me" SparkToro. sparktoro.com — Independent verification of the leaked data's authenticity.

BM25

  • Robertson, S.E. et al. (1995). "Okapi at TREC-3." NIST. — Foundational work on BM25.
  • Robertson, S.E. & Zaragoza, H. (2009). "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval. — Comprehensive survey of BM25 variants.

E-E-A-T and the Quality Rater Guidelines

  • Google (2024). "Search Quality Evaluator Guidelines" guidelines.raterhub.com — Official framework for E-E-A-T.

Let's Connect

Find me on LinkedIn or X.