I Built a 2,000-Page Web Content Extraction Benchmark. Here's What I Found.

Murrough Foley

After the Google API leaks in 2024, I started looking at ranking signals differently. Not the usual title tags and backlinks stuff — the deeper signals. Things like contentEffort, Originality, TopicalCoherence. The kind of signals that Google uses to figure out whether a page has genuine substance or is just well-formatted noise.

I was investigating a site and noticed something interesting: their pages scored well on almost every quality axis I could think of. Good heading structure. Original data points throughout. Content that built on itself rather than repeating the same keywords. Strong topical coherence across the site. It wasn't just well-written — it was well-structured at the HTML level.

That got me thinking. In a world where LLMs can produce competent prose on any topic in seconds, what actually differentiates web pages? Writing ability is approaching zero cost. Expert knowledge is being commoditised. So what's left for Google to rank on?

My answer: not much will change fundamentally. It will still be links, the topical authority of a site, and after that — original data, personal expertise, and structural quality signals that are hard to fake at scale. The pages that win will be the ones with genuine first-party data, real experience, and content structures that reflect actual depth rather than template-driven filler.

I wanted to measure this. At scale. Across the SERPs.

The Extraction Problem

Pulling clean content from a single site is easy with Jaccard similarity and templates. But the web is a messy place, and doing it across ten different sites on page 1 of a SERP is a different kind of problem. To analyse content quality signals across thousands of competitor pages, you first need to extract the actual content from each page. Sounds simple — just grab the article text, right?

It is simple, if every page is a blog post. Trafilatura, Readability, Newspaper — they all do a solid job on standard article pages. Give them a WordPress post or a news article and they'll hand back clean text, 90%+ of the time.

But the web isn't just articles. Mix informational content with a landing page design — a SaaS features page, a product description page, a documentation site — and these tools start falling apart. They were built for articles, and they apply article-extraction heuristics to everything.
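Those article heuristics are mostly shallow text features. Here is a toy version of the classic link-density signal, in the spirit of Boilerpipe's shallow features (illustrative, not any library's actual code):

```python
import re

def link_density(html_block: str) -> float:
    """Fraction of a block's visible text that sits inside <a> tags,
    one of the classic shallow signals article extractors lean on."""
    text = re.sub(r"<[^>]+>", " ", html_block)
    anchors = re.findall(r"<a\b[^>]*>(.*?)</a>", html_block, re.S | re.I)
    anchor_text = re.sub(r"<[^>]+>", " ", " ".join(anchors))
    total = len(text.split())
    return len(anchor_text.split()) / total if total else 0.0

# Navigation looks like pure links; body prose does not.
nav = '<li><a href="/">Home</a></li><li><a href="/shop">Shop</a></li>'
para = '<p>Widgets come in three sizes, with a <a href="/spec">full spec</a> here.</p>'
```

A block scoring near 1.0 gets dropped as boilerplate; near 0.0 it gets kept. That rule works well on articles and breaks on forums and landing pages, where link-heavy blocks often are the content.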

I needed to extract content from:

  • Product pages where the description lives in a JSON-LD blob, not the visible DOM
  • Service/landing pages where content is spread across 10 different <section> elements — hero, features, testimonials, pricing, FAQ
  • Forum threads where the user posts ARE the content, but extractors filter them out because they match class="comment" boilerplate patterns
  • Documentation pages where I needed the actual docs, not the sidebar navigation
  • Collection/category pages where the only meaningful text is a paragraph above a product grid
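The JSON-LD case illustrates why article heuristics fail here: the usable description never appears in the visible text, so you have to parse the script blocks directly. A minimal stdlib sketch, with made-up markup and field values for illustration:

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collects every <script type="application/ld+json"> blob on a page."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blobs.append(json.loads("".join(self._buf)))
            self._buf = []
            self._in_jsonld = False

html_doc = """<html><body>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "description": "A sturdy widget."}
</script>
<div class="grid">thumbnails, prices, no prose</div>
</body></html>"""

parser = JsonLdExtractor()
parser.feed(html_doc)
product = parser.blobs[0]
description = product["description"]
```

An article-trained extractor looking only at visible DOM text would return the grid noise and miss the description entirely.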

And I needed it in markdown — with headings, links, tables, bold text preserved — because that's what I'm analysing for SEO signals. A flat text dump loses the structural information that matters most.

What Started as a Port Grew Into Something Bigger

It started as a quick Rust port of Trafilatura, Adrien Barbaresi's excellent Python extraction library. I needed speed — processing thousands of pages through a Python pipeline was a bottleneck — and Rust fits the bill.

But as I started testing across different page types, I kept hitting the same problem: one-size-fits-all extraction doesn't work on the modern web. A product page needs different handling than a forum thread. A SaaS landing page needs different handling than API documentation.

So I added page type classification. Then type-specific extraction profiles. Then a confidence score so I'd know when the extraction was likely garbage and needed a second pass. Then a markdown output mode. Then an ML classifier when the rule-based heuristics hit their ceiling.
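The resulting architecture is roughly classify-then-dispatch. A minimal Python sketch of the idea, where the page types, thresholds, and helper functions are illustrative stand-ins rather than rs-trafilatura's actual internals:

```python
from typing import Callable

# Illustrative per-type extraction profiles; the real library's profiles
# and page-type taxonomy differ.
def extract_article(html: str) -> str:
    return "main article body"

def extract_product(html: str) -> str:
    return "description pulled from JSON-LD"

PROFILES: dict[str, Callable[[str], str]] = {
    "article": extract_article,
    "product": extract_product,
}

def classify(html: str) -> tuple[str, float]:
    # Stand-in for the rule-based + ML classifier: returns (type, confidence).
    if "application/ld+json" in html:
        return "product", 0.9
    return "article", 0.6

def extract(html: str, min_confidence: float = 0.5) -> dict:
    page_type, conf = classify(html)
    return {
        "type": page_type,
        "confidence": conf,
        "content": PROFILES[page_type](html),
        # Low-confidence extractions get flagged for a second pass.
        "flagged": conf < min_confidence,
    }
```

The useful property of this shape is that each profile can make type-specific assumptions (JSON-LD for products, post containers for forums) without those assumptions leaking into the other types.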

The result is rs-trafilatura — a Rust web content extraction library that classifies pages into 7 types and applies type-specific extraction strategies. It processes pages in 44ms on average. It outputs markdown with headings, links, tables, and formatting preserved.

But to know if any of this actually worked, I needed a benchmark. And that's where the real rabbit hole began.

Why Existing Benchmarks Don't Cut It

When I went looking for a benchmark to evaluate my extractor, I found:

  • ScrapingHub (2019): 181 pages. All articles.
  • CleanEval (2007): 797 pages. From the pre-HTML5 era.
  • Google-Trends (2017): 180 pages.
  • L3S-GN1 (2010): 621 news articles.

Every benchmark was either tiny, old, or article-only. On articles, every extractor looks great — they all score F1 > 0.90. It's like testing every car on a straight road and concluding they all handle equally well.

The interesting question isn't "can you extract a blog post?" — it's "can you handle the 40-50% of the web that isn't a blog post?"

Nobody was measuring that.

Building the Benchmark

So I built the Web Content Extraction Benchmark (WCXB). It took considerably longer than I expected.

2,008 pages from 1,613 domains across 7 page types:

Page Type      Count  What It Tests
Article        1,050  The baseline — blogs, news, editorials
Service          224  Landing pages, SaaS features pages, marketing
Forum            164  Discussion threads, Q&A, community posts
Product          147  Product descriptions, specs, pricing
Collection       151  Category pages, product grids
Listing          139  Content indexes, course catalogs
Documentation    133  API docs, tutorials, technical references

Each page has a full ground truth annotation — title, author, date, complete main content as plain text, plus "must include" and "must not include" snippet arrays that test content boundaries. The evaluation approach was heavily influenced by Bevendorff et al.'s comparative study at SIGIR 2023 — their finding that "performance is quite genre-dependent" was a direct motivation for this benchmark. I've tried to match their standard of rigour while extending the scope to page types their combined dataset didn't cover.
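Scoring in this space is typically bag-of-words token F1 against the ground-truth text, plus boundary checks against the snippet arrays. A simplified sketch of that style of scoring, not necessarily WCXB's exact harness:

```python
from collections import Counter

def token_f1(extracted: str, ground_truth: str) -> float:
    """Bag-of-words token F1 between extracted text and the annotation."""
    ext, gt = Counter(extracted.split()), Counter(ground_truth.split())
    overlap = sum((ext & gt).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext.values())
    recall = overlap / sum(gt.values())
    return 2 * precision * recall / (precision + recall)

def boundary_violations(extracted, must_include, must_not_include):
    """Checks the snippet arrays: content the extractor dropped or leaked."""
    missing = [s for s in must_include if s not in extracted]
    leaked = [s for s in must_not_include if s in extracted]
    return missing, leaked
```

The snippet checks catch failure modes F1 alone can hide: an extractor can score a respectable F1 while silently dropping the one paragraph that mattered or leaking a cookie banner.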

The annotation process was... involved. LLM-assisted drafting, then four review passes using Claude Opus agents to verify and fix quality issues, followed by my own manual review. Automated quality scans. Adversarial review where I'd investigate every file where my extractor disagreed with the ground truth.

I split the dataset into a 1,497-page development set and a 511-page held-out test set with matched page type distributions. The test set was never touched during development — so when I evaluate on it, the results are genuine.
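A matched-distribution split like this is just stratified sampling by page type. Something like the following hypothetical helper does the job (the seed and fraction here are illustrative):

```python
import random

def stratified_split(pages, test_frac=0.25, seed=7):
    """pages: list of (page_id, page_type). Returns (dev_ids, test_ids)
    with the page-type distribution matched across both splits."""
    by_type: dict = {}
    for pid, ptype in pages:
        by_type.setdefault(ptype, []).append(pid)
    rng = random.Random(seed)
    dev, test = [], []
    for ids in by_type.values():
        rng.shuffle(ids)
        k = round(len(ids) * test_frac)
        test.extend(ids[:k])
        dev.extend(ids[k:])
    return dev, test
```

Stratifying matters here because the minority types (documentation, listing) would otherwise be over- or under-represented in a small held-out set, making per-type test scores meaningless.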

The whole thing is released under CC-BY-4.0: GitHub | Zenodo (DOI: 10.5281/zenodo.19316874) | HuggingFace. And I hope my little contribution to the community helps someone.

What the Results Show

I ran 12 extraction systems on WCXB — 10 heuristic-based and 2 neural (LLM-based). Here's the overall picture:

System              F1     Type            Speed
rs-trafilatura      0.859  Heuristic + ML  44ms/page
MinerU-HTML (0.6B)  0.827  Neural (A100)   1,570ms/page
Trafilatura         0.791  Heuristic       94ms/page
dom-smoothie        0.762  Heuristic       27ms/page
ReaderLM-v2 (1.5B)  0.741  Neural (A100)   10,410ms/page

But the overall numbers hide the real story. Here's what happens when you break it down by page type:

Page Type      rs-trafilatura  MinerU-HTML  Trafilatura  Readability
Article        0.932           0.928        0.926        0.825
Documentation  0.931           0.838        0.888        0.736
Service        0.843           0.824        0.763        0.604
Forum          0.792           0.794        0.585        0.466
Collection     0.713           0.506        0.553        0.445
Listing        0.704           0.710        0.589        0.496
Product        0.670           0.619        0.567        0.407

On articles, the top three systems land within a single point of each other. On forums? A 33-point spread. On collections? 27 points. On products? 26 points.

This is what article-only benchmarks hide. You test on news articles, everything looks fine. You test on the actual diversity of the web, and the cracks show.

The Surprising Finding About LLMs

I expected the neural/LLM-based extractors — MinerU-HTML and ReaderLM-v2 — to handle page type diversity better than heuristic systems. They're trained on diverse data, they understand context, they should generalise.

They don't.

MinerU-HTML (a fine-tuned 0.6B model running on an A100 GPU) scores 0.928 on articles — nearly matching heuristic systems. But on collections it drops to 0.506. On products, 0.619. On forums, 0.794.

ReaderLM-v2 (1.5B parameters — a bigger model) does even worse: 0.417 on collections, 0.463 on products. The one bright spot for the neural approach is listings, where MinerU-HTML edges out rs-trafilatura (0.710 vs 0.704) — the only page type where it wins.

These neural systems were trained predominantly on article-like content. They've learned to extract articles really well. They haven't learned to handle product pages where content lives in JSON-LD, or service pages where content is distributed across a dozen DOM sections, or forums where class="comment" elements ARE the content. It's worth noting that MinerU-HTML's team built a larger internal benchmark (WebMainBench, 7,809 pages) but to my knowledge the full dataset hasn't been publicly released — only a 100-page evaluation subset is available. Without access to their training data distribution, it's hard to say whether this article bias is baked into the training set or the architecture.

Bigger model doesn't help either. ReaderLM-v2 at 1.5B scores lower than MinerU-HTML at 0.6B on every single page type.

The lesson: the "throw an LLM at it" approach works for articles but doesn't solve the diversity problem. For that, you need type-aware extraction — knowing what kind of page you're looking at and adjusting your strategy accordingly.

What This Means for SEO Analysis

Coming back to where I started — analysing content quality signals across the SERPs — this work gave me three things:

1. Reliable extraction across page types. I can now extract clean, structured content from product pages, service pages, documentation, forums — not just articles. This matters because the SERPs for commercial queries are full of non-article pages.

2. Markdown output with structure preserved. Headings, links, tables, bold text — all the structural signals that matter for SEO analysis. I can measure heading hierarchy depth, internal link patterns, content-to-boilerplate ratios, and structural coherence at scale.
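As a rough illustration, pulling those structural signals out of extracted markdown can be a few regexes. This sketch uses invented metric names, not my production feature set:

```python
import re

def structure_profile(md: str) -> dict:
    """Rough structural signals from extracted markdown; the metric
    names are invented for illustration."""
    heading_levels = [len(m.group(1))
                      for m in re.finditer(r"^(#{1,6})\s", md, re.M)]
    return {
        "heading_count": len(heading_levels),
        "max_heading_depth": max(heading_levels, default=0),
        "link_count": len(re.findall(r"\[[^\]]+\]\([^)]+\)", md)),
        "has_table": bool(re.search(r"^\|.+\|\s*$", md, re.M)),
    }
```

None of this is possible on a flat text dump, which is why markdown output was worth building into the extractor rather than bolting on afterwards.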

3. A confidence score for extraction quality. When the extractor isn't sure it got good content (SPAs, unusual layouts, heavy JavaScript), it tells me. I can route those pages to a different pipeline — or just flag them for manual review.

For SEO analysis workflows — where I'm looking for original data points, proprietary research, and first-party expertise signals across competitor content — clean extraction is the foundation everything else builds on. You can't measure content originality if you're measuring boilerplate.

Using this pipeline, I've done some interesting research into how content quality signals correlate with rankings. I scored 44,000 SERP results on content effort, originality, and topical coherence — here's what I found.

Work in Progress

I should be honest — while the benchmark is done and released, rs-trafilatura still needs work. Product page extraction (F1 = 0.670) and collection pages (0.713) are areas where I'm not satisfied with the results. The ML classifier does a decent job at 87% accuracy, but it struggles most on the minority page types: listings and services. Title extraction also needs improvement, and the markdown output pipeline needs another pass on tables and code blocks.

This whole project has been a labour of love. What started as a quick Rust port for speed turned into months of annotation work, ML experimentation, and benchmark infrastructure. But I'm happy to share it. If it helps someone build a better extractor, or saves a team from building their own evaluation set from scratch, then it was worth the effort.

Try It Yourself

The benchmark is open source. If you're building or evaluating content extraction tools, working as an SEO on content analysis at scale, building RAG pipelines, or training LLMs on web data, this benchmark will show you where your extraction pipeline is actually failing.

The web isn't just articles. Your extraction benchmarks shouldn't be either.

Extraction Systems

  • Trafilatura — Barbaresi, A. (2021). "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction" ACL 2021. GitHub
  • MinerU-HTML (Dripper) — Liu, M., et al. (2025). "Dripper: Token-Efficient Main Content Extraction from HTML via Lightweight Language Models" arXiv:2511.23119 | GitHub
  • ReaderLM-v2 — Jina AI (2025). "ReaderLM-v2: HTML-to-Markdown Conversion with a Small Language Model" arXiv:2503.01151 | HuggingFace
  • Boilerpipe — Kohlschütter, C., Fankhauser, P., Nejdl, W. (2010). "Boilerplate Detection Using Shallow Text Features" WSDM 2010.
  • BoilerNet — Leonhardt, J., Anand, A., Khosla, M. (2020). "Boilerplate Removal Using a Neural Sequence Labeling Model" WWW 2020 Companion.
  • Web2Text — Vogels, T., Ganea, O.E., Eickhoff, C. (2018). "Web2Text: Deep Structured Boilerplate Removal" ECIR 2018.

Existing Benchmarks

  • Bevendorff et al. — Bevendorff, J., Gupta, S., Kiesel, J., Stein, B. (2023). "An Empirical Comparison of Web Content Extraction Algorithms" SIGIR 2023. GitHub — Combines 8 datasets covering roughly 3,100 pages. Their finding that "performance is quite genre-dependent" was a direct motivation for creating WCXB.
  • ScrapingHub — (2019). Article extraction benchmark. 181 article pages. GitHub
  • CleanEval — Baroni, M., et al. (2008). "CleanEval: A Competition for Cleaning Web Pages" LREC 2008. 797 pages.
  • WebMainBench — OpenDataLab (2025). 7,809 pages with tag-level annotations. The released evaluation subset covers 100 pages. GitHub
  • L3S-GN1 — Kohlschütter, C., Nejdl, W. (2008). "A Densitometric Approach to Web Page Segmentation" CIKM 2008. 621 news pages.


Murrough Foley
Let's Connect

Find me on LinkedIn or X.