rs-trafilatura: A Rust Web Content Extraction Library

Murrough Foley

I built rs-trafilatura because I needed to extract clean content from web pages at scale — not just articles, but product pages, forums, documentation, landing pages, the whole mix. The existing tools were either Python-only (too slow for my pipeline), or Rust crates that handled article pages well but fell apart on anything else.

It started as a straightforward port of Adrien Barbaresi's excellent Trafilatura Python library to Rust. I needed the speed — processing thousands of pages through a Python pipeline was a bottleneck in my SEO analysis workflow. But as I tested across different page types, I kept hitting the same problem: one-size-fits-all extraction doesn't work on the modern web.

So it grew. Page type classification. Type-specific extraction profiles. A confidence score. Markdown output. An ML classifier when the rule-based heuristics hit their ceiling.

What It Does

rs-trafilatura extracts the main content from any web page — title, author, date, and the full article body — while stripping navigation, ads, sidebars, cookie banners, and other boilerplate. What makes it different from other extraction libraries:

  • Page type classification: A three-stage classifier (URL heuristics, HTML signal analysis, XGBoost with 181 features) detects 7 page types at 87% accuracy — article, forum, product, collection, listing, documentation, and service. Each type gets a specialised extraction profile.

  • Extraction confidence: Every extraction includes a quality score (0.0-1.0) from a 27-feature XGBoost regression model that predicts the expected F1 score. Pages scoring below 0.80 are candidates for LLM fallback — more on this below.

  • Markdown output: GitHub Flavored Markdown preserving headings, links, tables, bold/italic, code blocks, and blockquotes. This is what I actually need for SEO analysis — the structural signals matter as much as the text.

  • Speed: 44ms per page on CPU. That's 22 pages per second on commodity hardware, compared to 1,570ms per page for MinerU-HTML on an A100 GPU.

From Rust:

use rs_trafilatura::{extract_with_options, Options};

let html = std::fs::read_to_string("page.html")?;
let result = extract_with_options(&html, &Options::default())?;

println!("Title: {:?}", result.metadata.title);
println!("Author: {:?}", result.metadata.author);
println!("Content: {}", result.content_text);
println!("Page type: {:?}", result.metadata.page_type);
println!("Confidence: {:.2}", result.extraction_quality);

And the same call from Python:

import rs_trafilatura

html = open("page.html").read()
result = rs_trafilatura.extract(html, url="https://example.com")

print(f"Title: {result.title}")
print(f"Content: {result.main_content[:200]}...")
print(f"Page type: {result.page_type}")
print(f"Confidence: {result.extraction_quality:.2f}")

The Python package bundles four Rust crates into a single native extension via PyO3 — no subprocess overhead, just compiled Rust called directly from Python.

Who Is This For?

Honestly, I built this for myself. I wanted to understand what Google sees when it evaluates a web page, and to do that I needed to extract clean, structured content from thousands of pages across search results — not just articles, but the full mix of page types you find in a real SERP. The existing tools didn't cut it for that, so I built one that did.

But content extraction at scale turns out to be a problem a lot of people are trying to solve — and some of them have turned it into serious businesses:

Search and data companies are selling extracted web content as a service. Jina AI built ReaderLM-v2 and their Reader API specifically for this — give them a URL, get back clean markdown. Firecrawl does the same with JS rendering and anti-bot handling. Tavily packages search + extraction as an API for AI agents. Diffbot has been doing structured web data extraction for over a decade. Zyte (formerly ScrapingHub — the same team behind the article extraction benchmark) sells extraction at enterprise scale. The market exists because the problem is hard and everybody needs it solved.

RAG and LLM pipeline builders need clean content to feed to embedding models and context windows. Boilerplate in your retrieval context means wasted tokens and worse answers. Every RAG pipeline has an extraction step, and most of them are using Readability or BeautifulSoup and getting mediocre results on non-article pages.

LLM training data teams process billions of web pages. The quality of the extraction directly impacts the quality of the model. Common Crawl provides the raw HTML — but turning that into clean training text is where extraction quality matters. A few percentage points of F1 across billions of pages means millions of pages of noise removed or real content preserved. The team behind MinerU-HTML demonstrated one approach to this with their AICC corpus — clustering Common Crawl pages by DOM template similarity, running an LLM on one representative page per cluster, then distilling those decisions into lightweight rules applied to the remaining 99.6% of pages. Smart architecture.

SEO practitioners use extraction to approximate what search engines see. Content audits, competitor analysis, SERP quality scoring, content gap analysis — all of these start with extracting the actual content from a page. If your extraction tool is including navigation menus and cookie banners in the "content," your analysis is wrong before you've started.

Academic researchers in information retrieval, web mining, and NLP need reproducible extraction for experiments. The WCXB benchmark exists partly because I couldn't find a decent benchmark that tested extraction across page types — and I'm sure I wasn't the only one looking.

If you process web pages at scale and care about what you're actually extracting, this is for you.

Benchmark Results

I benchmarked rs-trafilatura against other extraction systems on two datasets: the ScrapingHub benchmark (181 articles) and the WCXB benchmark I built (2,008 pages across 7 page types, split into a 1,497-page dev set and 511-page held-out test set).
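The F1 numbers that follow use the usual convention for extraction benchmarks: token-level overlap between the extracted text and a hand-labelled gold reference. A minimal sketch of that metric (my illustration, with naive whitespace tokenization rather than either benchmark's exact tokenizer):

```python
from collections import Counter

def token_f1(extracted: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over the
    multiset overlap of tokens between extraction and gold reference."""
    ext = Counter(extracted.lower().split())
    ref = Counter(gold.lower().split())
    overlap = sum((ext & ref).values())  # tokens present in both, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext.values())  # how much extracted text is correct
    recall = overlap / sum(ref.values())     # how much gold text was recovered
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the quick brown fox", "the quick brown fox jumps"), 3))
```

Tokenization choices matter at the margins (punctuation, casing), but the ranking between extractors tends to be stable across reasonable choices.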

ScrapingHub Benchmark (181 articles)

Library                F1 Score   Precision   Recall
rs-trafilatura         0.966      0.942       0.991
trafilatura (Python)   0.948      0.933       0.983

WCXB Development Set (1,497 pages, 7 page types)

Library                  F1 Score   Precision   Recall
rs-trafilatura           0.859      0.863       0.890
MinerU-HTML (0.6B LLM)   0.827      0.845       0.840
Trafilatura (Python)     0.791      0.852       0.793
dom_smoothie             0.762      0.806       0.768
ReaderLM-v2 (1.5B LLM)   0.741      0.741       0.790

WCXB Held-Out Test Set (511 pages, never used during development)

Library                F1 Score   Precision   Recall
rs-trafilatura         0.893      0.900       0.910
Trafilatura (Python)   0.833      0.886       0.828

rs-trafilatura outperforms the original Python implementation by 6.8 F1 points on the development set and 6.0 points on the held-out set. It also beats both LLM-based extractors while running at 44ms per page on CPU — compared to 1,570ms (MinerU-HTML) and 10,410ms (ReaderLM-v2) on an A100 GPU.

How Rust Extraction Crates Compare

For context, here's how the main Rust extraction crates stack up on the WCXB benchmark:

Main content extractors (filter boilerplate):

Library                  F1 Score   Best For
rs-trafilatura           0.859      Complete extraction with metadata, 7 page types
dom_smoothie             0.762      Readability-style extraction
dom-content-extraction   0.731      CETD algorithm, research-backed

Full text extractors (extract everything, no filtering):

Library         F1 Score   Trade-off
nanohtml2text   0.670      Fast (606µs) but includes boilerplate
fast_html2md    0.664      Markdown output, includes boilerplate

The distinction matters: full text extractors capture everything including navigation and footers. Main content extractors identify and return only the article body. For SEO analysis, content aggregation, or LLM training data, you want the latter.
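The simplest signal behind that distinction is link density: navigation and footer blocks are mostly anchor text, while body paragraphs are not. A toy illustration of the idea (hypothetical numbers and threshold, not rs-trafilatura's actual heuristic):

```python
def link_density(text_len: int, link_text_len: int) -> float:
    """Fraction of a block's characters that live inside <a> tags."""
    return link_text_len / text_len if text_len else 1.0

# (name, total chars, chars inside links) for three hypothetical page blocks
blocks = [
    ("nav",     120, 115),   # menu: almost all anchor text
    ("article", 900,  60),   # body paragraph with a few inline links
    ("footer",  200, 170),   # link farm
]

# keep only blocks where less than half the text is link text
kept = [name for name, total, linked in blocks if link_density(total, linked) < 0.5]
print(kept)  # only the article block survives
```

A full-text extractor skips this filtering step entirely, which is exactly why it captures menus and footers along with the body.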

A note on dom_smoothie: it's an excellent crate and one I studied closely while building rs-trafilatura. It's also faster — 27ms/page vs rs-trafilatura's 44ms — because it's doing Readability-style extraction without the classification and type-specific profile overhead. If you're working primarily with articles and blogs, dom_smoothie is a strong choice. rs-trafilatura is slower because it's classifying the page type and applying different extraction logic on every page — that's the tradeoff for handling non-article content.

Both dom_smoothie and rs-trafilatura are built on niklak's dom_query — a DOM manipulation library for Rust that makes CSS selector-based traversal straightforward. It's one of those foundational crates that doesn't get enough credit.

For a deep dive into Rust HTML parsing crates (html5ever, scraper, select.rs, dom_query), I'd recommend Evan Schwartz's comprehensive comparison of 13 Rust crates for extracting text from HTML.

Why Page Types Matter

This is the part that surprised me most. On articles, every extractor converges — F1 between 0.88 and 0.93. On everything else, they diverge wildly:

Page Type       N     rs-trafilatura   MinerU-HTML   Trafilatura   ReaderLM-v2
Article         793   0.932            0.928         0.926         0.878
Documentation   91    0.931            0.838         0.888         0.776
Service         165   0.843            0.824         0.763         0.703
Forum           113   0.792            0.794         0.585         0.589
Collection      117   0.713            0.506         0.553         0.417
Listing         99    0.704            0.710         0.589         0.559
Product         119   0.670            0.619         0.567         0.463

Forums: a 21-point spread between best and worst. Collections: 30 points. Products: 21 points. Article-only benchmarks hide all of this.

The reason is structural. A forum thread's user posts match class="comment" — which most extractors treat as boilerplate. A service page distributes content across 10 different <section> elements — single-node extraction captures one and misses the rest. A product page stores its description in JSON-LD structured data — invisible to DOM-only extractors.

rs-trafilatura handles these by classifying the page first, then applying a type-specific extraction strategy. Forum profiles treat comment elements as content. Service page profiles merge content from multiple DOM sections. Product profiles fall back to JSON-LD when DOM extraction fails.

Confidence Scoring and Hybrid Pipelines

The extraction quality predictor is the feature I'm most interested in developing further. Right now it's a 27-feature XGBoost model that looks at signals available at extraction time — extraction-to-HTML ratio, paragraph structure, link density, content length relative to page type expectations, boilerplate keywords in the opening text. At a threshold of 0.80, it correctly flags about 35% of poorly-extracted pages while maintaining 97% precision on the rest.
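To make that feature set concrete, here is a sketch of a few signals of the same kind, computed from nothing but the raw HTML and the extracted text (the names and thresholds are my illustration; the real model uses 27 features and learned weights):

```python
def quality_features(html: str, extracted: str) -> dict:
    """A handful of extraction-time quality signals, illustrative only."""
    paragraphs = [p for p in extracted.split("\n\n") if p.strip()]
    opening = extracted[:200].lower()
    boilerplate_markers = ("cookie", "subscribe", "accept all", "sign in")
    return {
        # how much of the raw page survived extraction
        "extraction_ratio": len(extracted) / max(len(html), 1),
        # very short output often means a failed extraction
        "char_count": len(extracted),
        "paragraph_count": len(paragraphs),
        # boilerplate phrases at the top are a strong failure signal
        "opening_boilerplate": any(m in opening for m in boilerplate_markers),
    }

feats = quality_features("<html>" + "x" * 994, "Accept all cookies\n\nReal text.")
print(feats["opening_boilerplate"])  # the cookie banner leaked into the output
```

Individually each signal is weak; the regression model earns its keep by combining them against page-type expectations.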

The practical use is hybrid pipelines. Run rs-trafilatura on everything at 44ms/page. For the ~8% of pages where confidence is low, route them to a neural extractor. On the WCXB held-out test set, this pushes F1 from 0.893 to 0.910 — the best of both worlds.

But the routing has to be page-type-aware. MinerU-HTML helps on articles, forums, and service pages, but actually performs worse than rs-trafilatura on collections (0.506 vs 0.713) and products (0.619 vs 0.670). Sending low-confidence collection pages to MinerU-HTML makes things worse, not better.

Where I'd like to take this: instead of routing all low-confidence pages to a single general-purpose LLM, route them to page-type-specific models trained for that exact extraction task. A small model fine-tuned specifically on product page extraction. Another trained on forum threads. The page type classifier already tells you which model to call — the infrastructure just needs specialised models on the other end. That's the architecture I think wins long-term: fast heuristics for 90%+ of pages, specialised neural models for the page types where heuristics hit their ceiling.

I haven't built this yet. But the confidence scorer and page type classifier are the foundation for it.
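Even before the specialised models exist, the routing rule itself is simple: fall back only when the fallback is expected to win for that page type. A sketch using the per-type F1 numbers from the dev-set table above (the routing function is my illustration, not shipped code):

```python
# Per-type F1 on the WCXB dev set for the fast extractor vs. the LLM fallback,
# taken from the comparison table above.
FAST_F1 = {"article": 0.932, "forum": 0.792, "collection": 0.713, "product": 0.670}
LLM_F1  = {"article": 0.928, "forum": 0.794, "collection": 0.506, "product": 0.619}

def route(page_type: str, confidence: float, threshold: float = 0.80) -> str:
    """Send a low-confidence page to the LLM only if the LLM is actually
    better for this page type — otherwise the fallback makes things worse."""
    if confidence >= threshold:
        return "fast"
    if LLM_F1.get(page_type, 0.0) > FAST_F1.get(page_type, 1.0):
        return "llm"
    return "fast"  # e.g. collections: keep the heuristic result

print(route("forum", 0.55))       # llm
print(route("collection", 0.55))  # fast
```

Swapping the single LLM column for a table of per-type specialist models is the long-term version of the same dispatch.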

Headings, SEO, and the Limits of HTML Extraction

An interesting finding from the benchmark work: only a fraction of visually prominent section titles on web pages use semantic <h1>-<h6> tags. The rest use <strong>, styled <span> elements, or CSS font-weight: bold to create visual headings without semantic markup.

This matters for both SEO and extraction:

  • Google's Heading Vector Patent assigns numerical vectors to <h1>-<h6> headings for topic understanding.
  • Google's Page Segmentation Patent describes pseudo-rendering pages to detect visually prominent text regardless of HTML tags.
  • HTML-only extraction (what rs-trafilatura and all heuristic extractors do) can only detect headings that use <h> tags. A <strong>Section Title</strong> becomes bold text, not a heading.

If you control the HTML: always use semantic heading tags. A <strong> section title that looks like a heading is invisible to most extraction tools and provides weaker signals to search engines.

For extraction pipelines: this gap is a theoretical ceiling for any pure HTML-to-markdown converter — including rs-trafilatura. A <strong>Section Title</strong> will come through as bold text in the markdown output, not as a heading. There's no way around this without visual rendering or LLM interpretation. It's one more argument for hybrid pipelines that combine fast heuristic extraction with LLM fallback on pages where structural detection is uncertain.
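Part of that ceiling can be clawed back heuristically, though. One cheap approximation is to treat a short paragraph whose entire content is a single <strong> element as a heading candidate. A sketch (my own heuristic, not something rs-trafilatura currently does):

```python
import re

def visual_headings(html: str, max_len: int = 80) -> list[str]:
    """Heuristically spot 'visual headings': short paragraphs whose entire
    content is one <strong> element. Regex-based for brevity; a real
    implementation would inspect DOM nodes instead."""
    pattern = re.compile(
        r"<p>\s*<strong>([^<]{1,%d})</strong>\s*</p>" % max_len,
        re.IGNORECASE,
    )
    return [m.strip() for m in pattern.findall(html)]

html = "<p><strong>Shipping Policy</strong></p><p>We ship <strong>fast</strong>.</p>"
print(visual_headings(html))  # only the standalone bold paragraph qualifies
```

The inline <strong> in running text is correctly left alone; only bold text that structurally stands on its own gets promoted. It is still a guess — which is exactly why the uncertain cases belong in the LLM-fallback bucket.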

What Still Needs Work

I want to be upfront about where this isn't good enough yet.

Page type classification sits at 87%. That means roughly 1 in 8 pages gets the wrong extraction profile. For articles, forums, and products, accuracy is in the low 90s — fine. But listings (53% recall) and service pages (71%) get misclassified too often, usually as articles. I'd like to push overall accuracy into the low 90s, which probably means better features for distinguishing editorial content from content indexes.

Product extraction (F1 = 0.670) and collection pages (0.713) aren't where I want them. Products are hard because so much content lives in JSON-LD and tabbed interfaces. Collections are hard because the "content" is a mix of product cards and filter UI. These are the page types where the heuristic approach hits its ceiling hardest.

This is a side project. I'm an SEO consultant — paid work takes precedence, and there are weeks where rs-trafilatura gets no attention at all. Progress comes in bursts when I have time between client projects. I mention this not as an excuse but so expectations are calibrated — this isn't backed by a team or a company, it's one person working on it when he can.

If any of these limitations are dealbreakers for your use case, MinerU-HTML is a strong alternative for article-heavy workloads, and Firecrawl handles the infrastructure side if you'd rather not run your own extraction.

Getting Started

Rust — install from crates.io:

[dependencies]
rs-trafilatura = "0.2"

Python — install from PyPI:

pip install rs-trafilatura

Or use the CLI binary:

curl -s https://example.com/article | extract_stdin

The output is JSON with title, author, date, main content, page type, classification confidence, and extraction confidence.


Citations & References

Google Patents on Content Extraction & Page Segmentation

  • Page Segmentation — Mehta, B., et al. (2011). "Segmenting Web Pages." US Patent 7,930,307. Google. — Describes pseudo-rendering pages to identify visually distinct content regions, independent of HTML tag structure. SEO by the Sea analysis
  • Heading Vectors — Google patent on assigning numerical vectors to <h1>-<h6> headings for topical understanding and document structure analysis. MarketBrew analysis
  • DOM Distiller — Google's production content extraction system powering Chrome Reader Mode. Based on Boilerpipe with Readability-style fallback. Source code

Foundational Extraction Research

  • Boilerpipe — Kohlschütter, C., Fankhauser, P., Nejdl, W. (2010). "Boilerplate Detection using Shallow Text Features." WSDM 2010. — The foundational paper on text-density-based content extraction. Treats extraction as binary classification of text blocks.
  • Trafilatura — Barbaresi, A. (2021). "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Retrieval." ACL 2021. GitHub — The most widely-used Python extraction library. rs-trafilatura began as a port of this.
  • Readability — Mozilla (2010). The algorithm behind Firefox Reader View. Scores DOM nodes by text density and link ratio. GitHub
  • jusText — Pomikálek, J. (2011). "Removing Boilerplate and Duplicate Content from Web Corpora." PhD thesis, Masaryk University. GitHub

Neural / LLM Extraction

  • MinerU-HTML (Dripper) — Liu, M., et al. (2025). "Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM." arXiv:2511.23119 | GitHub — Fine-tunes Qwen3-0.6B for binary element classification.
  • ReaderLM-v2 — Jina AI (2025). "ReaderLM-v2: HTML to Markdown with a Small Language Model." arXiv:2503.01151 | HuggingFace — 1.5B parameter model that generates Markdown directly from HTML.
  • BoilerNet — Leonhardt, J., Anand, A., Khosla, M. (2020). "Boilerplate Removal using a Neural Sequence Labeling Model." WWW 2020 Companion.
  • Web2Text — Vogels, T., Ganea, O.E., Eickhoff, C. (2018). "Web2Text: Deep Structured Boilerplate Removal." ECIR 2018.

Comparative Studies & Benchmarks

  • Bevendorff et al. — Bevendorff, J., Gupta, S., Kiesel, J., Stein, B. (2023). "An Empirical Comparison of Web Content Extraction Algorithms." SIGIR 2023. GitHub — The most comprehensive existing comparison. Combined 8 datasets, evaluated 14 extractors. Key finding: "performance is quite genre-dependent."
  • ScrapingHub — (2019). Article Extraction Benchmark. 181 pages. GitHub
  • Evan Schwartz — (2024). "Comparing 13 Rust Crates for Extracting Text from HTML." Blog — Comprehensive comparison of Rust HTML parsing options.

This Work


Murrough Foley (ORCID: 0009-0008-3127-2101) is an SEO consultant and the author of rs-trafilatura. The WCXB benchmark is available with DOI 10.5281/zenodo.19316874.

Let's Connect

Find me on LinkedIn or X.