# OneCrawler
OneCrawler is an async Python crawling framework for discovering URLs, extracting links, and scraping structured content.
The framework is designed around a practical crawler workflow:
- Discover candidate URLs from sitemaps or browser-based link traversal.
- Filter and limit those URLs to the section you actually care about.
- Scrape the selected pages with either heuristic extraction or structured GenAI extraction.
- Save the result in the format your application needs.
**Recommended first path**

Start with sitemap discovery, store the URL list, then scrape those URLs in a separate step. This gives you cleaner retries and easier debugging.
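A minimal sketch of that split, assuming `UniversalSiteMap.run()` returns a JSON-serializable list of URL strings (the full discovery example appears under Recommended Workflow below):

```python
import asyncio
import json

from onecrawler import CrawlerSettings, UniversalSiteMap


async def discover_and_store() -> None:
    # Keep discovery bounded; the values here are illustrative.
    settings = CrawlerSettings(link_extraction_limit=500, concurrency=10)
    urls = await UniversalSiteMap(settings).run("https://example.com")

    # Store the URL list so scraping can run, and be retried, as a separate step.
    with open("urls.json", "w", encoding="utf-8") as fh:
        json.dump(urls, fh, indent=2)


if __name__ == "__main__":
    asyncio.run(discover_and_store())
```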
## Why OneCrawler Exists
Many crawling projects start as small scripts and become hard to maintain once they need retries, concurrency, browser rendering, sitemap fallback, URL filtering, and structured extraction. OneCrawler gives those concerns a shared settings model and a few focused async engines so production scripts stay readable.
**Crawling needs boundaries**

Always set reasonable limits, filters, retries, and concurrency. Clear crawl boundaries protect both your job and the target site.
Use it when you are building:
- content ingestion pipelines for news, blogs, catalogs, or documentation sites
- search indexing jobs
- data collection scripts for analysis or monitoring
- internal tools that need repeatable URL discovery and content extraction
- prototypes that may later become scheduled crawler jobs
**Use GenAI for semantics, not bulk text**

Heuristic extraction is the better default for large text extraction jobs. Use GenAI when you need summaries, classification, field normalization, or a typed schema.
## Recommended Workflow
Start with sitemap discovery whenever possible. A sitemap is usually the fastest, cleanest, and least expensive way to collect URLs because it avoids opening many browser pages just to discover links.
```python
import asyncio

from onecrawler import CrawlerSettings, UniversalSiteMap


async def main():
    settings = CrawlerSettings(
        link_extraction_limit=500,
        include_link_patterns=["/news/*"],
        concurrency=10,
    )
    sitemap = UniversalSiteMap(settings)
    urls = await sitemap.run("https://example.com")
    print(f"Found {len(urls)} URLs")


if __name__ == "__main__":
    asyncio.run(main())
```
If the site has no useful sitemap, use `LinkExtractionEngine`:

- `shallow` for links present on one page
- `deep` for recursive same-site traversal

Then pass the final URL list to `ScraperEngine`.
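The sketch below shows that fallback path, assuming `LinkExtractionEngine` and `ScraperEngine` follow the same settings-plus-`run()` pattern as `UniversalSiteMap`; the `mode` argument and the `run(urls)` call are assumptions, so check the Link Extraction and Scraping Engine pages for the exact signatures.

```python
import asyncio

from onecrawler import CrawlerSettings, LinkExtractionEngine, ScraperEngine


async def main() -> None:
    settings = CrawlerSettings(
        link_extraction_limit=200,
        include_link_patterns=["/news/*"],
        concurrency=5,
    )

    # Fallback discovery when no sitemap is available.
    # NOTE: mode="deep" and this run() signature are assumptions, not the
    # documented API; see the Link Extraction page for the exact interface.
    extractor = LinkExtractionEngine(settings)
    urls = await extractor.run("https://example.com/news/", mode="deep")

    # Hand the final URL list to the scraping engine (heuristic extraction
    # by default). The run(urls) signature is likewise an assumption.
    scraper = ScraperEngine(settings)
    pages = await scraper.run(urls)
    print(f"Scraped {len(pages)} pages")


if __name__ == "__main__":
    asyncio.run(main())
```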
## Choosing The Right Tool
| Goal | Recommended feature | Why |
|---|---|---|
| Collect most public URLs quickly | UniversalSiteMap | Uses robots.txt, common sitemap paths, nested sitemap indexes, and optional HTML fallback |
| Inspect one listing page | shallow link extraction | Lower crawl cost and easier to reason about |
| Explore a site section recursively | deep link extraction | Follows internal links until your settings limit is reached |
| Extract readable article text | heuristic scraping | Fast, deterministic, and does not require model calls |
| Produce strongly typed output | GenAI scraping with a Pydantic schema | Best fit when downstream systems require a stable structured shape |
| Avoid noisy crawls | include_link_patterns | Keeps discovery focused on URL paths you trust |
## Documentation Map
- Installation: package setup, browser requirements, optional extras
- Quick start: first complete discovery and scraping workflows
- Configuration: crawler settings and configuration
- Link Extraction: `LinkExtractionEngine` and link discovery
- Sitemap Discovery: `UniversalSiteMap` and URL collection
- Scraping Engine: `ScraperEngine` and content extraction
- Settings Configuration: `CrawlerSettings` and configuration classes
- API reference: public classes exported from `onecrawler`
- Troubleshooting: common failures and fixes
## Production Principles
Prefer sitemaps before crawling pages. They are faster, friendlier to target sites, and usually more complete than what a browser can discover from navigation pages.
Constrain every job. Set link_extraction_limit, include_link_patterns, concurrency, request_timeout, and max_retries explicitly so a crawl behaves predictably when the target site changes.
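A fully constrained configuration might look like the sketch below; the specific values are illustrative, not library defaults, and the timeout unit is an assumption.

```python
from onecrawler import CrawlerSettings

# Every bound set explicitly; values are illustrative, not defaults.
settings = CrawlerSettings(
    link_extraction_limit=1000,          # hard cap on discovered URLs
    include_link_patterns=["/docs/*"],   # only follow URL paths you trust
    concurrency=5,                       # simultaneous requests/pages
    request_timeout=30,                  # time before a fetch is abandoned (unit assumed to be seconds)
    max_retries=2,                       # retries per failed request
)
```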
Use heuristic scraping by default. It is cheaper and more repeatable. Move to GenAI when you need semantic interpretation, field normalization, or structured output in a predefined Pydantic schema.
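When you do take the GenAI path, the target shape is a Pydantic model. The sketch below shows only the schema side; the field names are illustrative, and how the model is handed to `ScraperEngine` is covered on the Scraping Engine page.

```python
from typing import Optional

from pydantic import BaseModel


class Article(BaseModel):
    """Illustrative target shape for GenAI extraction; field names are assumptions."""

    title: str
    author: Optional[str] = None
    published_date: Optional[str] = None
    summary: str
    topics: list[str] = []
```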
Treat browser crawling as the heavier tool. It is useful for JavaScript-rendered pages and dynamic link discovery, but it has more moving parts than sitemap parsing or direct HTTP fetching.