
OneCrawler


OneCrawler is an async Python crawling framework for discovering URLs, extracting links, and scraping structured content.

The framework is designed around a practical crawler workflow:

  1. Discover candidate URLs from sitemaps or browser-based link traversal.
  2. Filter and limit those URLs to the section you actually care about.
  3. Scrape the selected pages with either heuristic extraction or structured GenAI extraction.
  4. Save the result in the format your application needs.

Recommended first path

Start with sitemap discovery, store the URL list, then scrape those URLs in a separate step. This gives you cleaner retries and easier debugging.
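
A minimal sketch of that two-phase flow is shown below. It assumes ScraperEngine is importable from onecrawler, takes the same settings object, and exposes an async run(urls) method like UniversalSiteMap does; the import path, constructor, and method signature are assumptions, so confirm them against the scraping documentation.

import asyncio
import json

from onecrawler import CrawlerSettings, ScraperEngine, UniversalSiteMap


async def discover(settings: CrawlerSettings) -> None:
    # Phase 1: collect URLs from the sitemap and persist them so the
    # scraping phase can be retried without re-discovering anything.
    sitemap = UniversalSiteMap(settings)
    urls = await sitemap.run("https://example.com")
    with open("urls.json", "w") as f:
        json.dump(list(urls), f)  # assumes urls is a list of URL strings


async def scrape(settings: CrawlerSettings) -> None:
    with open("urls.json") as f:
        urls = json.load(f)
    # Phase 2 (assumed API): ScraperEngine(settings) and run(urls) mirror
    # UniversalSiteMap; adjust to the real ScraperEngine interface.
    scraper = ScraperEngine(settings)
    results = await scraper.run(urls)
    print(f"Scraped {len(results)} pages")


async def main() -> None:
    settings = CrawlerSettings(
        link_extraction_limit=500,
        include_link_patterns=["/news/*"],
        concurrency=10,
    )
    await discover(settings)
    await scrape(settings)


if __name__ == "__main__":
    asyncio.run(main())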

Why OneCrawler Exists

Many crawling projects start as small scripts and become hard to maintain once they need retries, concurrency, browser rendering, sitemap fallback, URL filtering, and structured extraction. OneCrawler gives those concerns a shared settings model and a few focused async engines so production scripts stay readable.

Crawling needs boundaries

Always set reasonable limits, filters, retries, and concurrency. Clear crawl boundaries protect both your job and the target site.
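
As a sketch, a bounded job might set all of these up front. The values are illustrative, and it is assumed here that request_timeout and max_retries are CrawlerSettings fields alongside the others mentioned on this page.

from onecrawler import CrawlerSettings

# Illustrative boundaries; field availability and units (e.g. seconds for
# request_timeout) should be confirmed against the settings reference.
settings = CrawlerSettings(
    link_extraction_limit=500,          # stop discovering after 500 URLs
    include_link_patterns=["/blog/*"],  # stay inside the section you care about
    concurrency=10,                     # cap simultaneous requests
    request_timeout=30,                 # assumed to be seconds per request
    max_retries=3,                      # bounded retries per URL
)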

Use it when you are building:

  • content ingestion pipelines for news, blogs, catalogs, or documentation sites
  • search indexing jobs
  • data collection scripts for analysis or monitoring
  • internal tools that need repeatable URL discovery and content extraction
  • prototypes that may later become scheduled crawler jobs

Use GenAI for semantics, not bulk text

Heuristic extraction is the better default for large text extraction jobs. Use GenAI when you need summaries, classification, field normalization, or a typed schema.

Start with sitemap discovery whenever possible. A sitemap is usually the fastest, cleanest, and least expensive way to collect URLs because it avoids opening many browser pages just to discover links.

import asyncio

from onecrawler import CrawlerSettings, UniversalSiteMap


async def main():
    # Bound the crawl up front: at most 500 URLs, only /news/ paths,
    # and no more than 10 concurrent requests.
    settings = CrawlerSettings(
        link_extraction_limit=500,
        include_link_patterns=["/news/*"],
        concurrency=10,
    )

    # Discover URLs via robots.txt, common sitemap paths, and nested
    # sitemap indexes, with an optional HTML fallback.
    sitemap = UniversalSiteMap(settings)
    urls = await sitemap.run("https://example.com")
    print(f"Found {len(urls)} URLs")


if __name__ == "__main__":
    asyncio.run(main())

If the site has no useful sitemap, use LinkExtractionEngine:

  • shallow for links present on one page
  • deep for recursive same-site traversal

Then pass the final URL list to ScraperEngine.
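
A hedged sketch of that fallback follows. It assumes LinkExtractionEngine accepts the settings object plus a mode switch and exposes an async run(url) like UniversalSiteMap; the mode parameter name and the run signature are guesses, not confirmed API. The resulting list is then handed to ScraperEngine exactly as in the two-phase sketch earlier on this page.

import asyncio

from onecrawler import CrawlerSettings, LinkExtractionEngine


async def main() -> None:
    settings = CrawlerSettings(
        link_extraction_limit=200,
        include_link_patterns=["/docs/*"],
        concurrency=5,
    )

    # Hypothetical mode switch: "shallow" for links present on one page,
    # "deep" for recursive same-site traversal. The real parameter
    # (its name, values, or separate classes) may differ.
    extractor = LinkExtractionEngine(settings, mode="deep")
    urls = await extractor.run("https://example.com/docs/")
    print(f"Discovered {len(urls)} URLs")  # pass these to ScraperEngine next


if __name__ == "__main__":
    asyncio.run(main())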

Choosing The Right Tool

Goal | Recommended feature | Why
Collect most public URLs quickly | UniversalSiteMap | Uses robots.txt, common sitemap paths, nested sitemap indexes, and optional HTML fallback
Inspect one listing page | shallow link extraction | Lower crawl cost and easier to reason about
Explore a site section recursively | deep link extraction | Follows internal links until your configured limits are reached
Extract readable article text | heuristic scraping | Fast, deterministic, and does not require model calls
Produce strongly typed output | GenAI scraping with a Pydantic schema | Best fit when downstream systems require a stable structured shape
Avoid noisy crawls | include_link_patterns | Keeps discovery focused on URL paths you trust

Production Principles

Prefer sitemaps before crawling pages. They are faster, friendlier to target sites, and usually more complete than what a browser can discover from navigation pages.

Constrain every job. Set link_extraction_limit, include_link_patterns, concurrency, request_timeout, and max_retries explicitly so a crawl behaves predictably when the target site changes.

Use heuristic scraping by default. It is cheaper and more repeatable. Move to GenAI when you need semantic interpretation, field normalization, or structured output in a predefined Pydantic schema.
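
For example, the stable structured shape can be described with a Pydantic model like the illustrative one below. How the schema is handed to the GenAI scraper is not shown on this page, so the wiring comment is an assumption to verify against the scraping documentation.

from pydantic import BaseModel


class Article(BaseModel):
    """Illustrative target schema for structured GenAI extraction."""

    title: str
    author: str | None = None
    published_at: str | None = None
    summary: str
    tags: list[str] = []


# Assumption: the schema is supplied to the GenAI scraper, for example
# something like ScraperEngine(settings, schema=Article); the real
# parameter name and mechanism may differ.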

Treat browser crawling as the heavier tool. It is useful for JavaScript-rendered pages and dynamic link discovery, but it has more moving parts than sitemap parsing or direct HTTP fetching.