# OneCrawler
OneCrawler is an async Python crawling framework for discovering URLs, extracting links, and scraping structured content.
The framework is designed around a practical crawler workflow:
- Discover candidate URLs from sitemaps or browser-based link traversal.
- Filter and limit those URLs to the section you actually care about.
- Scrape the selected pages with either heuristic extraction or structured GenAI extraction.
- Save the result in the format your application needs.
**Recommended first path**

Start with sitemap discovery, store the URL list, then scrape those URLs in a separate step. This gives you cleaner retries and easier debugging.
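A minimal sketch of that split, assuming `UniversalSiteMap.run()` returns a JSON-serializable list of URL strings (the full discovery example appears under Recommended Workflow below):

```python
import asyncio
import json

from onecrawler import CrawlerSettings, UniversalSiteMap


async def discover_and_store() -> None:
    # Keep discovery bounded; the values here are illustrative.
    settings = CrawlerSettings(link_extraction_limit=500, concurrency=10)
    urls = await UniversalSiteMap(settings).run("https://example.com")

    # Store the URL list so scraping can run, and be retried, as a separate step.
    with open("urls.json", "w", encoding="utf-8") as fh:
        json.dump(urls, fh, indent=2)


if __name__ == "__main__":
    asyncio.run(discover_and_store())
```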
## Why OneCrawler Exists
Many crawling projects start as small scripts and become hard to maintain once they need retries, concurrency, browser rendering, sitemap fallback, URL filtering, and structured extraction. OneCrawler gives those concerns a shared settings model and a few focused async engines so production scripts stay readable.
**Crawling needs boundaries**

Always set reasonable limits, filters, retries, and concurrency. Clear crawl boundaries protect both your job and the target site.
Use it when you are building:
- content ingestion pipelines for news, blogs, catalogs, or documentation sites
- search indexing jobs
- data collection scripts for analysis or monitoring
- internal tools that need repeatable URL discovery and content extraction
- prototypes that may later become scheduled crawler jobs
**Use GenAI for semantics, not bulk text**

Heuristic extraction is the better default for large text extraction jobs. Use GenAI when you need summaries, classification, field normalization, or a typed schema.
## Recommended Workflow
Start with sitemap discovery whenever possible. A sitemap is usually the fastest, cleanest, and least expensive way to collect URLs because it avoids opening many browser pages just to discover links.
```python
import asyncio

from onecrawler import CrawlerSettings, UniversalSiteMap


async def main():
    settings = CrawlerSettings(
        link_extraction_limit=500,
        include_link_patterns=["/news/*"],
        concurrency=10,
    )
    sitemap = UniversalSiteMap(settings)
    urls = await sitemap.run("https://example.com")
    print(f"Found {len(urls)} URLs")


if __name__ == "__main__":
    asyncio.run(main())
```
If the site has no useful sitemap, use `LinkExtractionEngine`:

- `shallow` for links present on one page
- `deep` for recursive same-site traversal

Then pass the final URL list to `ScraperEngine`.
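The sketch below shows that fallback path, assuming `LinkExtractionEngine` and `ScraperEngine` follow the same settings-plus-`run()` pattern as `UniversalSiteMap`; the `mode` argument and the `run(urls)` call are assumptions, so check the Link Extraction and Scraping Engine pages for the exact signatures.

```python
import asyncio

from onecrawler import CrawlerSettings, LinkExtractionEngine, ScraperEngine


async def main() -> None:
    settings = CrawlerSettings(
        link_extraction_limit=200,
        include_link_patterns=["/news/*"],
        concurrency=5,
    )

    # Fallback discovery when no sitemap is available.
    # NOTE: mode="deep" and this run() signature are assumptions, not the
    # documented API; see the Link Extraction page for the exact interface.
    extractor = LinkExtractionEngine(settings)
    urls = await extractor.run("https://example.com/news/", mode="deep")

    # Hand the final URL list to the scraping engine (heuristic extraction
    # by default). The run(urls) signature is likewise an assumption.
    scraper = ScraperEngine(settings)
    pages = await scraper.run(urls)
    print(f"Scraped {len(pages)} pages")


if __name__ == "__main__":
    asyncio.run(main())
```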
## Choosing The Right Tool
| Goal | Recommended feature | Why |
|---|---|---|
| Collect most public URLs quickly | UniversalSiteMap | Uses robots.txt, common sitemap paths, nested sitemap indexes, and optional HTML fallback |
| Inspect one listing page | shallow link extraction | Lower crawl cost and easier to reason about |
| Explore a site section recursively | deep link extraction | Follows internal links until your settings limit is reached |
| Extract readable article text | heuristic scraping | Fast, deterministic, and does not require model calls |
| Produce strongly typed output | GenAI scraping with a Pydantic schema | Best fit when downstream systems require a stable structured shape |
| Avoid noisy crawls | include_link_patterns | Keeps discovery focused on URL paths you trust |
## Documentation Map
- Installation: package setup, browser requirements, optional extras
- Quick start: first complete discovery and scraping workflows
- Configuration: crawler settings and configuration
- Link Extraction: `LinkExtractionEngine` and link discovery
- Sitemap Discovery: `UniversalSiteMap` and URL collection
- Scraping Engine: `ScraperEngine` and content extraction
- Settings Configuration: `CrawlerSettings` and configuration classes
- API reference: public classes exported from `onecrawler`
- Troubleshooting: common failures and fixes
## Production Principles
Prefer sitemaps before crawling pages. They are faster, friendlier to target sites, and usually more complete than what a browser can discover from navigation pages.
Constrain every job. Set link_extraction_limit, include_link_patterns, concurrency, request_timeout, and max_retries explicitly so a crawl behaves predictably when the target site changes.
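A fully constrained configuration might look like the sketch below; the specific values are illustrative, not library defaults, and the timeout unit is an assumption.

```python
from onecrawler import CrawlerSettings

# Every bound set explicitly; values are illustrative, not defaults.
settings = CrawlerSettings(
    link_extraction_limit=1000,          # hard cap on discovered URLs
    include_link_patterns=["/docs/*"],   # only follow URL paths you trust
    concurrency=5,                       # simultaneous requests/pages
    request_timeout=30,                  # time before a fetch is abandoned (unit assumed to be seconds)
    max_retries=2,                       # retries per failed request
)
```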
Use heuristic scraping by default. It is cheaper and more repeatable. Move to GenAI when you need semantic interpretation, field normalization, or structured output in a predefined Pydantic schema.
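When you do take the GenAI path, the target shape is a Pydantic model. The sketch below shows only the schema side; the field names are illustrative, and how the model is handed to `ScraperEngine` is covered on the Scraping Engine page.

```python
from typing import Optional

from pydantic import BaseModel


class Article(BaseModel):
    """Illustrative target shape for GenAI extraction; field names are assumptions."""

    title: str
    author: Optional[str] = None
    published_date: Optional[str] = None
    summary: str
    topics: list[str] = []
```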
Treat browser crawling as the heavier tool. It is useful for JavaScript-rendered pages and dynamic link discovery, but it has more moving parts than sitemap parsing or direct HTTP fetching.