Quick Start¶
This guide shows the fastest useful paths through OneCrawler. The examples are small enough to paste into a script, but they use the same structure you would keep in a scheduled production job.
Install¶
Install the Playwright browser only when you plan to use browser-backed link extraction or browser-backed scraping.
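A typical install sketch, assuming the package is published on PyPI under the name used in the imports below (`onecrawler`):

```shell
# Install the library (package name assumed from the imports on this page)
pip install onecrawler

# Optional: install the Playwright Chromium browser, only if you will use
# browser-backed link extraction or browser-backed scraping
playwright install chromium
```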
Start with the lightest workflow
If a site has a sitemap, use UniversalSiteMap before opening browser pages. Browser workflows are more flexible, but they cost more time, memory, and operational care.
Best First Workflow: Sitemap Then Scrape¶
Use this pattern when the target site publishes a sitemap. It is the preferred starting point for most crawls because it avoids unnecessary browser navigation.
import json
import asyncio

from onecrawler import CrawlerSettings, ScraperEngine, UniversalSiteMap


async def main():
    settings = CrawlerSettings(
        link_extraction_limit=100,
        include_link_patterns=["/articles/*"],
        scraping_strategy="heuristic",
        scraping_output_format="json",
        concurrency=8,
        request_timeout=15,
        max_retries=3,
    )

    sitemap = UniversalSiteMap(settings)
    urls = await sitemap.run("https://example.com")

    async with ScraperEngine(settings) as scraper:
        records = await scraper.run(urls)

    with open("articles.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    asyncio.run(main())
This workflow is ideal for news sections, blogs, documentation sites, catalogs, and any site that exposes stable URL metadata.
Save discovered URLs
For production jobs, persist the URL list before scraping. It makes retries much easier because failed extraction can be rerun without repeating discovery.
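A minimal sketch of that persistence step using only the standard library (the file name and helper names are illustrative, not part of OneCrawler):

```python
import json
from pathlib import Path


def save_urls(urls: list[str], path: str = "discovered_urls.json") -> None:
    """Persist the discovered URL list so scraping can be retried later."""
    Path(path).write_text(json.dumps(urls, indent=2), encoding="utf-8")


def load_urls(path: str = "discovered_urls.json") -> list[str]:
    """Reload a previously saved URL list without repeating discovery."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Discovery runs once; a failed scrape can be rerun from the saved file.
save_urls(["https://example.com/articles/a", "https://example.com/articles/b"])
print(load_urls())
```

Run discovery, save, then feed `load_urls()` into the scraper on every retry.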
Browser Discovery Workflow¶
Use browser-based link extraction when the site does not publish useful sitemaps, or when links are generated by JavaScript.
import asyncio

from onecrawler import CrawlerSettings, LinkExtractionEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="deep",
        link_extraction_limit=250,
        include_link_patterns=["/news/*"],
        concurrency=5,
    )

    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/news")

    print(f"Collected {len(links)} links")


if __name__ == "__main__":
    asyncio.run(main())
Set link_extraction_strategy="shallow" when the start page is a listing page and you only need its direct links; set it to "deep" when you need recursive traversal within the same site.
Keep deep crawls scoped
Always combine deep crawling with link_extraction_limit and include_link_patterns. A broad start page can quickly discover login, search, tag, archive, and policy URLs you did not intend to scrape.
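To see why those two settings matter together, here is a plain-Python illustration of the scoping they provide. This mimics glob-style include patterns and a hard limit; it is a sketch of the behavior, not OneCrawler's internal implementation:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse


def scope_links(links, include_patterns, limit):
    """Keep only links whose path matches an include pattern, capped at limit."""
    kept = []
    for url in links:
        path = urlparse(url).path
        if any(fnmatch(path, pattern) for pattern in include_patterns):
            kept.append(url)
        if len(kept) >= limit:
            break
    return kept


discovered = [
    "https://example.com/news/budget-vote",
    "https://example.com/login",
    "https://example.com/tag/politics",
    "https://example.com/news/storm-warning",
]

# The login and tag URLs never reach the scraper.
print(scope_links(discovered, ["/news/*"], limit=250))
```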
Structured GenAI Workflow¶
Use GenAI extraction when plain article text is not enough and your application needs fields in a predefined shape.
import asyncio
from typing import Optional

from pydantic import BaseModel

from onecrawler import CrawlerSettings, GenerativeAISettings, ScraperEngine


class ArticleSummary(BaseModel):
    title: str
    author: Optional[str] = None
    published_at: Optional[str] = None
    summary: str
    topics: list[str]


async def main():
    settings = CrawlerSettings(
        scraping_strategy="genai",
        scraping_output_format="json",
        genai=GenerativeAISettings(
            provider="openai",
            model_name="gpt-4o-mini",
            api_key="YOUR_API_KEY",
            output_schema=ArticleSummary,
        ),
        concurrency=3,
        request_timeout=30,
    )

    async with ScraperEngine(settings) as scraper:
        result = await scraper.run("https://example.com/articles/story")

    print(result)


if __name__ == "__main__":
    asyncio.run(main())
GenAI extraction should be reserved for semantic or typed output needs. For high volume article text extraction, start with the heuristic strategy and add GenAI only for pages that require interpretation or normalization.
Use GenAI selectively
A good pattern is heuristic extraction first, then GenAI only for records that need classification, summarization, field normalization, or schema-shaped output.
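One way to express that two-pass triage in plain Python. The record shape and completeness check are assumptions for illustration; swap in whatever criteria define "needs interpretation" for your data:

```python
def needs_genai(record: dict) -> bool:
    """Heuristic-first triage: escalate a record to GenAI only when the cheap
    extraction left required fields empty."""
    required = ("title", "author", "published_at")
    missing = [field for field in required if not record.get(field)]
    return bool(missing)


records = [
    {"title": "Budget passes", "author": "A. Reporter", "published_at": "2024-03-01"},
    {"title": "Storm warning", "author": None, "published_at": None},
]

# Only the incomplete record is sent to the expensive GenAI pass.
genai_queue = [r for r in records if needs_genai(r)]
print(len(genai_queue))
```

This keeps model cost proportional to the hard cases rather than to total crawl volume.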
All-in-One Workflow: Pipeline¶
Use Pipeline when you want a single, orchestrated workflow that combines link discovery, browser automation, and content extraction.
Use proxies for production pipeline crawls
Pipeline opens browser pages and extracts content as it discovers links. For production crawls, configure a single proxy or proxy pool and keep concurrency conservative.
import json
import asyncio

from onecrawler import Pipeline, CrawlerSettings, ProxySettings


async def main():
    settings = CrawlerSettings(
        link_extraction_limit=50,
        include_link_patterns=["/news/*"],
        concurrency=5,
        # Required for production
        proxies=[
            ProxySettings(server="http://proxy1.example.com:8080"),
            ProxySettings(server="http://proxy2.example.com:8080"),
        ],
        proxy_rotation_method="round_robin",
    )

    async with Pipeline(settings) as engine:
        results = await engine.run("https://example.com/news")

    with open("news_articles.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
With date filtering:
async with Pipeline(
    settings,
    start_date="2024-01-01",
    end_date="2024-12-31",
) as engine:
    results = await engine.run("https://example.com/news")
For JavaScript-heavy sites:
from onecrawler import HumanBehaviorSettings

settings = CrawlerSettings(
    link_extraction_limit=30,
    concurrency=3,
    enable_human_behaviors=True,
    human_behavior_settings=HumanBehaviorSettings(
        max_scrolls=3,
        min_delay=1.0,
        max_delay=2.0,
    ),
    proxies=[...],  # Always required for production
)

async with Pipeline(settings) as engine:
    results = await engine.run("https://spa-example.com")
Practical Defaults¶
Start conservative:
- concurrency=5 to 10 for discovery
- concurrency=2 to 5 for GenAI extraction
- request_timeout=10 to 30 depending on target speed
- max_retries=2 or 3 for unstable sites
- include_link_patterns on every broad crawl
Then increase throughput only after you know the target site responds reliably.
High concurrency can reduce success
More workers are not always faster. Browser pages, proxy limits, server rate limits, and model provider limits can all turn high concurrency into retries and empty results.
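The "start conservative" advice comes down to bounding in-flight work. A standard-library sketch of that bound, independent of OneCrawler (the sleep stands in for a real request):

```python
import asyncio


async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    # The semaphore caps how many fetches run at once,
    # no matter how many URLs are queued.
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for a real network request
        return url


async def run_all(urls, concurrency=5):
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(fetch(u, sem) for u in urls))


results = asyncio.run(run_all([f"https://example.com/p/{i}" for i in range(20)]))
print(len(results))
```

Raising `concurrency` here only helps until the target (or proxy, or model provider) starts rejecting requests; the cap is what keeps failures from compounding.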