API Reference¶
This page summarizes the public objects exported from onecrawler. The guide pages explain when and why to use them; this page is for quick lookup.
Public imports
User-facing code should prefer from onecrawler import .... Internal classes such as runtime helpers should be imported from their concrete modules only when you are extending OneCrawler itself.
from onecrawler import (
BrowserSettings,
ContextSettings,
CrawlerSettings,
GenerativeAISettings,
HumanBehaviorSettings,
LinkClassifierPipeline,
LinkExtractionEngine,
Pipeline,
ProxySettings,
ScraperEngine,
SiteMap,
SitemapStats,
UniversalSiteMap,
)
CrawlerSettings¶
Central settings for sitemap discovery, link extraction, and scraping.
Important fields:
| Field | Purpose |
|---|---|
| link_extraction_strategy | deep or shallow browser discovery |
| link_extraction_limit | Maximum number of URLs returned |
| include_link_patterns | URL path allow-list |
| scraping_strategy | heuristic or genai |
| scraping_output_format | Output format for scraper results |
| concurrency | Async worker count |
| request_timeout | Timeout in seconds |
| max_retries | Retry attempts |
| proxy | Single package-level proxy |
| proxies | Rotating proxy pool |
| proxy_rotation_method | round_robin or random |
| browser_settings | Playwright launch and context settings |
| genai | GenAI provider, model, key, and optional schema |
settings = CrawlerSettings(
link_extraction_limit=200,
include_link_patterns=["/news/*"],
concurrency=8,
)
UniversalSiteMap¶
High-level sitemap resolver. It checks robots.txt, common sitemap paths, nested sitemap indexes, compressed XML, and feeds, with an optional HTML fallback.
Returns a list of URL strings.
Use this before browser crawling whenever possible.
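A minimal sketch, assuming UniversalSiteMap follows the same async-context-manager run() pattern as the engines below; the constructor arguments and method name are assumptions, so check the guide pages for the exact entry point:
settings = CrawlerSettings(link_extraction_limit=500)

# Assumed pattern: async context manager plus run(); verify against the guides.
async with UniversalSiteMap(settings) as sitemap:
    urls = await sitemap.run("https://example.com")  # list of URL strings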
Sitemaps are the cheapest discovery path
UniversalSiteMap avoids opening browser pages for discovery. Use it first for public sites, then fall back to browser extraction only when coverage is missing.
SiteMap¶
Lower-level sitemap parser that fetches and parses a direct sitemap URL. Most users should prefer UniversalSiteMap, which includes discovery and fallback behavior.
SitemapStats¶
Statistics object used by sitemap parsing. It tracks discovered URL count, parsed sitemap count, error count, elapsed time, and URL rate.
LinkExtractionEngine¶
Async browser engine for extracting links from a starting URL.
async with LinkExtractionEngine(settings) as engine:
links = await engine.run("https://example.com/docs")
Returns a list of URL strings. The engine owns its browser lifecycle inside the async context manager.
Scope browser crawling
Use link_extraction_limit and include_link_patterns with browser crawling, especially when link_extraction_strategy="deep".
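For example, a scoped deep crawl keeps discovery bounded (the limit, pattern, and literal strategy string are illustrative):
# Cap the number of discovered URLs and restrict them to a path prefix.
settings = CrawlerSettings(
    link_extraction_strategy="deep",
    link_extraction_limit=300,
    include_link_patterns=["/docs/*"],
)

async with LinkExtractionEngine(settings) as engine:
    links = await engine.run("https://example.com/docs")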
ScraperEngine¶
Async scraping engine for one URL or a list of URLs.
async with ScraperEngine(settings) as scraper:
item = await scraper.run("https://example.com/story")
async with ScraperEngine(settings) as scraper:
items = await scraper.run([
"https://example.com/story-1",
"https://example.com/story-2",
])
For a single URL, returns one result or None. For a list, returns a list of successful results.
List results omit failures
When scraping a list, failed or empty extractions are filtered out. Keep your original URL list if you need to reconcile successes and failures.
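To reconcile, diff the input list against the returned results. The sketch below assumes each result is a dictionary that records its source URL under a "url" key; adjust it to the actual result shape:
urls = [
    "https://example.com/story-1",
    "https://example.com/story-2",
]

async with ScraperEngine(settings) as scraper:
    items = await scraper.run(urls)

# The "url" key is an assumption about the result shape.
scraped = {item["url"] for item in items}
failed = [u for u in urls if u not in scraped]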
GenerativeAISettings¶
Settings for model-assisted extraction. Required when scraping_strategy="genai".
settings = GenerativeAISettings(
provider="openai", # Options: "openai", "google", "ollama"
model_name="gpt-4o-mini",
api_key="YOUR_API_KEY", # Required for OpenAI/Google, optional for Ollama
output_schema=MyPydanticModel, # Pydantic model for structured output
base_url=None, # Optional: custom endpoint (e.g., Ollama instance)
reasoning=False, # Optional: enable reasoning for supported models
)
Fields:
| Field | Type | Required | Purpose |
|---|---|---|---|
| provider | str | Yes | Model provider: "openai", "google", or "ollama" |
| model_name | str | Yes | Model identifier (e.g., "gpt-4o-mini", "llama3:8b") |
| api_key | str | Conditional | API key for OpenAI/Google; optional for Ollama |
| output_schema | BaseModel | Conditional | Pydantic model for structured output |
| base_url | str | Conditional | Custom endpoint URL (required for Ollama) |
| reasoning | bool | No | Enable reasoning for supported models |
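To use these settings end to end, point scraping_strategy at genai and pass them through the genai field on CrawlerSettings; the schema model below is illustrative:
from pydantic import BaseModel

class Article(BaseModel):  # illustrative structured-output schema
    title: str
    body: str

settings = CrawlerSettings(
    scraping_strategy="genai",
    genai=GenerativeAISettings(
        provider="openai",
        model_name="gpt-4o-mini",
        api_key="YOUR_API_KEY",
        output_schema=Article,
    ),
)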
Provider-Specific Requirements¶
OpenAI¶
- api_key required
- Supports GPT models (gpt-3.5-turbo, gpt-4, gpt-4o, etc.)
- No base_url needed (uses default OpenAI endpoint)
Google¶
- api_key required
- Supports Gemini models (gemini-pro, gemini-1.5-pro, etc.)
- No base_url needed (uses default Google endpoint)
Ollama¶
- base_url required (e.g., "http://localhost:11434/")
- api_key optional
- Supports local models (llama3, mistral, codellama, etc.)
- Must have Ollama server running with the specified model
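A local Ollama configuration might look like the following (the model name is illustrative):
genai = GenerativeAISettings(
    provider="ollama",
    model_name="llama3:8b",              # must already be available on the Ollama server
    base_url="http://localhost:11434/",
    output_schema=MyPydanticModel,       # optional structured-output schema
)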
Model names change over time
Check your provider's current model list before publishing examples or running scheduled jobs. Keep model identifiers configurable in production.
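One way to keep the identifier configurable is to read it from the environment (the variable names here are illustrative):
import os

genai = GenerativeAISettings(
    provider="openai",
    model_name=os.environ.get("ONECRAWLER_MODEL_NAME", "gpt-4o-mini"),
    api_key=os.environ["OPENAI_API_KEY"],
)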
BrowserSettings¶
Top-level browser settings. It contains launch, context, runtime, and proxy settings.
settings = CrawlerSettings(
browser_settings=BrowserSettings(
context=ContextSettings(viewport={"width": 1366, "height": 768})
)
)
Use browser settings for custom viewport, user agent, proxy, locale, timezone, storage state, HTTPS behavior, and Playwright runtime timeouts.
ProxySettings¶
Proxy settings for browser and sitemap workflows.
settings = CrawlerSettings(
proxies=[
ProxySettings(server="http://proxy-1.example:8080"),
ProxySettings(
server="http://proxy-2.example:8080",
username="user",
password="pass",
),
],
proxy_rotation_method="round_robin",
)
Use proxy=ProxySettings(...) for one proxy. Use proxies=[...] for a rotating pool. Supported rotation strategies are round_robin and random.
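For a single proxy, the equivalent configuration is:
settings = CrawlerSettings(
    proxy=ProxySettings(server="http://proxy.example:8080"),
)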
HumanBehaviorSettings¶
Delay, scroll, and mouse movement settings for optional browser behavior simulation.
settings = CrawlerSettings(
enable_human_behaviors=True,
human_behavior_settings=HumanBehaviorSettings(max_scrolls=20),
)
This affects deep browser link extraction. It is useful for lazy-loaded links but reduces throughput.
Use only where needed
Human behavior simulation is helpful for lazy-loaded pages, but it should not be a default for every crawl.
Pipeline¶
A comprehensive web crawling pipeline that orchestrates browser automation, link extraction, and content scraping in a single unified workflow.
Proxy configuration is required for production
Pipeline performs browser discovery and content extraction together. Use explicit proxy settings and conservative concurrency for production runs.
# Basic usage
settings = CrawlerSettings(
link_extraction_limit=100,
concurrency=5,
proxies=[ProxySettings(server="http://proxy.example.com:8080")]
)
async with Pipeline(settings) as engine:
results = await engine.run("https://example.com")
# With date filtering
async with Pipeline(settings, start_date="2024-01-01", end_date="2024-12-31") as engine:
results = await engine.run("https://example.com")
Returns a list of content dictionaries with extracted data from discovered pages.
Key Features¶
- Orchestrated Workflow: Combines link discovery, browser automation, and content extraction
- Date Filtering: Filter content by publication date range
- Human Behavior Simulation: Optional realistic browsing patterns
- Proxy Support: Built-in proxy rotation for production crawling
- Concurrent Processing: Configurable worker pool for efficient crawling
Constructor Parameters¶
| Parameter | Type | Required | Default | Purpose |
|---|---|---|---|---|
| settings | CrawlerSettings | Yes | - | Configuration for all crawling components |
| start_date | str | No | None | Filter content from this date (YYYY-MM-DD) |
| end_date | str | No | None | Filter content until this date (YYYY-MM-DD) |
Proxy Configuration¶
Required for production use:
settings = CrawlerSettings(
proxies=[
ProxySettings(server="http://proxy1.example.com:8080"),
ProxySettings(server="http://proxy2.example.com:8080"),
],
proxy_rotation_method="round_robin",
)
Without proper proxy configuration, your crawler may be blocked by target websites.
Date filtering depends on extracted metadata
start_date and end_date work when extracted content includes a filedate or date field in YYYY-MM-DD format.
Usage Patterns¶
Simple crawling:
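async with Pipeline(settings) as engine:
    content = await engine.run("https://example.com")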
With date filtering:
async with Pipeline(settings,
start_date="2024-01-01",
end_date="2024-06-30") as engine:
content = await engine.run("https://example.com/news")
Manual lifecycle:
engine = Pipeline(settings)
await engine.start()
try:
content = await engine.run("https://example.com")
finally:
await engine.close()
LinkClassifierPipeline¶
Publicly exported link classifier pipeline used by shallow extraction when link_classification=True. Most users should start with include_link_patterns because path filters are explicit, easy to debug, and deterministic.
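If you do enable it, a minimal sketch, assuming link_classification is a flag on CrawlerSettings as referenced above:
settings = CrawlerSettings(
    link_extraction_strategy="shallow",
    link_classification=True,  # assumed CrawlerSettings flag; see the note above
    include_link_patterns=["/news/*"],
)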