API Reference¶
This page summarizes the public objects exported from onecrawler. The guide pages explain when and why to use them; this page is for quick lookup.
Public imports
User-facing code should prefer from onecrawler import .... Internal classes such as runtime helpers should be imported from their concrete modules only when you are extending OneCrawler itself.
from onecrawler import (
BrowserSettings,
ContextSettings,
CrawlerSettings,
GenerativeAISettings,
HumanBehaviorSettings,
LinkClassifierPipeline,
LinkExtractionEngine,
Pipeline,
ProxySettings,
ScraperEngine,
SiteMap,
SitemapStats,
UniversalSiteMap,
)
CrawlerSettings¶
Central settings for sitemap discovery, link extraction, and scraping.
Important fields:
| Field | Purpose |
|---|---|
| link_extraction_strategy | deep or shallow browser discovery |
| link_extraction_limit | Maximum number of URLs returned |
| include_link_patterns | URL path allow-list |
| scraping_strategy | heuristic or genai |
| scraping_output_format | Output format for scraper results |
| concurrency | Async worker count |
| request_timeout | Timeout in seconds |
| max_retries | Retry attempts |
| proxy | Single package-level proxy |
| proxies | Rotating proxy pool |
| proxy_rotation_method | round_robin or random |
| browser_settings | Playwright launch and context settings |
| genai | GenAI provider, model, key, and optional schema |
settings = CrawlerSettings(
link_extraction_limit=200,
include_link_patterns=["/news/*"],
concurrency=8,
)
UniversalSiteMap¶
High-level sitemap resolver. It checks robots.txt, common sitemap paths, nested sitemap indexes, compressed XML, and feeds, with an optional HTML fallback.
Returns a list of URL strings.
Use this before browser crawling whenever possible.
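A minimal sketch, assuming UniversalSiteMap follows the same async-context-manager run() pattern as the engines below; the constructor arguments and method name are assumptions, so check the guide pages for the exact entry point:
settings = CrawlerSettings(link_extraction_limit=500)

# Assumed pattern: async context manager plus run(); verify against the guides.
async with UniversalSiteMap(settings) as sitemap:
    urls = await sitemap.run("https://example.com")  # list of URL strings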
Sitemaps are the cheapest discovery path
UniversalSiteMap avoids opening browser pages for discovery. Use it first for public sites, then fall back to browser extraction only when coverage is missing.
SiteMap¶
Lower-level sitemap parser that fetches and parses a direct sitemap URL. Most users should prefer UniversalSiteMap, which includes discovery and fallback behavior.
SitemapStats¶
Statistics object used by sitemap parsing. It tracks discovered URL count, parsed sitemap count, error count, elapsed time, and URL rate.
LinkExtractionEngine¶
Async browser engine for extracting links from a starting URL.
async with LinkExtractionEngine(settings) as engine:
links = await engine.run("https://example.com/docs")
Returns a list of URL strings. The engine owns its browser lifecycle inside the async context manager.
Scope browser crawling
Use link_extraction_limit and include_link_patterns with browser crawling, especially when link_extraction_strategy="deep".
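For example, a scoped deep crawl keeps discovery bounded (the limit, pattern, and literal strategy string are illustrative):
# Cap the number of discovered URLs and restrict them to a path prefix.
settings = CrawlerSettings(
    link_extraction_strategy="deep",
    link_extraction_limit=300,
    include_link_patterns=["/docs/*"],
)

async with LinkExtractionEngine(settings) as engine:
    links = await engine.run("https://example.com/docs")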
ScraperEngine¶
Async scraping engine for one URL or a list of URLs.
async with ScraperEngine(settings) as scraper:
item = await scraper.run("https://example.com/story")
async with ScraperEngine(settings) as scraper:
items = await scraper.run([
"https://example.com/story-1",
"https://example.com/story-2",
])
For a single URL, returns one result or None. For a list, returns a list of successful results.
List results omit failures
When scraping a list, failed or empty extractions are filtered out. Keep your original URL list if you need to reconcile successes and failures.
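To reconcile, diff the input list against the returned results. The sketch below assumes each result is a dictionary that records its source URL under a "url" key; adjust it to the actual result shape:
urls = [
    "https://example.com/story-1",
    "https://example.com/story-2",
]

async with ScraperEngine(settings) as scraper:
    items = await scraper.run(urls)

# The "url" key is an assumption about the result shape.
scraped = {item["url"] for item in items}
failed = [u for u in urls if u not in scraped]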
GenerativeAISettings¶
Settings for model-assisted extraction. Required when scraping_strategy="genai".
settings = GenerativeAISettings(
provider="openai", # Options: "openai", "google", "ollama"
model_name="gpt-4o-mini",
api_key="YOUR_API_KEY", # Required for OpenAI/Google, optional for Ollama
output_schema=MyPydanticModel, # Pydantic model for structured output
base_url=None, # Optional: custom endpoint (e.g., Ollama instance)
reasoning=False, # Optional: enable reasoning for supported models
)
Fields:
| Field | Type | Required | Purpose |
|---|---|---|---|
| provider | str | Yes | Model provider: "openai", "google", or "ollama" |
| model_name | str | Yes | Model identifier (e.g., "gpt-4o-mini", "llama3:8b") |
| api_key | str | Conditional | API key for OpenAI/Google; optional for Ollama |
| output_schema | BaseModel | Conditional | Pydantic model for structured output |
| base_url | str | Conditional | Custom endpoint URL (required for Ollama) |
| reasoning | bool | No | Enable reasoning for supported models |
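To use these settings end to end, point scraping_strategy at genai and pass them through the genai field on CrawlerSettings; the schema model below is illustrative:
from pydantic import BaseModel

class Article(BaseModel):  # illustrative structured-output schema
    title: str
    body: str

settings = CrawlerSettings(
    scraping_strategy="genai",
    genai=GenerativeAISettings(
        provider="openai",
        model_name="gpt-4o-mini",
        api_key="YOUR_API_KEY",
        output_schema=Article,
    ),
)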
Provider-Specific Requirements¶
OpenAI¶
- api_key required
- Supports GPT models (gpt-3.5-turbo, gpt-4, gpt-4o, etc.)
- No base_url needed (uses default OpenAI endpoint)
Google¶
- api_key required
- Supports Gemini models (gemini-pro, gemini-1.5-pro, etc.)
- No base_url needed (uses default Google endpoint)
Ollama¶
- base_url required (e.g., "http://localhost:11434/")
- api_key optional
- Supports local models (llama3, mistral, codellama, etc.)
- Must have Ollama server running with the specified model
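A local Ollama configuration might look like the following (the model name is illustrative):
genai = GenerativeAISettings(
    provider="ollama",
    model_name="llama3:8b",              # must already be available on the Ollama server
    base_url="http://localhost:11434/",
    output_schema=MyPydanticModel,       # optional structured-output schema
)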
Model names change over time
Check your provider's current model list before publishing examples or running scheduled jobs. Keep model identifiers configurable in production.
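One way to keep the identifier configurable is to read it from the environment (the variable names here are illustrative):
import os

genai = GenerativeAISettings(
    provider="openai",
    model_name=os.environ.get("ONECRAWLER_MODEL_NAME", "gpt-4o-mini"),
    api_key=os.environ["OPENAI_API_KEY"],
)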
BrowserSettings¶
Top-level browser settings. It contains launch, context, runtime, and proxy settings.
settings = CrawlerSettings(
browser_settings=BrowserSettings(
context=ContextSettings(viewport={"width": 1366, "height": 768})
)
)
Use browser settings for custom viewport, user agent, proxy, locale, timezone, storage state, HTTPS behavior, and Playwright runtime timeouts.
ProxySettings¶
Proxy settings for browser and sitemap workflows.
settings = CrawlerSettings(
proxies=[
ProxySettings(server="http://proxy-1.example:8080"),
ProxySettings(
server="http://proxy-2.example:8080",
username="user",
password="pass",
),
],
proxy_rotation_method="round_robin",
)
Use proxy=ProxySettings(...) for one proxy. Use proxies=[...] for a rotating pool. Supported rotation strategies are round_robin and random.
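For a single proxy, the equivalent configuration is:
settings = CrawlerSettings(
    proxy=ProxySettings(server="http://proxy.example:8080"),
)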
HumanBehaviorSettings¶
Delay, scroll, and mouse movement settings for optional browser behavior simulation.
settings = CrawlerSettings(
enable_human_behaviors=True,
human_behavior_settings=HumanBehaviorSettings(max_scrolls=20),
)
This affects deep browser link extraction. It is useful for lazy-loaded links but reduces throughput.
Use only where needed
Human behavior simulation is helpful for lazy-loaded pages, but it should not be a default for every crawl.
Pipeline¶
A comprehensive web crawling pipeline that orchestrates browser automation, link extraction, and content scraping in a single unified workflow.
Proxy configuration is required for production
Pipeline performs browser discovery and content extraction together. Use explicit proxy settings and conservative concurrency for production runs.
# Basic usage
settings = CrawlerSettings(
link_extraction_limit=100,
concurrency=5,
proxies=[ProxySettings(server="http://proxy.example.com:8080")]
)
async with Pipeline(settings) as engine:
results = await engine.run("https://example.com")
# With date filtering
async with Pipeline(settings, start_date="2024-01-01", end_date="2024-12-31") as engine:
results = await engine.run("https://example.com")
Returns a list of content dictionaries with extracted data from discovered pages.
Key Features¶
- Orchestrated Workflow: Combines link discovery, browser automation, and content extraction
- Date Filtering: Filter content by publication date range
- Human Behavior Simulation: Optional realistic browsing patterns
- Proxy Support: Built-in proxy rotation for production crawling
- Concurrent Processing: Configurable worker pool for efficient crawling
Constructor Parameters¶
| Parameter | Type | Required | Default | Purpose |
|---|---|---|---|---|
| settings | CrawlerSettings | Yes | - | Configuration for all crawling components |
| start_date | str | No | None | Filter content from this date (YYYY-MM-DD) |
| end_date | str | No | None | Filter content until this date (YYYY-MM-DD) |
Proxy Configuration¶
Required for production use:
settings = CrawlerSettings(
proxies=[
ProxySettings(server="http://proxy1.example.com:8080"),
ProxySettings(server="http://proxy2.example.com:8080"),
],
proxy_rotation_method="round_robin",
)
Without proper proxy configuration, your crawler may be blocked by target websites.
Date filtering depends on extracted metadata
start_date and end_date work when extracted content includes a filedate or date field in YYYY-MM-DD format.
Usage Patterns¶
Simple crawling:
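async with Pipeline(settings) as engine:
    content = await engine.run("https://example.com")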
With date filtering:
async with Pipeline(settings,
start_date="2024-01-01",
end_date="2024-06-30") as engine:
content = await engine.run("https://example.com/news")
Manual lifecycle:
engine = Pipeline(settings)
await engine.start()
try:
content = await engine.run("https://example.com")
finally:
await engine.close()
LinkClassifierPipeline¶
Publicly exported link classifier pipeline used by shallow extraction when link_classification=True. Most users should start with include_link_patterns because path filters are explicit, easy to debug, and deterministic.
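If you do enable it, a minimal sketch, assuming link_classification is a flag on CrawlerSettings as referenced above:
settings = CrawlerSettings(
    link_extraction_strategy="shallow",
    link_classification=True,  # assumed CrawlerSettings flag; see the note above
    include_link_patterns=["/news/*"],
)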