
Quick Start

This guide shows the fastest useful paths through OneCrawler. The examples are small enough to paste into a script, but they use the same structure you would keep in a scheduled production job.

Install

pip install onecrawler
python -m playwright install chromium

Install the Playwright browser only when you plan to use browser-backed link extraction or browser-backed scraping.

Start with the lightest workflow

If a site has a sitemap, use UniversalSiteMap before opening browser pages. Browser workflows are more flexible, but they cost more time, memory, and operational care.

Best First Workflow: Sitemap Then Scrape

Use this pattern when the target site publishes a sitemap. It is the preferred starting point for most crawls because it avoids unnecessary browser navigation.

import json
import asyncio
from onecrawler import CrawlerSettings, ScraperEngine, UniversalSiteMap


async def main():
    settings = CrawlerSettings(
        link_extraction_limit=100,
        include_link_patterns=["/articles/*"],
        scraping_strategy="heuristic",
        scraping_output_format="json",
        concurrency=8,
        request_timeout=15,
        max_retries=3,
    )

    sitemap = UniversalSiteMap(settings)
    urls = await sitemap.run("https://example.com")

    async with ScraperEngine(settings) as scraper:
        records = await scraper.run(urls)

    with open("articles.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    asyncio.run(main())

This workflow is ideal for news sections, blogs, documentation sites, catalogs, and any site that exposes stable URL metadata.

Save discovered URLs

For production jobs, persist the URL list before scraping. It makes retries much easier because failed extraction can be rerun without repeating discovery.
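A minimal sketch of that split, assuming the URL list returned by UniversalSiteMap is a plain list of strings (the file names and settings values here are illustrative):

import json
import asyncio
from onecrawler import CrawlerSettings, ScraperEngine, UniversalSiteMap


async def discover(settings: CrawlerSettings) -> None:
    # Discovery pass: collect URLs once and persist them to disk.
    sitemap = UniversalSiteMap(settings)
    urls = await sitemap.run("https://example.com")
    with open("urls.json", "w", encoding="utf-8") as f:
        json.dump(urls, f, indent=2)


async def scrape(settings: CrawlerSettings) -> None:
    # Extraction pass: reload the saved URLs, so a failed run can be
    # retried without repeating discovery.
    with open("urls.json", encoding="utf-8") as f:
        urls = json.load(f)
    async with ScraperEngine(settings) as scraper:
        records = await scraper.run(urls)
    with open("articles.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


async def main():
    settings = CrawlerSettings(
        include_link_patterns=["/articles/*"],
        scraping_strategy="heuristic",
        scraping_output_format="json",
    )
    await discover(settings)
    await scrape(settings)


if __name__ == "__main__":
    asyncio.run(main())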

Browser Discovery Workflow

Use browser-based link extraction when the site does not publish useful sitemaps, or when links are generated by JavaScript.

import asyncio
from onecrawler import CrawlerSettings, LinkExtractionEngine


async def main():
    settings = CrawlerSettings(
        link_extraction_strategy="deep",
        link_extraction_limit=250,
        include_link_patterns=["/news/*"],
        concurrency=5,
    )

    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/news")

    print(f"Collected {len(links)} links")


if __name__ == "__main__":
    asyncio.run(main())

Choose shallow when the start page is a listing page and you only need its direct links. Choose deep when you need recursive traversal within the same site.
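As a sketch, the mode is selected through link_extraction_strategy; the "shallow" value below is assumed to mirror the "deep" value used above:

from onecrawler import CrawlerSettings

# Listing page: only its direct links are needed.
shallow_settings = CrawlerSettings(
    link_extraction_strategy="shallow",
    link_extraction_limit=100,
    include_link_patterns=["/news/*"],
)

# Recursive traversal within the same site.
deep_settings = CrawlerSettings(
    link_extraction_strategy="deep",
    link_extraction_limit=250,
    include_link_patterns=["/news/*"],
    concurrency=5,
)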

Keep deep crawls scoped

Always combine deep crawling with link_extraction_limit and include_link_patterns. A broad start page can quickly discover login, search, tag, archive, and policy URLs you did not intend to scrape.

Structured GenAI Workflow

Use GenAI extraction when plain article text is not enough and your application needs fields in a predefined shape.

import asyncio
from typing import Optional
from pydantic import BaseModel
from onecrawler import CrawlerSettings, GenerativeAISettings, ScraperEngine


class ArticleSummary(BaseModel):
    title: str
    author: Optional[str] = None
    published_at: Optional[str] = None
    summary: str
    topics: list[str]


async def main():
    settings = CrawlerSettings(
        scraping_strategy="genai",
        scraping_output_format="json",
        genai=GenerativeAISettings(
            provider="openai",
            model_name="gpt-4o-mini",
            api_key="YOUR_API_KEY",
            output_schema=ArticleSummary,
        ),
        concurrency=3,
        request_timeout=30,
    )

    async with ScraperEngine(settings) as scraper:
        result = await scraper.run("https://example.com/articles/story")

    print(result)


if __name__ == "__main__":
    asyncio.run(main())

GenAI extraction should be reserved for semantic or typed output needs. For high-volume article text extraction, start with the heuristic strategy and add GenAI only for pages that require interpretation or normalization.

Use GenAI selectively

A good pattern is heuristic extraction first, then GenAI only for records that need classification, summarization, field normalization, or schema-shaped output.
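A rough sketch of that two-pass pattern, assuming heuristic records come back as dictionaries carrying a url field; needs_genai and the Normalized schema are placeholders for your own selection logic and output shape:

import asyncio
from pydantic import BaseModel
from onecrawler import CrawlerSettings, GenerativeAISettings, ScraperEngine


class Normalized(BaseModel):
    # Hypothetical schema for the records that need GenAI normalization.
    title: str
    summary: str


def needs_genai(record: dict) -> bool:
    # Placeholder: decide which records need classification,
    # summarization, or field normalization.
    return False


async def main():
    heuristic_settings = CrawlerSettings(
        scraping_strategy="heuristic",
        scraping_output_format="json",
        concurrency=8,
    )
    genai_settings = CrawlerSettings(
        scraping_strategy="genai",
        scraping_output_format="json",
        genai=GenerativeAISettings(
            provider="openai",
            model_name="gpt-4o-mini",
            api_key="YOUR_API_KEY",
            output_schema=Normalized,
        ),
        concurrency=3,
    )

    urls = ["https://example.com/articles/story"]

    # First pass: cheap heuristic extraction over every URL.
    async with ScraperEngine(heuristic_settings) as scraper:
        records = await scraper.run(urls)

    # Second pass: GenAI extraction only for the records that need it.
    retry_urls = [r["url"] for r in records if needs_genai(r)]
    if retry_urls:
        async with ScraperEngine(genai_settings) as scraper:
            enriched = await scraper.run(retry_urls)
            print(f"Re-processed {len(enriched)} records with GenAI")


if __name__ == "__main__":
    asyncio.run(main())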

All-in-One Workflow: Pipeline

Use Pipeline when you want a single, orchestrated workflow that combines link discovery, browser automation, and content extraction.

Use proxies for production pipeline crawls

Pipeline opens browser pages and extracts content as it discovers links. For production crawls, configure a single proxy or proxy pool and keep concurrency conservative.

import json
import asyncio
from onecrawler import Pipeline, CrawlerSettings, ProxySettings


async def main():
    settings = CrawlerSettings(
        link_extraction_limit=50,
        include_link_patterns=["/news/*"],
        concurrency=5,
        # Required for production
        proxies=[
            ProxySettings(server="http://proxy1.example.com:8080"),
            ProxySettings(server="http://proxy2.example.com:8080"),
        ],
        proxy_rotation_method="round_robin",
    )

    async with Pipeline(settings) as engine:
        results = await engine.run("https://example.com/news")

    with open("news_articles.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

With date filtering:

async with Pipeline(
    settings, 
    start_date="2024-01-01", 
    end_date="2024-12-31"
) as engine:
    results = await engine.run("https://example.com/news")

For JavaScript-heavy sites:

from onecrawler import HumanBehaviorSettings

settings = CrawlerSettings(
    link_extraction_limit=30,
    concurrency=3,
    enable_human_behaviors=True,
    human_behavior_settings=HumanBehaviorSettings(
        max_scrolls=3,
        min_delay=1.0,
        max_delay=2.0,
    ),
    proxies=[...],  # Always required for production
)

async with Pipeline(settings) as engine:
    results = await engine.run("https://spa-example.com")

Practical Defaults

Start conservative:

  • concurrency=5 to 10 for discovery
  • concurrency=2 to 5 for GenAI extraction
  • request_timeout=10 to 30 depending on target speed
  • max_retries=2 or 3 for unstable sites
  • include_link_patterns on every broad crawl

Then increase throughput only after you know the target site responds reliably.
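For reference, one way to fold those defaults into a single settings object (the values sit at the conservative end of each range and are illustrative):

from onecrawler import CrawlerSettings

settings = CrawlerSettings(
    concurrency=5,             # discovery; drop to 2-3 for GenAI extraction
    request_timeout=15,        # raise toward 30 for slow targets
    max_retries=2,             # use 3 for unstable sites
    include_link_patterns=["/articles/*"],  # always scope broad crawls
    link_extraction_limit=100,
)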

High concurrency can reduce success

More workers are not always faster. Browser pages, proxy limits, server rate limits, and model provider limits can all turn high concurrency into retries and empty results.