Scraping Engine Package¶
The scraping package provides content extraction engines for web pages, supporting both heuristic and AI-powered approaches.
Start with heuristic extraction
Use the heuristic strategy for bulk article, blog, documentation, or catalog text extraction. Add GenAI only when you need typed fields, summaries, normalization, or semantic interpretation.
Classes¶
ScraperEngine¶
Main scraping engine that supports both heuristic and GenAI content extraction strategies.
```python
from onecrawler import CrawlerSettings, ScraperEngine

async with ScraperEngine(settings) as scraper:
    data = await scraper.run("https://example.com/article")
```
Parameters¶
settings (CrawlerSettings): Configuration for scraping behavior
Methods¶
- run(url: str) -> Any: Extract content from the given URL
- run(urls: List[str]) -> List[Any]: Extract content from multiple URLs
Strategies¶
- Heuristic: Fast, rule-based extraction using trafilatura
- GenAI: AI-powered extraction with structured output
Single URL vs list behavior
ScraperEngine.run() returns one item for a single URL and a list for multiple URLs. Failed extractions are omitted from list results.
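If downstream code should always deal with lists, a small wrapper can make sure run() always receives a list and therefore always returns one. This helper is a sketch, not part of the onecrawler API:

```python
from typing import List, Union

def as_url_list(urls: Union[str, List[str]]) -> List[str]:
    # Wrap a lone URL in a list so ScraperEngine.run() always
    # receives, and therefore always returns, a list.
    # (Hypothetical helper, not part of the onecrawler API.)
    if isinstance(urls, str):
        return [urls]
    return list(urls)
```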
HeuristicStrategy¶
Rule-based content extraction using the trafilatura library.
```python
from onecrawler.crawler.scraper.heuristic.script import HeuristicStrategy

strategy = HeuristicStrategy(settings, browser=browser)
content = await strategy.extract(url)
```
Features¶
- Fast extraction: No model calls, deterministic results
- Multiple formats: HTML, text, metadata extraction
- Language detection: Automatic language identification
- Content cleaning: Removes boilerplate and navigation
GenAIStrategy¶
AI-powered content extraction using language models.
```python
from onecrawler.crawler.scraper.genai.executor import GenAIStrategy

strategy = GenAIStrategy(settings=genai_settings)
content = await strategy.extract(url)
```
Features¶
- Structured output: Pydantic schema-based extraction
- Semantic understanding: Context-aware content extraction
- Field normalization: Consistent data formatting
- Custom schemas: Define your own output structure
Usage Examples¶
Heuristic Scraping¶
```python
import asyncio

from onecrawler import CrawlerSettings, ScraperEngine

async def scrape_heuristic():
    settings = CrawlerSettings(
        scraping_strategy="heuristic",
        scraping_output_format="json",
        concurrency=10,
        request_timeout=15
    )
    async with ScraperEngine(settings) as scraper:
        data = await scraper.run("https://example.com/article")
    return data

if __name__ == "__main__":
    asyncio.run(scrape_heuristic())
```
GenAI Scraping¶
```python
import asyncio

from pydantic import BaseModel

from onecrawler import CrawlerSettings, GenerativeAISettings, ScraperEngine

class Article(BaseModel):
    title: str
    author: str
    content: str
    published_date: str

async def scrape_with_ai():
    settings = CrawlerSettings(
        scraping_strategy="genai",
        scraping_output_format="json",
        genai=GenerativeAISettings(
            provider="openai",
            model_name="gpt-4o-mini",
            api_key="your-api-key",
            output_schema=Article
        ),
        concurrency=2,
        request_timeout=30
    )
    async with ScraperEngine(settings) as scraper:
        data = await scraper.run("https://example.com/article")
    return data

if __name__ == "__main__":
    asyncio.run(scrape_with_ai())
```
GenAI has operational cost
Model extraction is slower and can hit provider rate limits. Keep concurrency low, increase request_timeout, and monitor cost per page.
Batch Scraping¶
```python
async def scrape_multiple():
    urls = [
        "https://example.com/article1",
        "https://example.com/article2",
        "https://example.com/article3"
    ]
    settings = CrawlerSettings(
        scraping_strategy="heuristic",
        concurrency=5,
        max_retries=3
    )
    async with ScraperEngine(settings) as scraper:
        results = await scraper.run(urls)
    return results
```
Persist failed URLs
For large batches, save failed or empty URLs separately. Retrying only failures is faster than repeating discovery and scraping the whole batch.
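Because failed extractions are simply omitted from list results, the retry set can be computed as the difference between the requested URLs and the URLs that came back. The sketch below assumes each result dict records its source URL under a "url" key; verify that against your actual output format:

```python
import json
from typing import Dict, Iterable, List

def failed_urls(requested: Iterable[str], results: List[Dict]) -> List[str]:
    # Results omit failed extractions, so the retry set is the
    # difference between requested URLs and returned URLs.
    # Assumes each result dict has a "url" key (an assumption
    # about the output format, check yours).
    scraped = {r.get("url") for r in results}
    return [u for u in requested if u not in scraped]

def save_failed(failed: List[str], path: str = "failed_urls.json") -> None:
    # Persist failures so the next run can retry only these URLs.
    with open(path, "w") as f:
        json.dump(failed, f, indent=2)
```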
Configuration¶
Scraping behavior is controlled through CrawlerSettings:
| Setting | Description | Default |
|---|---|---|
| scraping_strategy | "heuristic" or "genai" | "heuristic" |
| scraping_output_format | Output format | "json" |
| concurrency | Number of parallel workers | 10 |
| request_timeout | Per-request timeout in seconds | 10 |
| max_retries | Retry attempts | 2 |
| genai | GenAI configuration | None |
Output Formats¶
Heuristic Strategy¶
- JSON: Structured data with metadata
- Markdown: Clean text formatting
- HTML: Original HTML structure
- Text: Plain text content
GenAI Strategy¶
- JSON: Structured output matching Pydantic schema
- Custom formats: Based on your schema definition
GenAI output is JSON-only
Configure scraping_output_format="json" when using scraping_strategy="genai". Other formats are rejected during settings validation.
Performance Considerations¶
Heuristic Scraping¶
- Fast: No model calls, deterministic timing
- Lightweight: Lower memory and CPU usage
- Scalable: Higher concurrency possible
- Consistent: Predictable performance
GenAI Scraping¶
- Slower: Model response time (10-30 seconds)
- Expensive: API costs per request
- Limited concurrency: Lower parallelism
- Variable: Response time depends on model
Split discovery and scraping
Discover URLs once, store them, then scrape in controlled batches. This makes retries, rate-limit recovery, and cost tracking much easier.
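Once URLs are stored, scraping in fixed-size batches keeps each run small enough to retry, pause, or cost-track on its own. A minimal batching helper (not part of onecrawler, just a pattern sketch):

```python
from typing import Iterator, List

def batched(urls: List[str], size: int) -> Iterator[List[str]]:
    # Yield fixed-size batches so each scrape run can be retried,
    # rate-limit-paused, or cost-tracked independently.
    for start in range(0, len(urls), size):
        yield urls[start:start + size]
```

Each batch can then be passed to ScraperEngine.run() in its own session, with failures persisted between batches.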
Best Practices¶
When to Use Heuristic¶
- Bulk content extraction: Large numbers of pages
- Fast processing: Time-sensitive operations
- Cost efficiency: Budget-conscious projects
- Simple content: Articles, blog posts, documentation
When to Use GenAI¶
- Structured data: Specific fields required
- Complex content: Mixed or unstructured pages
- Normalization: Consistent data formatting
- Semantic extraction: Understanding context
General Tips¶
- Start with heuristic: Faster and cheaper
- Filter URLs: Only scrape relevant pages
- Set timeouts: Handle slow pages gracefully
- Monitor errors: Track failure rates
- Batch processing: Group similar requests
Error Handling¶
The scraper handles various error conditions:
- Network errors: Automatic retries with backoff
- Parsing errors: Graceful degradation
- Timeout errors: Configurable timeouts
- Rate limiting: Respects server limits
- Model errors: Fallback strategies for GenAI
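The retry-with-backoff pattern used for network errors can be sketched as follows. This illustrates the general pattern, not onecrawler's internal retry logic:

```python
import asyncio
import random

async def with_backoff(make_call, attempts: int = 3, base: float = 1.0):
    # Retry a coroutine factory with exponential backoff plus
    # jitter. Illustrative sketch of the pattern, not the
    # library's internal implementation.
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base * (2 ** attempt + random.random()))
```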
Integration Examples¶
Save to File¶
```python
import json

from onecrawler import CrawlerSettings, ScraperEngine

async def scrape_and_save():
    settings = CrawlerSettings(scraping_strategy="heuristic")
    async with ScraperEngine(settings) as scraper:
        data = await scraper.run("https://example.com/article")
    with open("output.json", "w") as f:
        json.dump(data, f, indent=2)
```
Database Integration¶
```python
import asyncio

from onecrawler import CrawlerSettings, ScraperEngine

async def scrape_to_database():
    settings = CrawlerSettings(scraping_strategy="heuristic")
    async with ScraperEngine(settings) as scraper:
        urls = ["https://example.com/page1", "https://example.com/page2"]
        results = await scraper.run(urls)
    # Save to database (save_to_database is a placeholder for
    # your own persistence coroutine)
    for result in results:
        await save_to_database(result)
```
Troubleshooting¶
Common Issues¶
- Empty results: Check URL accessibility and content structure
- Timeout errors: Increase request_timeout or reduce concurrency
- GenAI failures: Verify API keys and model availability
- Memory issues: Reduce concurrency for large batches
- Rate limits: Implement delays between requests
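One way to implement delays between requests is to cap parallelism with a semaphore and pause after each call. This is a generic sketch for rate-limited hosts, not part of the onecrawler API:

```python
import asyncio

async def spaced_calls(factories, max_parallel: int = 3, delay: float = 1.0):
    # Cap parallelism with a semaphore and sleep after each call
    # so request starts are spaced out. (Generic pattern sketch,
    # not part of onecrawler.)
    sem = asyncio.Semaphore(max_parallel)

    async def one(factory):
        async with sem:
            result = await factory()
            await asyncio.sleep(delay)
            return result

    return await asyncio.gather(*(one(f) for f in factories))
```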
Debug Mode¶
Enable detailed logging for troubleshooting:
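A minimal sketch, assuming onecrawler emits logs through Python's standard logging module under a "onecrawler" logger name (an assumption; check the library's documentation for the exact logger name):

```python
import logging

# Assumes onecrawler logs via the standard logging module under
# the "onecrawler" logger name. Verify the logger name against
# the library's documentation.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logging.getLogger("onecrawler").setLevel(logging.DEBUG)
```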
Empty content is not always an error
Some pages are navigation, search, login, or media-only pages with little extractable text. Use URL filters to keep these out of scraping batches.
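A URL filter for keeping such pages out of a batch can be as simple as a path regex. The pattern below is illustrative only; tune it to the target site:

```python
import re
from typing import List

def keep_content_urls(urls: List[str],
                      pattern: str = r"/(article|blog|docs)/") -> List[str]:
    # Keep only URLs whose path looks like a content page; drop
    # navigation, search, and login pages. The default pattern is
    # an illustrative assumption, adjust per site.
    rx = re.compile(pattern)
    return [u for u in urls if rx.search(u)]
```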