Link Extraction Package¶
The link extraction package provides classes for discovering links from rendered web pages using browser automation.
Use this after sitemaps
Browser-based link extraction is best when sitemaps are missing, incomplete, or do not expose JavaScript-rendered links. If a sitemap is available, start with UniversalSiteMap.
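If sitemap coverage is uncertain, one pattern is to try the sitemap first and fall back to browser extraction only when it returns nothing. The sketch below assumes UniversalSiteMap takes the same settings object and exposes an async run(url) returning a list of URLs, mirroring LinkExtractionEngine; check the sitemap package documentation for the exact API.

from onecrawler import CrawlerSettings, LinkExtractionEngine, UniversalSiteMap


async def discover_links(url: str) -> list[str]:
    settings = CrawlerSettings()

    # Assumed API: UniversalSiteMap(settings).run(url) -> list of URLs.
    sitemap_links = await UniversalSiteMap(settings).run(url)
    if sitemap_links:
        return sitemap_links

    # Fall back to browser extraction when the sitemap is missing or empty.
    async with LinkExtractionEngine(settings) as engine:
        return await engine.run(url)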
Classes¶
LinkExtractionEngine¶
The main engine for extracting links from web pages using Playwright browser automation.
from onecrawler import CrawlerSettings, LinkExtractionEngine

settings = CrawlerSettings()

async with LinkExtractionEngine(settings) as engine:
    links = await engine.run("https://example.com")
Parameters¶
settings (CrawlerSettings): Configuration for link extraction behavior
Methods¶
run(url: str) -> List[str]: Extract links from the given URL
Features¶
- Shallow extraction: Extract links from a single page
- Deep extraction: Recursively follow same-site links
- URL filtering: Include/exclude patterns for targeted extraction
- Human behavior simulation: Optional delays and interactions
Choose shallow before deep
Use shallow for listing pages where all target links are visible on one page. Use deep only when you need recursive discovery.
LinkClassifierPipeline¶
Pipeline for classifying and filtering extracted links based on various criteria.
from onecrawler import CrawlerSettings
from onecrawler.crawler.link.classifier import LinkClassifierPipeline

settings = CrawlerSettings()
classifier = LinkClassifierPipeline(settings)
filtered_links = classifier.classify(links)  # links: URLs from LinkExtractionEngine
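In practice the two classes are composed: the engine discovers raw links and the pipeline normalizes and filters them. A minimal sketch using only the methods documented on this page:

import asyncio

from onecrawler import CrawlerSettings, LinkExtractionEngine
from onecrawler.crawler.link.classifier import LinkClassifierPipeline


async def extract_and_classify(url: str) -> list[str]:
    settings = CrawlerSettings(include_link_patterns=["/articles/*"])

    # Discover raw links with the browser engine ...
    async with LinkExtractionEngine(settings) as engine:
        raw_links = await engine.run(url)

    # ... then normalize, deduplicate, and filter them.
    classifier = LinkClassifierPipeline(settings)
    return classifier.classify(raw_links)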
Features¶
- Domain filtering: Ensure same-origin links
- Path pattern matching: Wildcard-based URL filtering
- Deduplication: Remove duplicate URLs
- Normalization: Clean and standardize URLs
Usage Examples¶
Shallow Link Extraction¶
import asyncio

from onecrawler import CrawlerSettings, LinkExtractionEngine


async def extract_shallow():
    settings = CrawlerSettings(
        link_extraction_strategy="shallow",
        link_extraction_limit=50,
        include_link_patterns=["/articles/*"]
    )
    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/latest")
        return links


if __name__ == "__main__":
    asyncio.run(extract_shallow())
Deep Link Extraction¶
import asyncio

from onecrawler import CrawlerSettings, LinkExtractionEngine


async def extract_deep():
    settings = CrawlerSettings(
        link_extraction_strategy="deep",
        link_extraction_limit=300,
        include_link_patterns=["/docs/*"],
        concurrency=5
    )
    async with LinkExtractionEngine(settings) as engine:
        links = await engine.run("https://example.com/docs")
        return links


if __name__ == "__main__":
    asyncio.run(extract_deep())
Deep extraction needs guardrails
Always set link_extraction_limit for deep crawls. Add include_link_patterns whenever you only care about one section, such as /news/* or /docs/*.
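The two pattern lists can be combined as guardrails, for example walking a docs tree while skipping sections you never want. A sketch using the documented settings; the exclude values here are hypothetical, chosen only to illustrate skipping translated copies:

from onecrawler import CrawlerSettings

settings = CrawlerSettings(
    link_extraction_strategy="deep",
    link_extraction_limit=300,
    include_link_patterns=["/docs/*"],
    # Hypothetical exclusions for illustration: skip translated copies.
    exclude_link_patterns=["/docs/ja/*", "/docs/de/*"]
)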
With Human Behavior Simulation¶
from onecrawler import CrawlerSettings, LinkExtractionEngine, HumanBehaviorSettings

settings = CrawlerSettings(
    link_extraction_strategy="deep",
    enable_human_behaviors=True,
    human_behavior_settings=HumanBehaviorSettings(
        min_delay=0.5,
        max_delay=1.5,
        max_scrolls=25
    )
)
Human behavior is for lazy loading
Enable human behavior simulation when links appear after scrolling or delayed rendering. Keep it disabled for normal pages because it slows down every worker.
Configuration¶
The link extraction behavior is controlled through CrawlerSettings:
| Setting | Description | Default |
|---|---|---|
| link_extraction_strategy | "shallow" or "deep" | "deep" |
| link_extraction_limit | Maximum links to extract | 50 |
| include_link_patterns | URL path patterns to include | None |
| exclude_link_patterns | URL path patterns to exclude | None |
| concurrency | Number of parallel browser workers | 10 |
| enable_human_behaviors | Enable human-like interactions | False |
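Putting the table together, here is a settings object that touches every documented option. The values are illustrative; the defaults from the table are noted in comments:

from onecrawler import CrawlerSettings

settings = CrawlerSettings(
    link_extraction_strategy="shallow",      # default "deep"
    link_extraction_limit=100,               # default 50
    include_link_patterns=["/news/*"],       # default None
    exclude_link_patterns=["/news/tags/*"],  # default None
    concurrency=5,                           # default 10
    enable_human_behaviors=False             # default False
)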
Performance Considerations¶
- Memory usage: Each browser page consumes memory
- Concurrency: Start with 3-5 workers, increase gradually
- Rate limiting: Respect target site's capacity
- Timeouts: Adjust for slow-loading pages
Watch browser resource usage
Each concurrent worker may hold a browser page. If memory use, CPU load, or errors from the target site climb, reduce concurrency before increasing timeouts or retries.
Best Practices¶
- Use sitemaps first: Prefer UniversalSiteMap when available
- Filter early: Use include_link_patterns to limit scope
- Set limits: Always specify link_extraction_limit
- Monitor resources: Watch memory and CPU usage
- Handle errors: Implement retry logic for failed pages (a sketch follows this list)
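Retrying is left to the caller, per the best practice above. A minimal sketch, assuming engine.run raises an exception on a failed page; narrow the except clause to onecrawler's actual error types if they are exposed:

import asyncio

from onecrawler import CrawlerSettings, LinkExtractionEngine


async def run_with_retries(url: str, attempts: int = 3) -> list[str]:
    settings = CrawlerSettings(link_extraction_strategy="shallow")
    async with LinkExtractionEngine(settings) as engine:
        for attempt in range(1, attempts + 1):
            try:
                return await engine.run(url)
            except Exception:
                # Assumed failure mode: run raises on a failed page.
                if attempt == attempts:
                    raise
                # Back off between attempts: 1s, 2s, 4s, ...
                await asyncio.sleep(2 ** (attempt - 1))
    return []  # unreachable; keeps type checkers satisfied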
Filtering happens on URL paths
include_link_patterns and exclude_link_patterns should usually look like "/articles/*" or "/docs/*", not full absolute URLs.
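The matcher itself is internal to onecrawler, but the patterns on this page behave like shell-style wildcards applied to the URL path. A rough approximation (not the library's actual code) using Python's fnmatch:

from fnmatch import fnmatch
from urllib.parse import urlparse


def path_matches(url: str, pattern: str) -> bool:
    # Only the path component is compared, which is why "/articles/*"
    # works while a full absolute URL as the pattern generally does not.
    return fnmatch(urlparse(url).path, pattern)


print(path_matches("https://example.com/articles/2024/intro", "/articles/*"))  # True
print(path_matches("https://example.com/about", "/articles/*"))                # False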