Settings¶
CrawlerSettings is the shared configuration object used by sitemap discovery, link extraction, and scraping. In production, treat it as the contract for a crawl: it defines scope, speed, retry behavior, browser behavior, and output shape.
```python
from onecrawler import CrawlerSettings

settings = CrawlerSettings(
    link_extraction_limit=500,
    include_link_patterns=["/docs/*"],
    concurrency=8,
    request_timeout=15,
    max_retries=3,
)
```
Make scope explicit
Set link_extraction_limit and include_link_patterns before running broad discovery. These two fields are the easiest way to keep crawls predictable.
Core Settings¶
| Field | Default | Use it for |
|---|---|---|
| link_extraction_strategy | "deep" | Browser link discovery mode: deep or shallow |
| link_extraction_limit | 50 | Hard cap on collected links |
| include_link_patterns | None | Allow-list URL paths such as ["/news/*"] |
| exclude_link_patterns | None | Reserved for exclusion-style filtering |
| scraping_strategy | "heuristic" | heuristic or genai extraction |
| scraping_output_format | "json" | markdown, json, csv, html, python, txt, xml, or xmltei |
| concurrency | 10 | Number of async workers |
| max_retries | 2 | Retry attempts for transient failures |
| request_timeout | 10 | Per-request timeout in seconds |
| retry_delay | 1 | Base delay between retries |
| enable_logging | False | Whether your app should configure logging |
| logging_level | "INFO" | Desired log level |
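For example, a slow or flaky target can be given more generous retry, timeout, and logging behavior. The values below are illustrative starting points, not defaults:

```python
from onecrawler import CrawlerSettings

# Illustrative values for a slow or flaky target; tune per site.
settings = CrawlerSettings(
    request_timeout=20,     # seconds per request
    max_retries=4,          # retry transient failures a few more times
    retry_delay=2,          # base delay between retries
    enable_logging=True,    # turn on crawl logging
    logging_level="DEBUG",  # verbose output while tuning
)
```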
Sitemap Settings¶
| Field | Default | Use it for |
|---|---|---|
| follow_sitemap_index | True | Traverse sitemap indexes and nested XML sitemaps |
| sitemap_html_fallback | True | Crawl same-origin HTML pages when no sitemap records are found |
| max_crawl_depth | 3 | Depth limit for HTML fallback |
| max_crawl_pages | 500 | Page cap for HTML fallback |
| sitemap_user_agent | OneCrawler UA | User agent for sitemap HTTP requests |
| sitemap_respect_robots | True | Intended robots.txt behavior |
| sitemap_deduplicate | True | Normalize and remove duplicate sitemap URLs |
Best practice: keep sitemap_html_fallback=True during exploration, then turn it off for predictable scheduled jobs if you only trust XML sitemap sources.
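A sketch of that split, using the sitemap fields above (the values are illustrative, not defaults):

```python
from onecrawler import CrawlerSettings

# Exploration: allow HTML fallback, but cap how far it wanders.
explore = CrawlerSettings(
    follow_sitemap_index=True,
    sitemap_html_fallback=True,
    max_crawl_depth=2,
    max_crawl_pages=200,
)

# Scheduled job: trust XML sitemap sources only for predictable output.
scheduled = CrawlerSettings(
    follow_sitemap_index=True,
    sitemap_html_fallback=False,
    sitemap_deduplicate=True,
)
```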
HTML fallback is discovery, not scraping
Sitemap HTML fallback is only for finding URLs when XML sources are missing. Use ScraperEngine or Pipeline to extract page content after URLs are discovered.
Browser Settings¶
browser_settings controls Playwright launch, context, proxy, and timeout behavior. Use it when the target site needs JavaScript rendering, a custom user agent, proxy routing, a stored session, or a different viewport.
```python
from onecrawler import BrowserSettings, ContextSettings, CrawlerSettings

settings = CrawlerSettings(
    browser_settings=BrowserSettings(
        context=ContextSettings(
            viewport={"width": 1440, "height": 900},
            locale="en-US",
            timezone_id="UTC",
        )
    )
)
```
For authenticated crawling, use Playwright storage state:
```python
from onecrawler import BrowserSettings, ContextSettings, CrawlerSettings

settings = CrawlerSettings(
    browser_settings=BrowserSettings(
        context=ContextSettings(storage_state="auth-state.json")
    )
)
```
Do not commit storage state
Playwright storage state can contain cookies or authenticated session data. Keep those files out of version control and rotate them like credentials.
Proxy Settings¶
Use proxy for a single proxy or proxies for a rotating proxy pool. The top-level settings are the recommended API because they can be shared across sitemap discovery and browser-backed workflows.
```python
from onecrawler import CrawlerSettings, ProxySettings

settings = CrawlerSettings(
    proxy=ProxySettings(
        server="http://proxy.example:8080",
        username="user",
        password="pass",
    )
)
```
Multiple proxies can rotate with round_robin or random:
```python
settings = CrawlerSettings(
    proxies=[
        ProxySettings(server="http://proxy-1.example:8080"),
        ProxySettings(server="http://proxy-2.example:8080"),
    ],
    proxy_rotation_method="round_robin",
)
```
proxy and proxies are mutually exclusive. Use one proxy for a stable route and a proxy pool when sitemap discovery or future request-heavy workflows should spread traffic across multiple endpoints.
Proxy settings are mutually exclusive
Configure either proxy or proxies, not both. CrawlerSettings raises a validation error when both are provided.
Human Behavior Settings¶
enable_human_behaviors adds optional delay, scroll, and mouse-move simulation during deep browser link extraction.
```python
from onecrawler import CrawlerSettings, HumanBehaviorSettings

settings = CrawlerSettings(
    enable_human_behaviors=True,
    human_behavior_settings=HumanBehaviorSettings(
        min_delay=0.5,
        max_delay=2.0,
        max_scrolls=20,
        min_mouse_moves=2,
        max_mouse_moves=8,
    ),
)
```
Use this sparingly. It can help pages that lazy-load links after scroll, but it also slows crawls significantly. For high-volume discovery, prefer sitemaps first, then plain deep crawling, then human behavior simulation only where needed.
Use human behavior for lazy-loaded links
Enable human behavior simulation when links appear after scrolling. Keep it off for static pages because it deliberately slows every page.
GenAI Settings¶
GenerativeAISettings is required when scraping_strategy="genai". GenAI output is restricted to JSON because structured model responses should be explicit and machine-readable.
Installation¶
First, install the GenAI dependencies for your chosen provider.
Basic Configuration¶
```python
from pydantic import BaseModel

from onecrawler import CrawlerSettings, GenerativeAISettings


class Product(BaseModel):
    name: str
    price: str | None = None
    availability: str | None = None


settings = CrawlerSettings(
    scraping_strategy="genai",        # Required for GenAI extraction
    scraping_output_format="json",    # GenAI only supports JSON
    genai=GenerativeAISettings(
        provider="openai",            # Options: "openai", "google", "ollama"
        model_name="gpt-4o-mini",
        api_key="YOUR_API_KEY",       # Required for OpenAI/Google, optional for Ollama
        output_schema=Product,        # Pydantic model for structured output
    ),
    concurrency=2,                    # Lower concurrency recommended for GenAI
    request_timeout=30,               # Increase timeout for model responses
)
```
Provider-Specific Configuration¶
OpenAI¶
```python
genai=GenerativeAISettings(
    provider="openai",
    model_name="gpt-4o-mini",
    api_key="sk-...",  # Your OpenAI API key
    output_schema=Product,
)
```
Google¶
```python
genai=GenerativeAISettings(
    provider="google",
    model_name="gemini-1.5-pro",
    api_key="AIza...",  # Your Google API key
    output_schema=Product,
)
```
Ollama¶
```python
genai=GenerativeAISettings(
    provider="ollama",
    model_name="llama3:8b",
    base_url="http://localhost:11434/",  # Your Ollama instance
    output_schema=Product,
    # api_key is optional for Ollama
)
```
All Available Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| provider | str | Yes | - | Model provider: "openai", "google", or "ollama" |
| model_name | str | Yes | - | Model identifier |
| api_key | str | Conditional | None | API key for OpenAI/Google, optional for Ollama |
| output_schema | BaseModel | Conditional | None | Pydantic model for structured output |
| base_url | str | Conditional | None | Custom endpoint URL (required for Ollama) |
| reasoning | bool | No | False | Enable reasoning for supported models |
Usage Tips¶
- Lower concurrency: GenAI calls are slower and more expensive. Use concurrency=1-3.
- Increase timeout: Model responses can take 10-30+ seconds. Use request_timeout=30+.
- Structured schemas: Define clear Pydantic models for reliable extraction.
- Error handling: GenAI calls may fail due to rate limits or model errors.
Use GenAI when you need typed fields, normalization, summaries, or extraction that requires interpretation. Avoid it for simple bulk text extraction where heuristic strategy is faster, cheaper, and easier to reproduce.
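For illustration, an interpretation-oriented schema can carry typed and derived fields. The Article model below is a hypothetical example, not part of the library:

```python
from pydantic import BaseModel

from onecrawler import GenerativeAISettings


# Hypothetical schema for interpretation-style extraction: the model
# normalizes the date and writes the summary itself.
class Article(BaseModel):
    title: str
    published: str | None = None  # normalized to YYYY-MM-DD by the model
    summary: str | None = None    # short abstract generated from the page
    topics: list[str] = []


genai = GenerativeAISettings(
    provider="openai",
    model_name="gpt-4o-mini",
    api_key="YOUR_API_KEY",
    output_schema=Article,
)
```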
GenAI requires JSON output
When scraping_strategy="genai", keep scraping_output_format="json" and provide GenerativeAISettings. Other output formats are rejected during settings validation.
Pipeline Configuration¶
Pipeline uses the same CrawlerSettings object but emphasizes specific fields for its orchestrated workflow:
Required for Production¶
Proxy configuration is required for production
Pipeline combines browser navigation and extraction across multiple pages. Use a proxy or proxy pool for production jobs to reduce blocking and keep traffic routing explicit.
Proxy Configuration:
```python
settings = CrawlerSettings(
    proxies=[
        ProxySettings(server="http://proxy1.example.com:8080"),
        ProxySettings(server="http://proxy2.example.com:8080"),
    ],
    proxy_rotation_method="round_robin",
)
```
Key Pipeline Settings¶
| Field | Recommended for Pipeline | Purpose |
|---|---|---|
| link_extraction_limit | 50-200 | Controls total pages crawled in pipeline |
| include_link_patterns | Strongly recommended | Scope crawling to relevant sections |
| concurrency | 3-8 | Browser workers for link discovery |
| enable_human_behaviors | False (default) or True | Simulate human browsing patterns |
| human_behavior_settings | Customizable if enabled | Configure delays, scrolls, mouse movements |
Date Filtering Configuration¶
Pipeline supports date-based content filtering via constructor parameters:
# Filter content by publication date
async with Pipeline(settings,
start_date="2024-01-01",
end_date="2024-12-31") as engine:
results = await engine.run("https://example.com/news")
Date Requirements:

- Format: YYYY-MM-DD
- Content must have a filedate or date field
- Applied after content extraction
Human Behavior Settings¶
When enable_human_behaviors=True, configure realistic browsing:
```python
settings = CrawlerSettings(
    enable_human_behaviors=True,
    human_behavior_settings=HumanBehaviorSettings(
        min_delay=1.0,        # Minimum delay between actions (seconds)
        max_delay=3.0,        # Maximum delay between actions (seconds)
        max_scrolls=5,        # Maximum scroll gestures per page
        min_mouse_moves=2,    # Minimum mouse movements
        max_mouse_moves=5,    # Maximum mouse movements
        mouse_width=100,      # Mouse movement area width
        mouse_height=100,     # Mouse movement area height
        min_mouse_steps=5,    # Minimum steps per movement
        max_mouse_steps=15,   # Maximum steps per movement
        min_mouse_sleep=0.1,  # Minimum sleep between steps
        max_mouse_sleep=0.3,  # Maximum sleep between steps
    ),
)
```
Pipeline Performance Profiles¶
| Use Case | Recommended Settings |
|---|---|
| Small blog | link_extraction_limit=50, concurrency=3, no human behaviors |
| News site | link_extraction_limit=200, concurrency=5, date filtering |
| JavaScript-heavy | link_extraction_limit=100, concurrency=3, enable human behaviors |
| Production crawling | link_extraction_limit=150, concurrency=4, proxy pool required |
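As a concrete sketch, the news-site row combined with the proxy pool and date filtering shown above might look like the following. The proxy servers and URLs are placeholders, and the values are starting points rather than defaults:

```python
from onecrawler import CrawlerSettings, Pipeline, ProxySettings

# News-site profile: moderate scope, proxy pool, date filtering.
settings = CrawlerSettings(
    link_extraction_limit=200,
    include_link_patterns=["/news/*"],
    concurrency=5,
    proxies=[
        ProxySettings(server="http://proxy1.example.com:8080"),
        ProxySettings(server="http://proxy2.example.com:8080"),
    ],
    proxy_rotation_method="round_robin",
)

async with Pipeline(settings, start_date="2024-01-01", end_date="2024-12-31") as engine:
    results = await engine.run("https://example.com/news")
```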
Performance Tuning¶
Tune in this order:
1. Narrow include_link_patterns.
2. Set a realistic link_extraction_limit.
3. Start with moderate concurrency.
4. Increase request_timeout only for slow sites.
5. Add retries for flaky targets.
Good starting profiles:
| Scenario | Settings |
|---|---|
| Small docs site | concurrency=5, link_extraction_limit=100 |
| News section sitemap | concurrency=10, link_extraction_limit=500, path filter |
| JavaScript-heavy site | concurrency=3, browser extraction, longer timeout |
| GenAI extraction | concurrency=2, request_timeout=30, schema required |
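For example, the JavaScript-heavy profile maps roughly to the settings below; link_extraction_strategy="deep" is the browser-backed discovery mode from the Core Settings table, and the numbers are illustrative starting points:

```python
from onecrawler import CrawlerSettings

# JavaScript-heavy site: browser-backed discovery, low concurrency, longer timeout.
settings = CrawlerSettings(
    link_extraction_strategy="deep",  # browser link discovery
    concurrency=3,
    request_timeout=30,
    max_retries=3,
)
```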
Caveats¶
High concurrency is not always faster. Browser pages, network limits, target rate limits, and model APIs can all become bottlenecks. Increase concurrency gradually and watch error rates.
include_link_patterns are matched against URL paths. Prefer patterns like "/news/*" or "/docs/*" instead of full URLs.
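For example (the pattern values are illustrative):

```python
from onecrawler import CrawlerSettings

# Patterns are matched against URL paths, so use path-style globs.
settings = CrawlerSettings(
    include_link_patterns=["/news/*", "/docs/*"],  # preferred: path patterns
    # include_link_patterns=["https://example.com/news/*"],  # avoid: full URLs
)
```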
CrawlerSettings validates GenAI output format at initialization. If you choose scraping_strategy="genai", keep scraping_output_format="json".
Tune one variable at a time
When performance changes, adjust filters, limits, concurrency, timeout, and retries separately. Changing them together makes failures much harder to diagnose.