# Settings Package

The `onecrawler.settings` package provides configuration classes for all crawler components.
**Treat settings as the crawl contract**
Put scope, limits, concurrency, retries, proxy behavior, and output format in `CrawlerSettings`. Explicit settings make crawls easier to review, reproduce, and schedule.
## Classes

### CrawlerSettings
Central configuration class that controls all crawler behavior.
```python
from onecrawler import CrawlerSettings

settings = CrawlerSettings(
    link_extraction_limit=500,
    include_link_patterns=["/articles/*"],
    concurrency=10,
    scraping_strategy="heuristic"
)
```
#### Core Settings

| Setting | Type | Default | Description |
|---|---|---|---|
| `link_extraction_strategy` | `str` | `"deep"` | Browser discovery mode: `"deep"` or `"shallow"` |
| `link_extraction_limit` | `int` | `50` | Maximum number of URLs to collect |
| `include_link_patterns` | `List[str]` | `None` | URL path patterns to include |
| `exclude_link_patterns` | `List[str]` | `None` | URL path patterns to exclude |
| `scraping_strategy` | `str` | `"heuristic"` | Extraction strategy: `"heuristic"` or `"genai"` |
| `scraping_output_format` | `str` | `"json"` | Output format for scraped content |
| `concurrency` | `int` | `10` | Number of async workers |
| `request_timeout` | `int` | `10` | Per-request timeout in seconds |
| `max_retries` | `int` | `2` | Retry attempts for failed requests |
| `retry_delay` | `int` | `1` | Base delay between retries, in seconds |
**Do not run broad crawls without limits**
`link_extraction_limit` and `include_link_patterns` are your main safety rails. Set them before using deep browser discovery or `Pipeline`.
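For example, a deep crawl can be scoped and capped up front; the limit and patterns below are placeholders for your own site:

```python
from onecrawler import CrawlerSettings

# Deep browser discovery, bounded before it starts.
settings = CrawlerSettings(
    link_extraction_strategy="deep",
    link_extraction_limit=200,                  # hard cap on collected URLs
    include_link_patterns=["/docs/*"],          # stay inside the target section
    exclude_link_patterns=["/docs/archive/*"],  # skip known noise
)
```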
#### Sitemap Settings

| Setting | Type | Default | Description |
|---|---|---|---|
| `follow_sitemap_index` | `bool` | `True` | Traverse sitemap indexes |
| `sitemap_html_fallback` | `bool` | `True` | Crawl pages when no sitemaps are found |
| `max_crawl_depth` | `int` | `3` | Depth limit for the HTML fallback |
| `max_crawl_pages` | `int` | `500` | Page limit for the HTML fallback |
| `sitemap_user_agent` | `str` | Custom | User agent for sitemap requests |
| `sitemap_respect_robots` | `bool` | `True` | Follow robots.txt rules |
| `sitemap_deduplicate` | `bool` | `True` | Remove duplicate URLs |
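As an illustration, a conservative sitemap setup that allows a bounded HTML fallback might look like this sketch (the values are placeholders):

```python
from onecrawler import CrawlerSettings

settings = CrawlerSettings(
    follow_sitemap_index=True,    # walk sitemap indexes down to child sitemaps
    sitemap_html_fallback=True,   # crawl pages if no sitemaps are found
    max_crawl_depth=2,            # keep the fallback shallow
    max_crawl_pages=200,          # and bounded
    sitemap_respect_robots=True,
    sitemap_deduplicate=True,
)
```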
#### Browser Settings

| Setting | Type | Default | Description |
|---|---|---|---|
| `browser_settings` | `BrowserSettings` | Default | Playwright browser configuration |
#### Proxy Settings

| Setting | Type | Default | Description |
|---|---|---|---|
| `proxy` | `ProxySettings` | `None` | Single proxy configuration |
| `proxies` | `List[ProxySettings]` | `None` | Multiple proxies for rotation |
| `proxy_rotation_method` | `str` | `"round_robin"` | Proxy rotation strategy |
**Use top-level proxy settings**
Prefer `proxy` or `proxies` on `CrawlerSettings` so sitemap, browser, and pipeline workflows share the same network configuration.
#### GenAI Settings

| Setting | Type | Default | Description |
|---|---|---|---|
| `genai` | `GenerativeAISettings` | `None` | AI extraction configuration |
### GenerativeAISettings
Configuration for AI-powered content extraction.
```python
from onecrawler import GenerativeAISettings

genai = GenerativeAISettings(
    provider="openai",
    model_name="gpt-4o-mini",
    api_key="your-api-key",
    output_schema=ArticleModel  # a Pydantic model defining the extraction schema
)
```
**Keep API keys out of source**
Pass provider keys through environment variables or your secret manager. Avoid committing keys in examples, settings files, or notebooks.
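A minimal sketch of that pattern, assuming the key is exported as `OPENAI_API_KEY` in your environment:

```python
import os

from onecrawler import GenerativeAISettings

# Read the key from the environment instead of hardcoding it.
genai = GenerativeAISettings(
    provider="openai",
    model_name="gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"],  # raises KeyError if unset
)
```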
#### Fields

| Setting | Type | Required | Description |
|---|---|---|---|
| `provider` | `str` | Yes | Model provider: `"openai"`, `"google"`, or `"ollama"` |
| `model_name` | `str` | Yes | Model identifier |
| `api_key` | `str` | Conditional | API key for OpenAI/Google |
| `output_schema` | `BaseModel` | Conditional | Pydantic model for structured output |
| `base_url` | `str` | Conditional | Custom endpoint URL (required for Ollama) |
| `reasoning` | `bool` | No | Enable reasoning for supported models |
### BrowserSettings
Playwright browser configuration.
```python
from onecrawler import BrowserSettings, ContextSettings

browser_settings = BrowserSettings(
    context=ContextSettings(
        viewport={"width": 1440, "height": 900},
        locale="en-US",
        timezone_id="UTC"
    )
)
```
**Use storage state for authenticated pages**
For logged-in crawls, create a Playwright `storage_state` file and reference it from `ContextSettings`. Keep that file private because it may contain cookies.
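One way to produce that file is with Playwright itself; this is a sketch, and the login URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/login")  # placeholder: your login page
    # ...complete the login manually or with page.fill()/page.click()...
    context.storage_state(path="state.json")  # persists cookies and local storage
    browser.close()
```

The resulting file can then be referenced from `ContextSettings(storage_state="state.json")`.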
#### Context Settings

| Setting | Type | Default | Description |
|---|---|---|---|
| `viewport` | `dict` | `{"width": 1366, "height": 768}` | Browser viewport size |
| `locale` | `str` | `"en-US"` | Browser locale |
| `timezone_id` | `str` | `"UTC"` | Timezone identifier |
| `user_agent` | `str` | Default | Custom user agent |
| `storage_state` | `str` | `None` | Path to browser storage state |
### ProxySettings
Proxy configuration for network requests.
```python
from onecrawler import ProxySettings

proxy = ProxySettings(
    server="http://proxy.example:8080",
    username="user",
    password="pass"
)
```
#### Fields

| Setting | Type | Required | Description |
|---|---|---|---|
| `server` | `str` | Yes | Proxy server URL |
| `username` | `str` | No | Proxy username |
| `password` | `str` | No | Proxy password |
### HumanBehaviorSettings
Configuration for human-like browser interactions.
```python
from onecrawler import HumanBehaviorSettings

human_settings = HumanBehaviorSettings(
    min_delay=0.5,
    max_delay=2.0,
    max_scrolls=20,
    min_mouse_moves=2,
    max_mouse_moves=8
)
```
#### Fields

| Setting | Type | Default | Description |
|---|---|---|---|
| `min_delay` | `float` | `0.5` | Minimum delay between actions, in seconds |
| `max_delay` | `float` | `2.0` | Maximum delay between actions, in seconds |
| `max_scrolls` | `int` | `20` | Maximum scroll actions |
| `min_mouse_moves` | `int` | `2` | Minimum mouse movements |
| `max_mouse_moves` | `int` | `8` | Maximum mouse movements |
**Simulation trades speed for coverage**
Human behavior settings can reveal lazy-loaded links, but each delay and scroll lowers throughput. Enable the simulation only for pages that need it.
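This reference does not show which `CrawlerSettings` field accepts these settings, so the `human_behavior` parameter below is a hypothetical name used purely for illustration:

```python
from onecrawler import CrawlerSettings, HumanBehaviorSettings

# NOTE: `human_behavior` is a hypothetical parameter name; check the
# CrawlerSettings signature for the actual attachment point.
settings = CrawlerSettings(
    link_extraction_strategy="deep",  # simulation applies only to deep extraction
    human_behavior=HumanBehaviorSettings(
        min_delay=0.5,
        max_delay=1.0,
        max_scrolls=10,  # fewer scrolls than the default to limit the slowdown
    ),
)
```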
## Usage Examples

### Basic Configuration
```python
from onecrawler import CrawlerSettings

settings = CrawlerSettings(
    link_extraction_limit=100,
    include_link_patterns=["/news/*"],
    concurrency=5,
    request_timeout=15
)
```
### GenAI Configuration
```python
from pydantic import BaseModel

from onecrawler import CrawlerSettings, GenerativeAISettings


class Article(BaseModel):
    title: str
    author: str
    content: str


settings = CrawlerSettings(
    scraping_strategy="genai",
    genai=GenerativeAISettings(
        provider="openai",
        model_name="gpt-4o-mini",
        api_key="your-api-key",
        output_schema=Article
    )
)
```
### Proxy Configuration
```python
from onecrawler import CrawlerSettings, ProxySettings

settings = CrawlerSettings(
    proxies=[
        ProxySettings(server="http://proxy1.example:8080"),
        ProxySettings(server="http://proxy2.example:8080")
    ],
    proxy_rotation_method="round_robin"
)
```
### Browser Configuration
```python
from onecrawler import CrawlerSettings, BrowserSettings, ContextSettings

settings = CrawlerSettings(
    browser_settings=BrowserSettings(
        context=ContextSettings(
            viewport={"width": 1920, "height": 1080},
            locale="en-US",
            user_agent="MyCrawler/1.0"
        )
    )
)
```
## Configuration Validation

`CrawlerSettings` includes automatic validation:
```python
from onecrawler import CrawlerSettings

# This will raise an error
try:
    settings = CrawlerSettings(
        scraping_strategy="genai",
        scraping_output_format="markdown"  # GenAI only supports JSON
    )
except ValueError as e:
    print(f"Configuration error: {e}")
```
### Validation Rules

- **GenAI strategy**: Requires `genai` settings and JSON output format
- **Proxy configuration**: Cannot use both `proxy` and `proxies` simultaneously (see the sketch below)
- **Human behavior**: Only applies to deep link extraction
- **Output formats**: GenAI extraction is limited to JSON format
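The proxy rule can be exercised directly. Per the rules above, combining `proxy` and `proxies` is rejected; catching `ValueError` mirrors the example above, though the exact exception type may differ in practice:

```python
from onecrawler import CrawlerSettings, ProxySettings

try:
    settings = CrawlerSettings(
        proxy=ProxySettings(server="http://proxy1.example:8080"),
        proxies=[ProxySettings(server="http://proxy2.example:8080")],
    )
except ValueError as e:
    print(f"Configuration error: {e}")  # proxy and proxies are mutually exclusive
```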
**Let validation fail early**
Build `CrawlerSettings` near application startup. Invalid proxy combinations or GenAI output formats will fail before a long crawl begins.
## Environment Variables
Settings can be configured using environment variables:
```python
import os

from onecrawler import CrawlerSettings, GenerativeAISettings

settings = CrawlerSettings(
    concurrency=int(os.getenv("CRAWLER_CONCURRENCY", "10")),
    request_timeout=int(os.getenv("CRAWLER_REQUEST_TIMEOUT", "10")),
    genai=GenerativeAISettings(
        provider="openai",
        api_key=os.getenv("OPENAI_API_KEY"),
        model_name="gpt-4o-mini"
    )
)
```
## Configuration Files
Settings can be loaded from configuration files:
```python
import yaml

from onecrawler import CrawlerSettings

# config.yaml
# link_extraction_limit: 500
# concurrency: 8
# scraping_strategy: "heuristic"

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

settings = CrawlerSettings(**config)
```
## Best Practices

- **Set explicit limits**: Always configure `link_extraction_limit`
- **Use path filters**: Apply `include_link_patterns` for focused crawling
- **Configure timeouts**: Set appropriate `request_timeout` values
- **Monitor resources**: Adjust `concurrency` based on system capacity
- **Validate early**: Check configuration before starting crawls
- **Use environment variables**: Keep sensitive data out of code
- **Document settings**: Maintain configuration documentation

The sketch after this list combines several of these practices.
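A minimal sketch combining explicit limits, path filters, timeouts, and environment-driven tuning; the values and patterns are placeholders for your own crawl:

```python
import os

from onecrawler import CrawlerSettings

settings = CrawlerSettings(
    # Explicit limits and scope: the main safety rails
    link_extraction_limit=300,
    include_link_patterns=["/blog/*"],
    # Resource and network tuning, overridable via the environment
    concurrency=int(os.getenv("CRAWLER_CONCURRENCY", "5")),
    request_timeout=15,
    max_retries=2,
    retry_delay=1,
)
```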