# Troubleshooting
This page lists common failure modes and the first things to check. Most crawler issues come from scope, browser setup, target-site behavior, or overly aggressive concurrency.
**Debug in stages**

Separate discovery from scraping while troubleshooting. First confirm the URL list, then scrape one URL, then scale to batches.
## Sitemap Discovery Returns No URLs
Check whether the site publishes a sitemap:
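For example, a quick standard-library check that fetches `robots.txt` and lists any `Sitemap:` directives (a minimal sketch; the helper names are ours, not part of onecrawler):

```python
import urllib.request

def fetch_robots(base_url: str) -> str:
    """Download robots.txt from the site's origin."""
    with urllib.request.urlopen(f"{base_url}/robots.txt", timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def sitemap_urls(robots_txt: str) -> list[str]:
    """Return every Sitemap: directive found in a robots.txt body."""
    return [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("sitemap:")
    ]
```

If `sitemap_urls(fetch_robots("https://www.example.com"))` comes back empty, the site may simply not publish a sitemap, and HTML-based discovery is the fallback.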
Then try:
- remove or loosen `include_link_patterns`
- increase `request_timeout`
- keep `sitemap_html_fallback=True`
- lower `concurrency` if the site is rejecting bursts
- start from the canonical origin, such as `https://www.example.com`
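Sketched as a configuration (the field names are taken from this page; check them and the illustrative values against your onecrawler version):

```python
from onecrawler import CrawlerSettings

# Loosened discovery settings; values are illustrative.
settings = CrawlerSettings(
    include_link_patterns=None,   # remove or loosen pattern filters
    request_timeout=60,           # give slow sitemap endpoints more time
    sitemap_html_fallback=True,   # fall back to HTML if no sitemap exists
    concurrency=2,                # lower if the site rejects bursts
)
```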
Some sites put sitemaps on a different host or subdomain. If robots.txt points there, use the main site URL and let UniversalSiteMap follow the sitemap directive.
**Try the canonical host**

https://example.com, https://www.example.com, and regional subdomains can publish different robots.txt and sitemap files. Start from the public URL users actually visit.
## Link Extraction Finds Too Few Links
Common causes:
- the links are outside the starting URL's host
- the links appear only after scrolling or JavaScript interaction
- `include_link_patterns` is too restrictive
- the page uses buttons or client-side routes instead of anchor tags
- the section requires authentication
Try shallow extraction first on the exact page where you can see the links. If that works, switch to deep extraction with the same filters.
For lazy-loaded pages, enable human behavior simulation:
```python
from onecrawler import CrawlerSettings, HumanBehaviorSettings

settings = CrawlerSettings(
    enable_human_behaviors=True,
    human_behavior_settings=HumanBehaviorSettings(max_scrolls=30),
)
```
**Some links are not real anchors**

Pages that navigate with buttons, forms, or client-side router state may not expose normal `<a href>` links. Those pages may need custom handling outside generic link extraction.
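One way to confirm this is to count real `<a href>` anchors in the page's raw HTML; if the count is near zero while the page visibly shows links, the site is navigating some other way. A standard-library sketch:

```python
from html.parser import HTMLParser

class AnchorCounter(HTMLParser):
    """Collect the href of every real <a href> anchor in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def count_anchors(html: str) -> int:
    parser = AnchorCounter()
    parser.feed(html)
    return len(parser.hrefs)
```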
## Playwright Browser Errors
Install browser binaries:
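The standard Playwright commands (run them in the same environment that launches the crawler):

```shell
pip install playwright
playwright install chromium
```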
In containers:
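In a container image, Playwright's dependency installer can also pull in the system libraries that headless Chromium needs:

```shell
playwright install --with-deps chromium
```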
If browser launch fails in CI, check sandbox restrictions and system libraries. The default launch args include `--no-sandbox`, but some environments still need Playwright's dependency installer.
**Verify Playwright separately**

Before debugging crawler code, run a tiny Playwright script or browser install check in the same environment. That isolates dependency problems from crawler configuration problems.
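Such a check might look like this (the function name is ours; it only verifies that Playwright imports and Chromium actually launches):

```python
def check_playwright() -> str:
    """Return 'ok' if Playwright can launch Chromium, else a diagnostic string."""
    try:
        from playwright.sync_api import sync_playwright
    except ImportError:
        return "playwright not installed: pip install playwright"
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(args=["--no-sandbox"])
            browser.close()
        return "ok"
    except Exception as exc:  # missing binaries, sandbox restrictions, etc.
        return f"browser launch failed: {exc}"

print(check_playwright())
```

Anything other than `ok` points at the environment, not at your crawler configuration.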
## Scraping Returns None
A `None` result means the page could not be fetched, contained no extractable content, or the extraction strategy failed.
Try:
- open the URL in a browser and confirm it returns content
- increase `request_timeout`
- lower `concurrency`
- use browser-backed scraping for JavaScript-rendered pages
- inspect whether the target blocks automated requests
- retry with `scraping_output_format="json"` for easier debugging
For batch jobs, persist failed URLs separately so you can retry them without running discovery again.
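A minimal pattern for that bookkeeping (the `scrape` callable stands in for whatever scraping entry point you use; the file name is arbitrary):

```python
import json
from pathlib import Path

def scrape_batch(urls, scrape, failed_path="failed_urls.json"):
    """Scrape each URL, collecting successes and persisting failures for retry."""
    results, failed = {}, []
    for url in urls:
        content = scrape(url)
        if content is None:
            failed.append(url)
        else:
            results[url] = content
    # Persist failures so a retry run can skip discovery entirely.
    Path(failed_path).write_text(json.dumps(failed, indent=2))
    return results, failed
```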
**Check page type**

Search pages, login pages, image galleries, and policy pages often return little or no article-like content. Filter them out with `include_link_patterns`.
## GenAI Configuration Errors
If `scraping_strategy="genai"`, you must provide `genai` settings and keep `scraping_output_format="json"`.
```python
from onecrawler import CrawlerSettings, GenerativeAISettings

settings = CrawlerSettings(
    scraping_strategy="genai",
    scraping_output_format="json",
    genai=GenerativeAISettings(
        provider="openai",
        model_name="gpt-4o-mini",
        api_key="YOUR_API_KEY",
        output_schema=MySchema,  # your structured-output schema class
    ),
)
```
Use low concurrency for GenAI workflows to avoid provider rate limits.
**GenAI failures may be provider-side**

Rate limits, model availability, schema validation, and network failures can all look similar in batch logs. Capture failed URLs and error messages for retry.
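When retrying, exponential backoff with jitter keeps pressure off the provider. A generic sketch, not an onecrawler API:

```python
import random
import time

def with_backoff(call, retries=5, base_delay=1.0):
    """Retry `call` with exponential backoff plus jitter; re-raise on final failure."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise
            # Sleep base_delay * 2^attempt, scaled by up to 2x random jitter.
            time.sleep(base_delay * 2 ** attempt * (1 + random.random()))
```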
## Slow Crawls
Slow crawls usually come from browser overhead, target latency, retries, or simulated human behavior.
Improve throughput in this order:
- Prefer sitemap discovery over browser crawling.
- Narrow `include_link_patterns`.
- Disable human behavior simulation unless needed.
- Increase `concurrency` gradually.
- Reduce `link_extraction_limit` to the batch size you actually need.
- Save discovered URLs and scrape them in separate batches.
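The last step can be as simple as persisting discovery output once, then reading it back in fixed-size batches (a standard-library sketch; the file name and batch size are arbitrary):

```python
from pathlib import Path

def save_urls(urls, path="discovered_urls.txt"):
    """Persist discovery output so scraping can be rerun without re-discovery."""
    Path(path).write_text("\n".join(urls))

def load_batches(path="discovered_urls.txt", batch_size=50):
    """Read saved URLs back as fixed-size batches."""
    urls = Path(path).read_text().splitlines()
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```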
If failures rise as speed increases, back off. A slightly slower crawl with stable results is better than a fast crawl full of retries and missing pages.
**Tune from narrow to broad**

Start with one section, a small limit, and low concurrency. Increase scope only after results and error rates look healthy.
## Rate Limits And Blocking
Symptoms include HTTP 403 and 429 responses, timeouts, or many empty pages.
Recommended response:
- lower `concurrency`
- increase `retry_delay`
- use a clear user agent
- respect robots.txt and site terms
- crawl narrower sections
- schedule jobs during quieter periods
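A minimal sketch of a polite fetch loop using only the standard library (the user agent string and delay are placeholders, not recommendations):

```python
import time
import urllib.request

USER_AGENT = "my-crawler/1.0 (+https://example.com/crawler-info)"  # placeholder

def build_request(url):
    """Attach a clear, identifying user agent to every request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def polite_fetch(urls, delay=2.0):
    """Fetch sequentially, pausing between requests instead of bursting."""
    for url in urls:
        with urllib.request.urlopen(build_request(url), timeout=30) as resp:
            yield url, resp.status, resp.read()
        time.sleep(delay)
```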
Avoid treating blocking as only a technical problem. Production crawling should be predictable and respectful.
**Respect target sites**

Proxies, retries, and delays are operational tools, not permission to ignore robots.txt, terms, or rate limits.
## GitHub Pages Markdown Does Not Render
Make sure the page is inside the published `docs/` directory and has YAML front matter:
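For example (the exact fields depend on your theme; `layout` and `title` are typical for Jekyll):

```yaml
---
layout: default
title: Troubleshooting
---
```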
For a project site, set GitHub Pages to publish from the `main` branch and the `/docs` folder.