# Sitemap Discovery Package
The sitemap package provides discovery and parsing utilities for efficient URL collection.
> **Prefer sitemap discovery first.** Sitemaps are usually faster, cheaper, and more stable than browser crawling. Start here before reaching for `LinkExtractionEngine`.
## Classes

### UniversalSiteMap
High-level sitemap resolver that automatically discovers sitemaps through multiple methods.
```python
from onecrawler import CrawlerSettings, UniversalSiteMap

settings = CrawlerSettings()
sitemap = UniversalSiteMap(settings)
urls = await sitemap.run("https://example.com")  # inside an async context
```
#### Features
- robots.txt parsing: Extracts sitemap directives from robots.txt
- Common path discovery: Checks standard sitemap locations
- Nested index parsing: Handles sitemap index files
- HTML fallback: Crawls pages when no sitemaps are found
- Compression support: Handles .xml.gz compressed sitemaps
> **Use the public import.** Most user code should import `UniversalSiteMap` from `onecrawler`, not from an internal package path.
### SiteMap
Lower-level sitemap parser for direct sitemap URL processing.
```python
from onecrawler import CrawlerSettings, SiteMap

settings = CrawlerSettings()
sitemap = SiteMap(settings)
urls = await sitemap.run("https://example.com/sitemap.xml")  # inside an async context
```
#### Features
- Direct parsing: Parse specific sitemap URLs
- URL validation: Validates and normalizes URLs
- Metadata extraction: Extracts lastmod, changefreq, priority
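The exact shape of `run()`'s return value is version-dependent; if entries carry metadata, reading it might look like this sketch (the `url` and `lastmod` attribute names are assumptions, not confirmed API):

```python
from onecrawler import CrawlerSettings, SiteMap

async def summarize(sitemap_url: str) -> None:
    sitemap = SiteMap(CrawlerSettings())
    entries = await sitemap.run(sitemap_url)
    for entry in entries:
        # These attribute names are assumptions: if run() returns plain
        # strings, getattr falls back to the string itself / None.
        loc = getattr(entry, "url", entry)
        lastmod = getattr(entry, "lastmod", None)  # optional sitemap field
        print(loc, lastmod)
```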
### SitemapStats
Statistics tracking for sitemap operations.
```python
from onecrawler import SitemapStats

stats = SitemapStats()
print(f"Discovered {stats.url_count} URLs")
```
#### Properties

- `url_count`: Total URLs discovered
- `sitemap_count`: Number of sitemaps processed
- `error_count`: Number of errors encountered
- `elapsed_time`: Total processing time
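A minimal sketch printing all four properties (exactly how a discovery run hands you a populated stats object may vary by version):

```python
from onecrawler import SitemapStats

stats = SitemapStats()
# A freshly constructed object typically reports zeros; a discovery run
# is what populates these counters.
print(f"URLs: {stats.url_count}")
print(f"Sitemaps: {stats.sitemap_count}")
print(f"Errors: {stats.error_count}")
print(f"Elapsed: {stats.elapsed_time}s")
```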
## Usage Examples

### Basic Sitemap Discovery
```python
import asyncio

from onecrawler import CrawlerSettings, UniversalSiteMap

async def discover_urls():
    settings = CrawlerSettings(
        link_extraction_limit=1000,
        include_link_patterns=["/articles/*"],
    )
    sitemap = UniversalSiteMap(settings)
    urls = await sitemap.run("https://example.com")
    return urls

if __name__ == "__main__":
    asyncio.run(discover_urls())
```
### Advanced Configuration
```python
from onecrawler import CrawlerSettings, UniversalSiteMap

settings = CrawlerSettings(
    follow_sitemap_index=True,
    sitemap_html_fallback=True,
    max_crawl_depth=3,
    max_crawl_pages=500,
    sitemap_user_agent="MyCrawler/1.0",
    sitemap_respect_robots=True,
    sitemap_deduplicate=True,
)
sitemap = UniversalSiteMap(settings)
```
> **HTML fallback can broaden scope.** `sitemap_html_fallback=True` is useful during exploration, but it can crawl same-origin pages when XML sitemaps are missing. Pair it with `link_extraction_limit` and `include_link_patterns`.
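For example, a bounded exploration run can pair the fallback with the limits mentioned above; all settings shown are documented on this page, and the values are illustrative:

```python
from onecrawler import CrawlerSettings, UniversalSiteMap

settings = CrawlerSettings(
    sitemap_html_fallback=True,          # allow same-origin crawling if no XML sitemap
    link_extraction_limit=500,           # hard cap on collected URLs
    include_link_patterns=["/docs/*"],   # keep the fallback crawl on-topic
    max_crawl_depth=2,                   # shallower than the default of 3
    max_crawl_pages=200,                 # fewer pages than the default of 500
)
sitemap = UniversalSiteMap(settings)
```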
### Direct Sitemap Parsing
```python
from onecrawler import CrawlerSettings, SiteMap

async def parse_specific_sitemap():
    settings = CrawlerSettings()
    sitemap = SiteMap(settings)
    urls = await sitemap.run("https://example.com/sitemap.xml")
    return urls
```
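If you already know several sitemap URLs, the parses can run concurrently. This sketch assumes each `run()` call is independent; whether a single `SiteMap` instance is safe to share across tasks is not documented, so one instance per URL is used:

```python
import asyncio

from onecrawler import CrawlerSettings, SiteMap

async def parse_many(sitemap_urls: list[str]) -> dict[str, list]:
    settings = CrawlerSettings()
    # One parser per URL, in case instances are not safe to share
    # across concurrent tasks (an assumption worth verifying).
    tasks = [SiteMap(settings).run(url) for url in sitemap_urls]
    results = await asyncio.gather(*tasks)
    return dict(zip(sitemap_urls, results))
```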
## Configuration

Sitemap behavior is controlled through `CrawlerSettings`:
| Setting | Description | Default |
|---|---|---|
| `follow_sitemap_index` | Traverse sitemap indexes | `True` |
| `sitemap_html_fallback` | Crawl pages when no sitemaps are found | `True` |
| `max_crawl_depth` | Depth limit for HTML fallback | `3` |
| `max_crawl_pages` | Page limit for HTML fallback | `500` |
| `sitemap_user_agent` | User agent for sitemap requests | Custom |
| `sitemap_respect_robots` | Follow robots.txt rules | `True` |
| `sitemap_deduplicate` | Remove duplicate URLs | `True` |
## Discovery Process

`UniversalSiteMap` follows this discovery order:

1. robots.txt: Check for `Sitemap:` directives
2. Common paths: Try standard locations: `/sitemap.xml`, `/sitemap_index.xml`, `/sitemap.xml.gz`, `/sitemaps.xml`
3. Nested indexes: Parse sitemap index files recursively
4. HTML fallback: Crawl pages if no sitemaps are found
> **Disable fallback for strict sitemap jobs.** If a scheduled job should only trust XML sitemap sources, set `sitemap_html_fallback=False` after you confirm the sitemap URLs you need.
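For intuition, here is a simplified, standard-library-only sketch of the first two steps of that order. It is illustrative and not onecrawler's actual implementation:

```python
import urllib.request
from urllib.parse import urljoin

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap.xml.gz", "/sitemaps.xml"]

def find_sitemaps(base_url: str) -> list[str]:
    found = []
    # Step 1 - robots.txt: collect any "Sitemap:" directives.
    try:
        with urllib.request.urlopen(urljoin(base_url, "/robots.txt"), timeout=10) as resp:
            for line in resp.read().decode("utf-8", "replace").splitlines():
                if line.lower().startswith("sitemap:"):
                    found.append(line.split(":", 1)[1].strip())
    except OSError:
        pass
    # Step 2 - common paths: probe the standard locations.
    for path in COMMON_PATHS:
        candidate = urljoin(base_url, path)
        try:
            with urllib.request.urlopen(candidate, timeout=10) as resp:
                if resp.status == 200:
                    found.append(candidate)
        except OSError:
            continue
    # Steps 3 and 4 (recursive index parsing, HTML fallback) would follow.
    return found
```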
## Sitemap Formats Supported

### Standard XML Sitemap
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2024-01-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
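If you ever need to parse such a file outside onecrawler, the standard library suffices; a minimal sketch:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_urlset(xml_text: str) -> list[dict]:
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            # The remaining fields are optional; missing elements yield None.
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries
```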
### Sitemap Index
```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap1.xml</loc>
    <lastmod>2024-01-01</lastmod>
  </sitemap>
</sitemapindex>
```
### Compressed Sitemaps

Supports `.xml.gz` compressed sitemaps for faster downloads.
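Handling `.xml.gz` manually is straightforward with the standard `gzip` module; this sketch reuses the `parse_urlset` helper from the sketch above:

```python
import gzip

def read_gzipped_sitemap(path: str) -> str:
    # Text mode ("rt") decompresses and decodes in one step.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read()

# entries = parse_urlset(read_gzipped_sitemap("sitemap.xml.gz"))
```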
## Performance Tips

- Prefer sitemaps: Always use sitemaps when available
- Set limits: Use `link_extraction_limit` to control scope
- Filter patterns: Use `include_link_patterns` for targeted URLs
- Monitor stats: Track discovery rates and errors
- Fallback control: Disable HTML fallback for predictable jobs
> **Metadata availability varies.** Sitemap fields such as `lastmod`, `changefreq`, and `priority` are optional. Treat them as hints from the publisher, not guaranteed freshness signals.
## Error Handling
The sitemap system gracefully handles:
- Network errors: Automatic retries with exponential backoff
- Malformed XML: Parser error recovery
- Missing sitemaps: Falls back to HTML discovery
- Rate limiting: Respects `Retry-After` headers
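These retries happen inside the library; if you also want a caller-side guard for scheduled jobs, a simple wrapper with its own exponential backoff might look like this (the policy shown is an example, not library behavior):

```python
import asyncio

from onecrawler import CrawlerSettings, UniversalSiteMap

async def discover_with_retries(url: str, attempts: int = 3):
    sitemap = UniversalSiteMap(CrawlerSettings())
    for attempt in range(attempts):
        try:
            return await sitemap.run(url)
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
```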
## Best Practices
- Check robots.txt: Respect site crawling policies
- Use appropriate user agent: Identify your crawler
- Set reasonable limits: Don't overwhelm target servers
- Monitor performance: Track discovery success rates
- Handle errors gracefully: Implement retry logic
> **Respect crawl policies.** Even sitemap discovery can produce a large URL list. Keep limits reasonable, identify your crawler with a user agent when appropriate, and follow the target site's crawling policies.
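A settings profile reflecting these practices might look like the following sketch; every field is documented above and the values are illustrative:

```python
from onecrawler import CrawlerSettings

polite_settings = CrawlerSettings(
    sitemap_user_agent="MyCrawler/1.0 (+https://example.com/bot)",  # identify yourself
    sitemap_respect_robots=True,   # honor robots.txt rules
    sitemap_deduplicate=True,      # avoid processing duplicate URLs
    link_extraction_limit=1000,    # keep the URL list bounded
)
```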