Ultimate Sitemap Parser

Ultimate Sitemap Parser (USP) is a performant and robust Python library for parsing and crawling sitemaps.

Supports all sitemap formats: Sitemap XML, Google News, plain text, RSS 2.0, Atom 0.3/1.0.
Error-tolerant: Handles common sitemap bugs gracefully.
Automatic sitemap discovery: Finds sitemaps listed in robots.txt and by trying common sitemap names.
Fast and memory efficient: Uses Expat XML parsing and keeps memory use low even with massive sitemap hierarchies. Sub-sitemaps are swapped to disk and loaded lazily.
Field-tested with ~1 million URLs: Originally developed for the Media Cloud project where it was used to parse approximately 1 million sitemaps.
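
To make the discovery and Expat points above concrete, here is a minimal standard-library sketch. It is not USP's own implementation, and the function names sitemap_urls_from_robots and page_urls_from_sitemap are hypothetical, chosen only for illustration. It assumes the robots.txt body and the sitemap XML have already been downloaded as strings.

```python
import xml.parsers.expat


def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Collect the URLs declared on `Sitemap:` lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop any trailing comment, then split "field: value" at the first colon
        # so that the colon inside "https://" stays part of the value.
        field, sep, value = line.split("#", 1)[0].partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def page_urls_from_sitemap(xml_text: str) -> list[str]:
    """Collect the text of every <loc> element using Expat's event callbacks.

    For a sitemap index, the returned values are sub-sitemap URLs instead.
    """
    urls = []
    buffer = []      # text chunks of the <loc> element currently being read
    in_loc = False

    def start_element(name, attrs):
        nonlocal in_loc
        if name == "loc":
            in_loc = True
            buffer.clear()

    def end_element(name):
        nonlocal in_loc
        if in_loc and name == "loc":
            urls.append("".join(buffer).strip())
            in_loc = False

    def character_data(data):
        # Expat may deliver one text node in several chunks, so accumulate them.
        if in_loc:
            buffer.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = character_data
    parser.Parse(xml_text, True)
    return urls
```

Because Expat reports elements through callbacks as it reads, no document tree is built in memory, which is the general reason an event-driven parser suits very large sitemaps. A production parser would also need to handle malformed XML, other sitemap formats, and nested sitemap indexes, none of which this sketch attempts.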

Installation

Ultimate Sitemap Parser can be installed from PyPI or conda-forge:

pip

$ pip install ultimate-sitemap-parser

conda

$ conda install -c conda-forge ultimate-sitemap-parser

Usage

With a single function call, USP can traverse and parse a website’s sitemaps:

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.example.org/')

for page in tree.all_pages():
    print(page.url)

Advanced Features

CLI Client: Use the usp ls tool to work with sitemaps from the command line.
Serialisation: Export raw data or save to disk and load later.
Local Parsing: Use USP’s sitemap parsers on sitemaps which have already been downloaded.
Custom Web Clients: Instead of the default client built on requests, you can use your own web client by implementing the AbstractWebClient interface.