Ultimate Sitemap Parser

Ultimate Sitemap Parser (USP) is a performant and robust Python library for parsing and crawling sitemaps.

  • Supports all sitemap formats: Sitemap XML, Google News, plain text, RSS 2.0, Atom 0.3/1.0.

  • Error-tolerant: Handles common sitemap bugs gracefully.

  • Automatic sitemap discovery: Finds sitemaps from robots.txt and from common sitemap names.

  • Fast and memory efficient: Uses Expat XML parsing and keeps memory use low even with massive sitemap hierarchies. Swaps sub-sitemaps to disk and loads them lazily.

  • Field-tested with ~1 million URLs: Originally developed for the Media Cloud project where it was used to parse approximately 1 million sitemaps.
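To give a feel for what robots.txt-based discovery involves, here is a minimal standard-library sketch that pulls `Sitemap:` entries out of a robots.txt body. It is an illustration only and not USP's own implementation; the function name `sitemap_urls_from_robots` and the example URLs are assumptions made for this sketch.

```python
from typing import List


def sitemap_urls_from_robots(robots_txt: str) -> List[str]:
    """Return URLs declared on `Sitemap:` lines, in order, without duplicates."""
    urls: List[str] = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        # (Simplification: a '#' inside a URL would also be cut off here.)
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap":
            url = value.strip()
            if url and url not in urls:
                urls.append(url)
    return urls


# Hypothetical robots.txt content used only for demonstration.
robots = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.org/sitemap.xml
sitemap: https://www.example.org/news-sitemap.xml  # news section
"""

for url in sitemap_urls_from_robots(robots):
    print(url)
```

USP performs this discovery for you and additionally probes common sitemap names, so a helper like this is not needed when using the library.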

Installation

Ultimate Sitemap Parser can be installed from PyPI or conda-forge:

$ pip install ultimate-sitemap-parser
$ conda install -c conda-forge ultimate-sitemap-parser

Usage

USP is easy to use. With a single function call, it can traverse and parse a website’s sitemaps:

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://www.example.org/')

for page in tree.all_pages():
    print(page.url)
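Because `all_pages()` yields page objects one at a time, the results can be fed into ordinary Python code. The sketch below groups page URLs by hostname; `group_urls_by_host` is a hypothetical helper written for this example, not part of USP, and it relies only on each page having a `url` attribute as shown above. Stand-in objects are used so the snippet runs without network access.

```python
from collections import defaultdict
from types import SimpleNamespace
from urllib.parse import urlsplit


def group_urls_by_host(pages):
    """Map each hostname to the sorted, de-duplicated URLs found for it."""
    grouped = defaultdict(set)
    for page in pages:
        # Pages are assumed to expose a `.url` string attribute.
        host = urlsplit(page.url).hostname or ""
        grouped[host].add(page.url)
    return {host: sorted(urls) for host, urls in grouped.items()}


# Stand-ins for real pages (assumption: only `.url` is needed here).
sample_pages = [
    SimpleNamespace(url="https://www.example.org/about"),
    SimpleNamespace(url="https://www.example.org/about"),
    SimpleNamespace(url="https://blog.example.org/post-1"),
]

for host, urls in group_urls_by_host(sample_pages).items():
    print(host, len(urls))
```

With a real tree, the same helper would be called as `group_urls_by_host(tree.all_pages())`.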

Advanced Features

  • CLI Client: Use the usp ls tool to work with sitemaps from the command line.

  • Serialisation: Export raw data or save to disk and load later.

  • Local Parsing: Use USP’s sitemap parsers on sitemaps which have already been downloaded.

  • Custom web clients: Instead of the default client built on requests, you can use your own web client by implementing the AbstractWebClient interface.