Ultimate Sitemap Parser
Ultimate Sitemap Parser (USP) is a performant and robust Python library for parsing and crawling sitemaps.
Supports all sitemap formats: Sitemap XML, Google News, plain text, RSS 2.0, Atom 0.3/1.0.
Error-tolerant: Handles common sitemap bugs gracefully.
Automatic sitemap discovery: Finds sitemaps from robots.txt and from common sitemap names.
Fast and memory efficient: Uses Expat XML parsing and doesn’t consume much memory even with massive sitemap hierarchies. Sub-sitemaps are swapped to disk and loaded lazily.
Field-tested with ~1 million URLs: Originally developed for the Media Cloud project where it was used to parse approximately 1 million sitemaps.
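The discovery and streaming-parse ideas in the list above can be illustrated with a minimal standard-library sketch. This is not USP's implementation: the function names sitemap_urls_from_robots_txt and locs_from_sitemap_xml are hypothetical, and the parser assumes a plain Sitemap XML document whose <loc> elements use the default namespace (no prefix).

```python
from xml.parsers import expat


def sitemap_urls_from_robots_txt(robots_txt: str) -> list[str]:
    """Collect the URLs named by 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop any trailing comment, then split on the first colon only,
        # because the URL itself contains colons.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def locs_from_sitemap_xml(xml_bytes: bytes) -> list[str]:
    """Stream through sitemap XML with Expat and return each <loc> text."""
    locs = []
    buffer = []  # text chunks of the <loc> element currently being read
    in_loc = False

    def start(name, attrs):
        nonlocal in_loc
        if name == "loc":  # assumption: unprefixed element names
            in_loc = True
            buffer.clear()

    def end(name):
        nonlocal in_loc
        if name == "loc":
            in_loc = False
            text = "".join(buffer).strip()
            if text:
                locs.append(text)

    def chars(data):
        # Expat may deliver a single text node in several chunks.
        if in_loc:
            buffer.append(data)

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return locs
```

Because the handlers only keep the text of the element being read, no document tree is built in memory. This is the general property that makes event-based parsers such as Expat a good fit for very large sitemaps.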
Installation
Ultimate Sitemap Parser can be installed from PyPI or conda-forge:
$ pip install ultimate-sitemap-parser
$ conda install -c conda-forge ultimate-sitemap-parser
Usage
USP is very easy to use. With just a single line of code, it can traverse and parse a website’s sitemaps:
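from usp.tree import sitemap_tree_for_homepage

# Discover and parse all sitemaps reachable from the homepage
tree = sitemap_tree_for_homepage('https://www.example.org/')

# Iterate over every page found anywhere in the sitemap hierarchy
for page in tree.all_pages():
    print(page.url)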
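A site may list the same page in more than one sitemap, so iterating over all pages can produce repeated URLs. The sketch below shows one possible way to collapse them. The helpers unique_urls and print_unique_urls are hypothetical names, not part of USP, and the only assumption about each page object is that it exposes a url attribute, as in the example above.

```python
from typing import Iterable, Iterator


def unique_urls(pages: Iterable) -> Iterator[str]:
    """Yield each page URL once, in first-seen order."""
    seen = set()
    for page in pages:
        if page.url not in seen:
            seen.add(page.url)
            yield page.url


def print_unique_urls(homepage_url: str) -> None:
    """Print every distinct page URL found in a site's sitemaps."""
    # Imported lazily so unique_urls() can be reused without USP installed.
    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage(homepage_url)
    for url in unique_urls(tree.all_pages()):
        print(url)
```

Calling print_unique_urls('https://www.example.org/') would fetch and parse the site's sitemaps as before. Note that the set of seen URLs grows with the number of distinct pages, so for very large sites this trades some memory for de-duplication.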
Advanced Features
CLI Client: Use the usp ls tool to work with sitemaps from the command line
Serialisation: Export raw data or save to disk and load later
Local Parsing: Use USP’s sitemap parsers on sitemaps which have already been downloaded
Custom web clients: Instead of the default client built on requests, you can use your own web client by implementing the AbstractWebClient interface.