usp.tree

Helpers to generate a sitemap tree.

usp.tree.sitemap_tree_for_homepage(homepage_url: str, web_client: AbstractWebClient | None = None, use_robots: bool = True, use_known_paths: bool = True, extra_known_paths: set | None = None) → AbstractSitemap

Using a homepage URL, fetch the tree of sitemaps and pages listed in them.

Parameters:
  • homepage_url – Homepage URL of a website to fetch the sitemap tree for, e.g. “http://www.example.com/”.

  • web_client – Custom web client implementation to use when fetching sitemaps. If None, a RequestsWebClient will be used.

  • use_robots – Whether to discover sitemaps through robots.txt.

  • use_known_paths – Whether to discover sitemaps through common known paths.

  • extra_known_paths – Extra paths to check for sitemaps.

Returns:

Root sitemap object of the fetched sitemap tree.

usp.tree._UNPUBLISHED_SITEMAP_PATHS = {'.sitemap.xml', 'admin/config/search/xmlsitemap', 'sitemap', 'sitemap-index.xml', 'sitemap-index.xml.gz', 'sitemap-news.xml', 'sitemap-news.xml.gz', 'sitemap.xml', 'sitemap.xml.gz', 'sitemap/sitemap-index.xml', 'sitemap_index.xml', 'sitemap_index.xml.gz', 'sitemap_news.xml', 'sitemap_news.xml.gz'}

Paths which are not exposed in robots.txt but might still contain a sitemap.

usp.tree.sitemap_from_str(content: str) → AbstractSitemap

Parse sitemap from a string.

Returns the parsed sitemap. Any sub-sitemaps it references are returned as InvalidSitemap.

Parameters:

content – Sitemap string to parse.

Returns:

Parsed sitemap.
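A short sketch of parsing an in-memory sitemap, under stated assumptions: SITEMAP_XML and page_urls_from_string are hypothetical names, the XML is a made-up two-page sitemap in the sitemaps.org format, and all_pages() is assumed to iterate over the pages of the returned sitemap object.

```python
# Hypothetical sample document: a two-page XML sitemap.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.org/</loc></url>
  <url><loc>https://www.example.org/about</loc></url>
</urlset>
"""


def page_urls_from_string(content: str) -> list[str]:
    """Parse a sitemap held in a string and return the page URLs it lists."""
    # Imported here so this sketch can be loaded without the package installed.
    from usp.tree import sitemap_from_str

    sitemap = sitemap_from_str(content)
    # Assumption: all_pages() yields page objects that expose a .url attribute.
    return [page.url for page in sitemap.all_pages()]
```

This is useful when the sitemap content is already available, for example from a file on disk or a test fixture.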