usp.tree
Helpers to generate a sitemap tree.
- usp.tree.sitemap_tree_for_homepage(homepage_url: str, web_client: AbstractWebClient | None = None, use_robots: bool = True, use_known_paths: bool = True) → AbstractSitemap
Using a homepage URL, fetch the tree of sitemaps and pages listed in them.
- Parameters:
homepage_url – Homepage URL of a website to fetch the sitemap tree for, e.g. “http://www.example.com/”.
web_client – Custom web client implementation to use when fetching sitemaps. If None, a RequestsWebClient will be used.
use_robots – Whether to discover sitemaps through robots.txt.
use_known_paths – Whether to discover sitemaps through common known paths.
- Returns:
Root sitemap object of the fetched sitemap tree.
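A minimal usage sketch of the call described above. The helper name `list_page_urls` is hypothetical, and `all_pages()` and `page.url` come from the library's sitemap and page objects rather than from this section, so treat them as assumptions to check against the object reference. The call performs network requests.

```python
from typing import List


def list_page_urls(homepage_url: str, use_known_paths: bool = True) -> List[str]:
    """Hypothetical helper: collect the URLs of every page in a site's sitemap tree."""
    # Imported inside the function so this module still loads where usp is absent.
    from usp.tree import sitemap_tree_for_homepage

    # web_client is left as None, so the default RequestsWebClient is used.
    tree = sitemap_tree_for_homepage(
        homepage_url,
        use_robots=True,
        use_known_paths=use_known_paths,
    )
    # Assumption: all_pages() yields page objects exposing a .url attribute.
    return [page.url for page in tree.all_pages()]


if __name__ == "__main__":
    # Example homepage taken from the parameter description above.
    for url in list_page_urls("http://www.example.com/"):
        print(url)
```

Setting `use_known_paths=False` limits discovery to sitemaps found through robots.txt, which may reduce the number of requests on sites that declare their sitemaps there.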
- usp.tree._UNPUBLISHED_SITEMAP_PATHS = {'.sitemap.xml', 'admin/config/search/xmlsitemap', 'sitemap', 'sitemap-index.xml', 'sitemap-index.xml.gz', 'sitemap-news.xml', 'sitemap-news.xml.gz', 'sitemap.xml', 'sitemap.xml.gz', 'sitemap/sitemap-index.xml', 'sitemap_index.xml', 'sitemap_index.xml.gz', 'sitemap_news.xml', 'sitemap_news.xml.gz'}
Paths which are not exposed in robots.txt but might still contain a sitemap.
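To illustrate what probing known paths could look like, the sketch below joins a few of the paths listed above onto a homepage URL using only the standard library. The function `candidate_sitemap_urls` and the constant `SOME_KNOWN_PATHS` are hypothetical, and this is not necessarily how the library builds its URLs internally; it also assumes the homepage URL points at the site root.

```python
from typing import Iterable, List
from urllib.parse import urljoin

# A small subset of the paths listed above, copied for illustration only.
SOME_KNOWN_PATHS = {"sitemap.xml", "sitemap_index.xml", "sitemap/sitemap-index.xml"}


def candidate_sitemap_urls(homepage_url: str, paths: Iterable[str]) -> List[str]:
    """Hypothetical helper: build candidate sitemap URLs under a site root."""
    # Ensure a trailing slash so urljoin appends to the root instead of replacing a segment.
    base = homepage_url if homepage_url.endswith("/") else homepage_url + "/"
    # Sort for a deterministic order, since the input may be a set.
    return sorted(urljoin(base, path) for path in paths)


if __name__ == "__main__":
    for url in candidate_sitemap_urls("http://www.example.com", SOME_KNOWN_PATHS):
        print(url)
```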
- usp.tree.sitemap_from_str(content: str) → AbstractSitemap
Parse a sitemap from a string.
Returns the parsed sitemap; any sub-sitemaps it references are returned as InvalidSitemap.
- Parameters:
content – Sitemap string to parse
- Returns:
Parsed sitemap
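A short sketch of parsing an in-memory sitemap with `sitemap_from_str`. The sample XML and the helper name `page_urls_from_string` are made up for illustration, and `all_pages()` and `page.url` are assumed from the library's sitemap and page objects rather than documented in this section.

```python
from typing import List

# Illustrative sitemap content; the URLs are placeholders.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/about.html</loc></url>
  <url><loc>http://www.example.com/contact.html</loc></url>
</urlset>
"""


def page_urls_from_string(content: str) -> List[str]:
    """Hypothetical helper: parse sitemap text and return the page URLs it lists."""
    # Imported inside the function so this module still loads where usp is absent.
    from usp.tree import sitemap_from_str

    sitemap = sitemap_from_str(content)
    # Assumption: all_pages() yields page objects exposing a .url attribute.
    return [page.url for page in sitemap.all_pages()]


if __name__ == "__main__":
    for url in page_urls_from_string(SAMPLE_SITEMAP):
        print(url)
```

This can be handy when the sitemap text has already been obtained by other means, for example from a local file or a cached response.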