Fetch and Parse Process

When sitemap_tree_for_homepage() is called, USP tries several methods to find sitemaps and recurses through any sub-sitemaps it discovers.

Broadly, the process is as follows:

Attempt to fetch https://example.org/robots.txt and parse it for Sitemap: statements. robots.txt is treated as an index-type sitemap, as it lists other sitemaps.
Fetch and parse each discovered sitemap URL. If a sitemap is an index-type sitemap, recurse into it.
Try to fetch known sitemap locations like /sitemap.xml and /sitemap_index.xml, excluding those already declared in robots.txt.
Create a top-level dummy sitemap to act as the parent of robots.txt and discovered sitemaps.
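
The first step above can be illustrated with a minimal sketch of pulling Sitemap: statements out of a robots.txt body. This is not USP's own parser: the function name sitemap_urls_from_robots is hypothetical, and the handling of comments and repeated declarations is an assumption made for the example.

```python
from typing import List


def sitemap_urls_from_robots(robots_txt: str) -> List[str]:
    """Return URLs declared in 'Sitemap:' lines, in order, without repeats.

    Hypothetical helper for illustration only; not part of USP.
    """
    urls: List[str] = []
    for line in robots_txt.splitlines():
        # Assumption: anything after '#' is a comment and can be dropped.
        line = line.split('#', 1)[0].strip()
        # Split on the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(':')
        # Match the field name case-insensitively.
        if sep and key.strip().lower() == 'sitemap':
            url = value.strip()
            if url and url not in urls:
                urls.append(url)
    return urls
```

Each URL returned this way would then be fetched and parsed as in the second step, with index-type sitemaps recursed into.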

Tree Construction

Tree Filtering

To avoid fetching unwanted parts of the sitemap tree, pass callback functions to sitemap_tree_for_homepage() that filter which sub-sitemaps are retrieved.

If a recurse_callback is passed, it will be called with the sub-sitemap URLs one at a time and should return True to fetch or False to skip.

For example, on a multi-lingual site where the language is specified in the URL path, to filter to a specific language:

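from typing import Set

from usp.tree import sitemap_tree_for_homepage

def filter_callback(url: str, recursion_level: int, parent_urls: Set[str]) -> bool:
    # Only recurse into sub-sitemaps whose URL contains the English path segment.
    return '/en/' in url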

tree = sitemap_tree_for_homepage(
    'https://www.example.org/',
    recurse_callback=filter_callback,
)

If recurse_list_callback is passed, it will be called with the list of sub-sitemap URLs in an index sitemap and should return a filtered list of URLs to fetch.

For example, to only fetch sub-sitemaps if the index sitemap contains both a “blog” and “products” sub-sitemap:

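from typing import List, Set

from usp.tree import sitemap_tree_for_homepage

def filter_list_callback(urls: List[str], recursion_level: int, parent_urls: Set[str]) -> List[str]:
    # Fetch everything only if the index lists both a blog and a products sub-sitemap.
    if any('blog' in url for url in urls) and any('products' in url for url in urls):
        return urls
    # Otherwise skip every sub-sitemap of this index.
    return []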

tree = sitemap_tree_for_homepage(
    'https://www.example.org/',
    recurse_list_callback=filter_list_callback,
)

A callback that is not supplied performs no filtering, so if neither is supplied the default behaviour is to fetch all sub-sitemaps.

Note

Both callbacks can be used together, and are applied in the order recurse_list_callback then recurse_callback. Therefore if a sub-sitemap URL is filtered out by recurse_list_callback, it will not be fetched even if recurse_callback would return True.
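
The ordering described in the note can be sketched as follows. This is an illustration of the documented behaviour rather than USP's internal code: select_sub_sitemaps and the two type aliases are hypothetical names, and only the callback signatures are taken from the examples above.

```python
from typing import Callable, List, Optional, Set

# Callback shapes as used in the examples above.
RecurseCallback = Callable[[str, int, Set[str]], bool]
RecurseListCallback = Callable[[List[str], int, Set[str]], List[str]]


def select_sub_sitemaps(
    urls: List[str],
    recursion_level: int,
    parent_urls: Set[str],
    recurse_list_callback: Optional[RecurseListCallback] = None,
    recurse_callback: Optional[RecurseCallback] = None,
) -> List[str]:
    """Hypothetical helper: decide which sub-sitemap URLs would be fetched."""
    # Step 1: the list callback sees the whole index at once.
    if recurse_list_callback is not None:
        urls = recurse_list_callback(urls, recursion_level, parent_urls)
    # Step 2: the per-URL callback only sees what survived step 1.
    if recurse_callback is not None:
        urls = [u for u in urls if recurse_callback(u, recursion_level, parent_urls)]
    return urls
```

Because the per-URL callback runs second, it can only narrow the list further; it cannot restore a URL that the list callback has already removed.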

Deduplication

During the parse process, some deduplication is performed within each individual sitemap. In an index sitemap, only the first declaration of a sub-sitemap is fetched. In a page sitemap, only the first declaration of a page is included.

However, because deduplication is scoped to each sitemap, a sub-sitemap declared in multiple index sitemaps, or a page declared in multiple page sitemaps, will be included multiple times.
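
If duplicates across sitemaps matter for a given use case, they can be removed after the tree has been built. The sketch below is one possible approach, not a USP feature: unique_by_url is a hypothetical helper that works on any iterable of objects exposing a url attribute, and it keeps the first occurrence of each URL.

```python
from typing import Any, Iterable, Iterator, Set


def unique_by_url(pages: Iterable[Any]) -> Iterator[Any]:
    """Yield each page the first time its URL is seen (hypothetical helper)."""
    seen: Set[str] = set()
    for page in pages:
        # Assumption: every item exposes its address as a 'url' attribute.
        url = page.url
        if url not in seen:
            seen.add(url)
            yield page
```

Assuming the tree returned by sitemap_tree_for_homepage() exposes its pages through all_pages(), this could be applied as unique_by_url(tree.all_pages()).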

Recursion is detected in the following cases, each of which results in the sitemap being returned as an InvalidSitemap:

A sitemap’s URL is identical to any of its ancestor sitemaps’ URLs.
When fetched, a sitemap redirects to a URL that is identical to any of its ancestor sitemaps’ URLs.
When fetching known sitemap locations, a sitemap redirects to a sitemap already parsed from robots.txt.
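
The first two cases amount to a membership check against the set of ancestor URLs, as the following sketch shows. It is an illustration under stated assumptions, not USP's implementation: recursion_reason is a hypothetical function, and final_url stands for the URL reached after following any redirects.

```python
from typing import Optional, Set


def recursion_reason(url: str, final_url: str, ancestor_urls: Set[str]) -> Optional[str]:
    """Return a description of the recursion found, or None if there is none.

    Hypothetical helper for illustration only; not part of USP.
    """
    # Case 1: the sitemap's own URL matches one of its ancestors.
    if url in ancestor_urls:
        return f"{url} is one of its own ancestors"
    # Case 2: the fetch was redirected to an ancestor's URL.
    if final_url != url and final_url in ancestor_urls:
        return f"{url} redirects to ancestor {final_url}"
    return None
```

The third case could be handled in the same way, by checking final_url against the set of sitemap URLs already parsed from robots.txt.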