Sitemap Tree

Calling sitemap_tree_for_homepage() will return the root node of a tree representing the structure of the sitemaps found on a website.

Index vs Page Sitemaps

A small site may just have a single sitemap hosted at /sitemap.xml, but larger sites often use a more complex structure. The sitemap protocol limits each sitemap to 50,000 URLs and 50MB (uncompressed), so large sites have to split their sitemaps across several files. It’s also common to split sitemaps semantically, such as by language or content type.

Sitemaps are divided into two types:

  • Index sitemaps list other sitemaps, which may themselves be index sitemaps or page sitemaps

  • Page sitemaps list pages

On a more complex site, in order to find all pages, you would have to fetch the index sitemaps (potentially several levels deep) and then fetch the page sitemaps they reference.
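The recursive lookup described above can be sketched with an in-memory stand-in for a site. The SITE dictionary, its paths, and collect_pages() are hypothetical names made up for this illustration; a real crawler would fetch and parse each sitemap over HTTP instead of reading a dictionary.

```python
# Hypothetical in-memory site: each sitemap path maps to its type and entries.
SITE = {
    "/sitemap_index.xml": ("index", ["/sitemap_en.xml", "/sitemap_news_index.xml"]),
    "/sitemap_news_index.xml": ("index", ["/sitemap_news_1.xml"]),
    "/sitemap_en.xml": ("pages", ["/en/", "/en/about"]),
    "/sitemap_news_1.xml": ("pages", ["/news/1", "/news/2"]),
}


def collect_pages(sitemap_path, site=SITE):
    """Return every page reachable from sitemap_path, depth first."""
    kind, entries = site[sitemap_path]
    if kind == "pages":
        # Page sitemap: its entries are the pages themselves.
        return list(entries)
    # Index sitemap: its entries are other sitemaps, so recurse into each.
    pages = []
    for child_path in entries:
        pages.extend(collect_pages(child_path, site))
    return pages


print(collect_pages("/sitemap_index.xml"))
# ['/en/', '/en/about', '/news/1', '/news/2']
```

Here the news pages are only reachable by following two levels of index sitemaps, which is the situation the paragraph above describes.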

Basic Examples

A small site with a single sitemap located at /sitemap.xml would look like this:

Note

In diagrams like these, square boxes represent index sitemaps and rounded boxes represent page sitemaps. In reality, each page-type sitemap will have a list of pages as its children, but these are omitted for brevity.

In this case, the sitemap was discovered because it was at a well-known URL. USP has a built-in list (usp.tree._UNPUBLISHED_SITEMAP_PATHS) of common sitemap locations to check.

Additionally, USP checks the site’s robots.txt file for a sitemap directive. Had the sitemap been declared in robots.txt instead, the tree would look like this:

The sitemap is now a child of the robots.txt file (which we treat as a type of index sitemap) because it’s queried first, and well-known URLs are skipped if they’ve already been retrieved through robots.txt.
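The discovery order described above can be illustrated with a minimal sketch. It is not USP's implementation: WELL_KNOWN_PATHS is a made-up two-entry subset, and sitemaps_from_robots_txt(), discover(), and the url_exists callback are hypothetical helpers that stand in for real HTTP requests.

```python
from urllib.parse import urljoin

# Hypothetical subset of well-known locations, for illustration only;
# the list USP actually checks is usp.tree._UNPUBLISHED_SITEMAP_PATHS.
WELL_KNOWN_PATHS = ["sitemap.xml", "sitemap_index.xml"]


def sitemaps_from_robots_txt(robots_txt):
    """Extract the URLs of all 'Sitemap:' directives from robots.txt text."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons are kept.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def discover(homepage, robots_txt, url_exists):
    """Return (URLs declared in robots.txt, extra URLs at well-known paths)."""
    from_robots = sitemaps_from_robots_txt(robots_txt)
    already_found = set(from_robots)
    from_well_known = []
    for path in WELL_KNOWN_PATHS:
        url = urljoin(homepage, path)
        # Skip well-known URLs that robots.txt has already led us to.
        if url not in already_found and url_exists(url):
            from_well_known.append(url)
    return from_robots, from_well_known


robots_txt = "User-agent: *\nSitemap: https://example.org/sitemap.xml\n"
served = {"https://example.org/sitemap.xml", "https://example.org/sitemap_index.xml"}
print(discover("https://example.org/", robots_txt, served.__contains__))
```

In this example /sitemap.xml is reported only once, under robots.txt, while /sitemap_index.xml is picked up separately as a well-known URL.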

Finally, in this third example, the site has sitemaps listed in robots.txt and some additional sitemaps at well-known URLs:

Here, sitemap_news.xml is an example of an XML index sitemap, which contains no pages itself and only points to three sub-sitemaps. This example also shows why the root node is needed: it combines the sitemaps found through robots.txt with those found at well-known URLs.

Sitemap trees will always have an IndexWebsiteSitemap at the root, and will usually consist of IndexXMLSitemap and PagesXMLSitemap (either directly or through an IndexRobotsTxtSitemap), but other sitemap types are possible. Regardless, all sitemap classes implement the same interface (AbstractIndexSitemap or AbstractPagesSitemap, which both inherit from AbstractSitemap), so the actual type of sitemap is not important for most use cases.

Real-World Example

Large and well-established sites (e.g. media outlets) may have very complex sitemap hierarchies, because of the volume of content and the way the site's technology has changed over time. For example, this is the sitemap hierarchy for the BBC website:

[Figure: bbc.co.uk Sitemap Graph. The root node (/) has /robots.txt as its only child. robots.txt declares 13 sitemaps, among them index sitemaps such as /sitemaps/https-index-uk-archive.xml, which lists 50 page sitemaps. Each page sitemap is labelled with its page count.]

Altogether, this sitemap tree contains 2.6 million URLs spread across 75 sitemaps. The robots.txt file declares 13 sitemaps, some of which are index sitemaps with as many as 50 page sitemaps. Despite this, USP parses the tree in less than a minute, using no more than 90MiB of memory at peak.
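As a quick check of these totals, the page counts shown in the bbc.co.uk graph can be added up. The two dictionaries below are simply a transcription of that graph, grouped for this sketch into index sitemaps (with the page counts of their children) and page sitemaps declared directly in robots.txt.

```python
# Index sitemaps declared in robots.txt, with the page count of each child.
index_sitemaps = {
    "/sitemap.xml": [43, 17752, 12, 204, 0, 11134],
    "/sitemaps/https-index-uk-archive.xml": [50000] * 49 + [20973],
    "/sitemaps/https-index-uk-news.xml": [881, 213],
    "/bitesize/sitemap/sitemapindex.xml": [50000, 9825],
    "/sitemaps/https-index-uk-archive_video.xml": [22448],
    "/sitemaps/https-index-uk-video.xml": [63],
}

# Page sitemaps declared directly in robots.txt, with their page counts.
page_sitemaps = {
    "/food/sitemap.xml": 21782,
    "/teach/sitemap/sitemapindex.xml": 6597,
    "/sitemaps/sitemap-uk-ws-topics.xml": 20259,
    "/sport/sitemap.xml": 43,
    "/sitemaps/sitemap-uk-topics.xml": 1094,
    "/ideas/sitemap.xml": 867,
    "/tiny-happy-people/sitemap/sitemapindex.xml": 1181,
}

declared_in_robots = len(index_sitemaps) + len(page_sitemaps)
child_sitemaps = sum(len(counts) for counts in index_sitemaps.values())
total_sitemaps = declared_in_robots + child_sitemaps
total_pages = (
    sum(sum(counts) for counts in index_sitemaps.values())
    + sum(page_sitemaps.values())
)

print(declared_in_robots)  # 13
print(total_sitemaps)      # 75
print(total_pages)         # 2635371
```

The count of 75 excludes the root node and robots.txt itself, and the page total includes the pages of /sport/sitemap.xml twice because that sitemap appears twice in the tree.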

Note also that there is some duplication in this tree. The sitemap /sport/sitemap.xml is declared both directly in robots.txt and in the index sitemap /sitemap.xml. As these declarations are in different sitemap files, both are included in the tree. Likewise, the pages declared in the /sport/sitemap.xml file are included in the tree twice. See the section on Deduplication for details.
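To make the effect of this duplication concrete, here is a small sketch in which one page sitemap is declared in two places. The nested-dictionary tree, the page URLs, and iter_pages() are hypothetical, and the order-preserving de-duplication shown is just one generic approach, not necessarily the one the Deduplication section recommends.

```python
# Hypothetical tree: /sport/sitemap.xml is declared both in robots.txt and in
# the index sitemap /sitemap.xml. Dicts are index sitemaps, lists are pages.
sport_pages = ["/sport/football", "/sport/tennis"]
tree = {
    "/robots.txt": {
        "/sitemap.xml": {"/sport/sitemap.xml": sport_pages},
        "/sport/sitemap.xml": sport_pages,
    }
}


def iter_pages(node):
    """Yield every page below node, including repeats."""
    if isinstance(node, list):
        yield from node
    else:
        for child in node.values():
            yield from iter_pages(child)


all_pages = list(iter_pages(tree))
# dict.fromkeys keeps the first occurrence of each URL, in order.
unique_pages = list(dict.fromkeys(all_pages))

print(len(all_pages))     # 4
print(len(unique_pages))  # 2
```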

Traversal

To traverse the sitemaps and pages in the tree, AbstractSitemap declares an interface to access the immediate children of a sitemap node through properties, or all descendants through methods.

These methods and properties are always implemented, returning or yielding empty lists where not applicable (e.g. accessing sub-sitemaps on a page sitemap, or either sub-sitemaps or pages on an invalid sitemap), meaning they can be called without checking the type of the sitemap.
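The shape of such an interface can be sketched with a small stand-in class. This is not USP's code: the Sitemap class is hypothetical, plain attributes stand in for the properties, and the member names sub_sitemaps, pages, all_sitemaps() and all_pages() are assumptions of this sketch that should be checked against the AbstractSitemap reference.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Sitemap:
    """Hypothetical stand-in for a sitemap node; not USP's actual class."""

    url: str
    sub_sitemaps: List["Sitemap"] = field(default_factory=list)  # immediate children
    pages: List[str] = field(default_factory=list)  # pages listed in this sitemap

    def all_sitemaps(self) -> Iterator["Sitemap"]:
        """Yield every descendant sitemap, depth first."""
        for child in self.sub_sitemaps:
            yield child
            yield from child.all_sitemaps()

    def all_pages(self) -> Iterator[str]:
        """Yield this node's pages, then the pages of every descendant."""
        yield from self.pages
        for child in self.sub_sitemaps:
            yield from child.all_pages()


news = Sitemap("/sitemap_news_1.xml", pages=["/news/1", "/news/2"])
index = Sitemap("/sitemap_news.xml", sub_sitemaps=[news])
root = Sitemap("/", sub_sitemaps=[index])

print([s.url for s in root.all_sitemaps()])
print(list(root.all_pages()))
print(list(news.all_sitemaps()))  # empty: a page sitemap has no sub-sitemaps
```

Because every node answers both questions, the caller can walk the tree from any node without first checking whether it is an index or a page sitemap.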

For sub-sitemaps:

For pages: