Sitemap Tree
Calling sitemap_tree_for_homepage() will return the root node of a tree representing the structure of the sitemaps found on a website.
Index vs Page Sitemaps
A small site may just have a single sitemap hosted at /sitemap.xml, but larger sites often use a more complex structure. Under the Sitemaps protocol, each sitemap is limited to 50,000 URLs or 50MB (uncompressed), so large sites have to split their sitemaps across several files. It’s also common to split sitemaps semantically, such as by language or content type.
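As a rough illustration of why large sites end up with several sitemaps, the sketch below splits a list of URLs into chunks that respect the 50,000-URL limit. The function name, the example.org URLs and the page count are made up for this example; it does not use USP.

```python
from math import ceil

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit mentioned above


def split_into_sitemaps(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive chunks of at most `limit` URLs."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


# Hypothetical site with 120,000 pages.
urls = [f"https://example.org/page/{n}" for n in range(120_000)]
chunks = split_into_sitemaps(urls)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [50000, 50000, 20000]
```

Each chunk would become one page sitemap, and an index sitemap would then list the three files.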
Sitemaps are divided into two types: index sitemaps, which point to other sitemaps, and page sitemaps, which list the pages themselves.
On a more complex site, in order to find all pages, you would have to fetch the index sitemaps (potentially several levels deep) and then fetch the page sitemaps they reference.
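The manual process described above can be sketched as a recursive walk. This is a minimal illustration only: an in-memory dictionary (FAKE_SITE, a made-up structure) stands in for fetching and parsing real sitemap files, and collect_pages is a hypothetical helper, not part of USP.

```python
# Hypothetical site: index sitemaps list sub-sitemap URLs,
# page sitemaps list page URLs.
FAKE_SITE = {
    "/sitemap.xml": {"type": "index", "children": ["/sitemap_news.xml", "/sitemap_pages.xml"]},
    "/sitemap_news.xml": {"type": "index", "children": ["/news_1.xml", "/news_2.xml"]},
    "/news_1.xml": {"type": "pages", "children": ["/news/a", "/news/b"]},
    "/news_2.xml": {"type": "pages", "children": ["/news/c"]},
    "/sitemap_pages.xml": {"type": "pages", "children": ["/about", "/contact"]},
}


def collect_pages(sitemap_url, site=FAKE_SITE):
    """Follow index sitemaps to any depth and gather the pages they lead to."""
    entry = site[sitemap_url]
    if entry["type"] == "pages":
        return list(entry["children"])
    pages = []
    for child_url in entry["children"]:
        pages.extend(collect_pages(child_url, site))
    return pages


all_found = collect_pages("/sitemap.xml")
print(all_found)  # ['/news/a', '/news/b', '/news/c', '/about', '/contact']
```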
Basic Examples
A small site with a single sitemap located at /sitemap.xml would look like this:
Note
In diagrams like these, square boxes represent index sitemaps and rounded boxes represent page sitemaps. In reality, each page-type sitemap will have a list of pages as its children, but these are omitted for brevity.
Nodes are clickable to access the documentation for that class.
In this case, the sitemap was discovered because it was at a well-known URL. USP has a built-in list (usp.tree._UNPUBLISHED_SITEMAP_PATHS) of common sitemap locations to check.
Additionally, USP checks the site’s robots.txt file for a sitemap directive. Had the sitemap been declared in robots.txt instead, the tree would look like this:
The sitemap is now a child of the robots.txt file (which we treat as a type of index sitemap) because it’s queried first, and well-known URLs are skipped if they’ve already been retrieved through robots.txt.
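The discovery order described here can be sketched roughly as follows. This is not USP's implementation: WELL_KNOWN_PATHS is a hypothetical two-entry stand-in for the built-in list, discover_sitemaps is an illustrative helper, and the set of paths that exist on the site is passed in directly instead of being probed over HTTP.

```python
# Hypothetical stand-in for the built-in list of well-known sitemap paths.
WELL_KNOWN_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]


def discover_sitemaps(robots_txt_sitemaps, existing_paths):
    """Return (source, path) pairs: robots.txt entries first, then any
    well-known paths that exist and were not already found via robots.txt."""
    found = [("robots.txt", path) for path in robots_txt_sitemaps]
    seen = set(robots_txt_sitemaps)
    for path in WELL_KNOWN_PATHS:
        if path in existing_paths and path not in seen:
            found.append(("well-known", path))
            seen.add(path)
    return found


# /sitemap.xml is declared in robots.txt, so it is not reported a second time.
result = discover_sitemaps(
    robots_txt_sitemaps=["/sitemap.xml"],
    existing_paths={"/sitemap.xml", "/sitemap_index.xml"},
)
print(result)  # [('robots.txt', '/sitemap.xml'), ('well-known', '/sitemap_index.xml')]
```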
Finally, in this third example, the site has sitemaps listed in robots.txt and some additional sitemaps at well-known URLs:
Here, sitemap_news.xml is an example of an XML index sitemap, which contains no pages itself, but just points to 3 sub-sitemaps. It should also be clearer from this example why it’s necessary to add the root node to combine the sitemaps found from robots.txt and well-known URLs.
Sitemap trees will always have an IndexWebsiteSitemap at the root, and will usually consist of IndexXMLSitemap and PagesXMLSitemap (either directly or through an IndexRobotsTxtSitemap), but other sitemap types are possible. Regardless, all sitemap classes implement the same interface (AbstractIndexSitemap or AbstractPagesSitemap, which both inherit from AbstractSitemap), so the actual type of sitemap is not important for most use cases.
Real-World Example
Large and well-established sites (e.g. media outlets) may have very complex sitemap hierarchies, due to the amount of content and changing technologies for the site. For example, this is the sitemap hierarchy for the BBC website:
[Diagram: BBC sitemap hierarchy. The root / has a single child, /robots.txt, which declares 13 sitemaps. Of these, /sitemap.xml indexes 6 page sitemaps, /sitemaps/https-index-uk-archive.xml indexes 50 page sitemaps (49 with 50000 pages each and one with 20973), /sitemaps/https-index-uk-news.xml and /bitesize/sitemap/sitemapindex.xml index 2 each, and /sitemaps/https-index-uk-archive_video.xml and /sitemaps/https-index-uk-video.xml index 1 each. /sport/sitemap.xml (43 pages) appears both directly under /robots.txt and under /sitemap.xml.]
Altogether, this sitemap tree contains 2.6 million URLs spread across 75 sitemaps. The robots.txt file declares 13 sitemaps, some of which are index sitemaps with as many as 50 page sitemaps. Despite this, USP parses the tree in less than a minute, using no more than 90MiB of memory at peak.
Note also that there is some duplication in this tree. The sitemap /sport/sitemap.xml is declared both directly in robots.txt and in the index sitemap /sitemap.xml. As these declarations are in different sitemap files, they are both included in the tree. Likewise, the pages declared in the /sport/sitemap.xml file are included in the tree twice. See the section on Deduplication for details.
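To make the effect of this duplication concrete, the sketch below shows one simple way a caller might collapse repeated page URLs after traversing such a tree. The URLs are invented for illustration, and this order-preserving approach is just an example, not necessarily what the Deduplication section recommends.

```python
# Hypothetical page URLs as they might be yielded when the same page sitemap
# is reachable through two different parents.
pages_from_tree = [
    "https://example.org/sport/football",  # via robots.txt -> /sport/sitemap.xml
    "https://example.org/sport/tennis",
    "https://example.org/news/today",
    "https://example.org/sport/football",  # via /sitemap.xml -> /sport/sitemap.xml
    "https://example.org/sport/tennis",
]

# dict.fromkeys keeps the first occurrence of each URL and preserves order.
unique_pages = list(dict.fromkeys(pages_from_tree))
print(len(pages_from_tree), len(unique_pages))  # 5 3
```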
Traversal
To traverse the sitemaps and pages in the tree, AbstractSitemap declares an interface to access the immediate children of a sitemap node through properties, or all descendants through methods.
These methods and properties are always implemented, returning or yielding empty lists where not applicable (e.g. accessing sub-sitemaps on a page sitemap, or either sub-sitemaps or pages on an invalid sitemap), meaning they can be called without checking the type of the sitemap.
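The shape of such an interface can be sketched with a small stand-in class. SketchSitemap and its member names (sub_sitemaps, pages, all_sitemaps, all_pages) are chosen for this illustration and are not taken from the text above; check the AbstractSitemap reference for the actual names. The point is that every node answers the same calls, so no type checks are needed.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class SketchSitemap:
    """Hypothetical stand-in for a sitemap node; not the library's class."""

    url: str
    sub_sitemaps: List["SketchSitemap"] = field(default_factory=list)  # immediate children
    pages: List[str] = field(default_factory=list)  # pages listed directly in this sitemap

    def all_sitemaps(self) -> Iterator["SketchSitemap"]:
        """Yield every descendant sitemap, depth-first."""
        for child in self.sub_sitemaps:
            yield child
            yield from child.all_sitemaps()

    def all_pages(self) -> Iterator[str]:
        """Yield this sitemap's pages, then those of all descendants."""
        yield from self.pages
        for child in self.sub_sitemaps:
            yield from child.all_pages()


page_sitemap = SketchSitemap("/sitemap.xml", pages=["/about", "/contact"])
robots = SketchSitemap("/robots.txt", sub_sitemaps=[page_sitemap])
root = SketchSitemap("/", sub_sitemaps=[robots])

sitemap_urls = [s.url for s in root.all_sitemaps()]
page_urls = list(root.all_pages())
print(sitemap_urls)               # ['/robots.txt', '/sitemap.xml']
print(page_urls)                  # ['/about', '/contact']
print(page_sitemap.sub_sitemaps)  # [] (empty, no type check needed)
```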
For sub-sitemaps:
For pages: