Changelog
v1.0.0 (2025-01-13)
Ultimate Sitemap Parser is now maintained by the GATE Team at the School of Computer Science, University of Sheffield. We’d like to thank Linas Valiukas and Hal Roberts for their work on this package, and Paige Gulley for coordinating the transfer of the library.
Breaking Changes
Python 3.8 is now the minimum supported version. Future releases will follow Python’s own version support schedule.
New Features
CLI tool to parse and list sitemaps on the command line (see CLI Reference)
All sitemap objects now implement a consistent interface, allowing traversal of the tree irrespective of type:
    All sitemaps now have pages and sub_sitemaps properties, returning their children of that type, or an empty list where not applicable
    Added all_sitemaps() method to iterate over all descendant sitemaps
Pickling page sitemaps now includes page data, which previously was not included as it was swapped to disk
Sitemaps and pages now implement to_dict() method to convert to dictionaries (requested in #18)
Added optional arguments to usp.tree.sitemap_tree_for_homepage() to disable robots.txt-based or known-path-based sitemap discovery. Default behaviour is still to use both.
Parse sitemaps from a string with Local Parsing (requested in #26)
Support for the Google Image sitemap extension
Add proxy support with RequestsWebClient.set_proxies() (#20 by @tgrandje)
Add additional sitemap discovery paths for news sitemaps (d3bdaae)
Add parameter to RequestsWebClient.__init__() to disable certificate verification (#37 by @japherwocky)
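The consistent interface listed above can be pictured with a small sketch. The classes below (Sitemap, PageSitemap, IndexSitemap) are hypothetical stand-ins written for illustration, not the library's own classes. Only the member names pages, sub_sitemaps, all_sitemaps() and to_dict() come from this changelog. Pages are reduced to plain URL strings, and the depth-first ordering of all_sitemaps() is an assumption.

```python
from typing import Any, Dict, Iterable, Iterator, List


class Sitemap:
    """Hypothetical base class: every sitemap type exposes the same interface."""

    def __init__(self, url: str) -> None:
        self.url = url

    @property
    def pages(self) -> List[str]:
        # Not applicable by default, so return an empty list.
        return []

    @property
    def sub_sitemaps(self) -> List["Sitemap"]:
        # Not applicable by default, so return an empty list.
        return []

    def all_sitemaps(self) -> Iterator["Sitemap"]:
        # Depth-first walk over every descendant (ordering is an assumption).
        for child in self.sub_sitemaps:
            yield child
            yield from child.all_sitemaps()

    def to_dict(self) -> Dict[str, Any]:
        return {
            "url": self.url,
            "pages": list(self.pages),
            "sub_sitemaps": [child.to_dict() for child in self.sub_sitemaps],
        }


class PageSitemap(Sitemap):
    """Hypothetical leaf sitemap that holds page URLs."""

    def __init__(self, url: str, pages: Iterable[str]) -> None:
        super().__init__(url)
        self._pages = list(pages)

    @property
    def pages(self) -> List[str]:
        return self._pages


class IndexSitemap(Sitemap):
    """Hypothetical index sitemap that holds other sitemaps."""

    def __init__(self, url: str, sub_sitemaps: Iterable[Sitemap]) -> None:
        super().__init__(url)
        self._sub_sitemaps = list(sub_sitemaps)

    @property
    def sub_sitemaps(self) -> List[Sitemap]:
        return self._sub_sitemaps


# Example tree built from made-up URLs.
tree = IndexSitemap("https://example.org/sitemap_index.xml", [
    PageSitemap("https://example.org/posts.xml", ["https://example.org/post-1"]),
    IndexSitemap("https://example.org/news_index.xml", [
        PageSitemap("https://example.org/news.xml", ["https://example.org/news-1"]),
    ]),
])

# The same loop works regardless of the concrete sitemap type.
for sitemap in tree.all_sitemaps():
    print(sitemap.url, len(sitemap.pages), len(sitemap.sub_sitemaps))
```

Because every node answers pages and sub_sitemaps, calling code never needs to check the type of a sitemap before traversing it.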
Performance
Improvement of parse performance by approximately 90%:
Optimised lookup of page URLs when checking if duplicate
Optimised datetime parse in XML Sitemaps by trying full ISO8601 parsers before the general parser
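The two optimisations above follow a common pattern, sketched here with the standard library only. This is an illustration of the general idea, not the library's code: the function names parse_sitemap_date and dedupe_urls are hypothetical, email.utils.parsedate_to_datetime stands in for whatever general-purpose parser the library falls back to, and the use of a set for duplicate checks is an assumption about how such a lookup is typically made fast.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Iterable, List, Optional


def parse_sitemap_date(value: str) -> Optional[datetime]:
    """Try a strict ISO 8601 parser first, then a slower general fallback."""
    value = value.strip()
    try:
        # Fast path: most sitemap dates are already ISO 8601.
        return datetime.fromisoformat(value)
    except ValueError:
        pass
    try:
        # Fallback (stand-in): RFC 2822 style dates.
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable input yields None rather than an exception.
        return None


def dedupe_urls(urls: Iterable[str]) -> List[str]:
    """Drop repeated URLs while keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        # Set membership is O(1) on average, versus O(n) for scanning a list.
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


print(parse_sitemap_date("2025-01-13T10:30:00+00:00"))
print(parse_sitemap_date("Mon, 13 Jan 2025 10:30:00 +0000"))
print(dedupe_urls(["https://example.org/a", "https://example.org/b", "https://example.org/a"]))
```

Trying the strict parser first pays off because the cheap path handles the common case, and the expensive parser only runs for the unusual inputs.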
Bug Fixes
Invalid datetimes will be parsed as None instead of crashing (reported in #22, #31)
Invalid priorities will be set to the default (0.5) instead of crashing
Moved __version__ attribute into main class module
Robots.txt index sitemaps now count for the max recursion depth (reported in #29). The default maximum has been increased by 1 to compensate for this.
Remove log configuration so it can be specified at application level (reported in #25, #24 by @dsoprea/@antonialoytorrens-ikaue)
Resolve warnings caused by http.HTTPStatus usage (3867b6e)
Don’t add InvalidSitemap object if robots.txt is not found (#39 by @gbenson)
Fix incorrect lowercasing of URLs discovered in robots.txt (reported in #40, #35 by @ArthurMelin)
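Two of the fixes above lend themselves to a short sketch: falling back to the default priority on bad input, and configuring logging in the application rather than in the library. The helper parse_priority is hypothetical and only mirrors the behaviour described here. The 0.0 to 1.0 range check and the logger name "usp" are assumptions, not taken from this changelog.

```python
import logging
from typing import Optional

DEFAULT_PRIORITY = 0.5  # default stated in the changelog entry above


def parse_priority(raw: Optional[str]) -> float:
    """Return the priority as a float, or the default if it is invalid."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # Missing or non-numeric input.
        return DEFAULT_PRIORITY
    if not 0.0 <= value <= 1.0:
        # Out-of-range values (and NaN) are treated as invalid (assumed range).
        return DEFAULT_PRIORITY
    return value


# Application-level logging: the application picks the handler, format and levels.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")

# Assumption: the library's loggers are named under "usp".
logging.getLogger("usp").setLevel(logging.WARNING)

print(parse_priority("0.8"))
print(parse_priority("high"))
```

Since the library no longer installs its own log configuration, calls like the two above are the only place where handlers and levels are decided.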
Prior versions
For versions prior to 1.0, no changelog is available. Use the release tags to compare versions: