usp.fetch_parse

Sitemap fetchers and parsers.

class usp.fetch_parse.SitemapFetcher

Fetches and parses the sitemap at a given URL, and any declared sub-sitemaps.

__init__(url: str, recursion_level: int, web_client: AbstractWebClient | None = None, parent_urls: Set[str] | None = None, quiet_404: bool = False)
Parameters:
  • url – URL of the sitemap to fetch and parse.

  • recursion_level – Current recursion level of the parser.

  • web_client – Web client to use. If None, a RequestsWebClient will be used.

  • parent_urls – Set of parent URLs that led to this sitemap.

  • quiet_404 – Whether 404 errors are expected and should be logged at a reduced level, useful for speculative fetching of known URLs.

Raises:
sitemap() AbstractSitemap

Fetch and parse the sitemap.

Returns:

The parsed sitemap, an instance of an AbstractSitemap subclass. If an HTTP error is encountered or the sitemap cannot be parsed, an InvalidSitemap is returned.
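
As a rough illustration of how recursion_level and parent_urls could work together, the sketch below guards against runaway or cyclic sub-sitemap fetching. The function name check_recursion, the MAX_RECURSION_LEVEL value, and the use of ValueError are assumptions made for this example and are not taken from the library.

```python
from typing import Optional, Set

# Hypothetical limit chosen for this sketch; the library may use another value.
MAX_RECURSION_LEVEL = 10


def check_recursion(url: str, recursion_level: int,
                    parent_urls: Optional[Set[str]] = None) -> Set[str]:
    """Validate a fetch request and return the parent set for its children."""
    parents = parent_urls or set()
    if recursion_level > MAX_RECURSION_LEVEL:
        raise ValueError(f"Recursion level exceeded for {url}")
    if url in parents:
        raise ValueError(f"Cycle detected: {url} is its own ancestor")
    # Any sub-sitemap fetched from here sees this URL as one of its parents.
    return parents | {url}


root_parents = check_recursion("https://example.com/sitemap.xml", 0)
child_parents = check_recursion(
    "https://example.com/sitemap-1.xml", 1, root_parents
)
```

Passing a fresh set to each child, rather than mutating a shared one, keeps sibling sitemaps from appearing as each other's ancestors.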

class usp.fetch_parse.SitemapStrParser

Bases: SitemapFetcher

Custom fetcher that parses a string instead of downloading from a URL.

This is a little bit hacky, but it allows us to support local content parsing without having to change too much.

__init__(static_content: str)

Initialize a new string parser.

Parameters:

static_content – String containing sitemap text to parse
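
The general pattern described above, a subclass that serves fixed content through the same interface as a downloading fetcher, might look like the following sketch. UrlFetcher, StrFetcher, fetch and line_count are hypothetical names invented for illustration; they do not mirror the library's internals.

```python
class UrlFetcher:
    """Hypothetical stand-in for a fetcher that downloads its content."""

    def __init__(self, url: str) -> None:
        self.url = url

    def fetch(self) -> str:
        raise NotImplementedError("network access is omitted in this sketch")

    def line_count(self) -> int:
        # Shared logic that works on whatever fetch() returns.
        return len(self.fetch().splitlines())


class StrFetcher(UrlFetcher):
    """Serves a fixed string through the same interface."""

    def __init__(self, static_content: str) -> None:
        super().__init__(url="")  # placeholder URL; nothing is downloaded
        self._static_content = static_content

    def fetch(self) -> str:
        return self._static_content


fetcher = StrFetcher("https://example.com/a\nhttps://example.com/b\n")
count = fetcher.line_count()
```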

class usp.fetch_parse.AbstractSitemapParser

Abstract robots.txt / XML / plain text sitemap parser.

__init__(url: str, content: str, recursion_level: int, web_client: AbstractWebClient, parent_urls: Set[str])
abstract sitemap() AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns:

an instance of the appropriate sitemap class

class usp.fetch_parse.IndexRobotsTxtSitemapParser

Bases: AbstractSitemapParser

robots.txt index sitemap parser.

__init__(url: str, content: str, recursion_level: int, web_client: AbstractWebClient, parent_urls: Set[str])
sitemap() AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns:

an instance of the appropriate sitemap class
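
For orientation, a robots.txt file can declare sitemaps with "Sitemap:" lines. The sketch below shows one way such lines could be collected; the regular expression and the function name sitemap_urls_from_robots_txt are assumptions for this example, not the parser's actual code.

```python
import re
from typing import List

# Matches lines such as "Sitemap: https://example.com/sitemap.xml".
SITEMAP_LINE = re.compile(r"^\s*sitemap\s*:\s*(\S+)", re.IGNORECASE)


def sitemap_urls_from_robots_txt(content: str) -> List[str]:
    """Return the declared sitemap URLs in order, without duplicates."""
    urls: List[str] = []
    for line in content.splitlines():
        match = SITEMAP_LINE.match(line)
        if match and match.group(1) not in urls:
            urls.append(match.group(1))
    return urls


robots_txt = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news.xml
Sitemap: https://example.com/sitemap.xml
"""
found = sitemap_urls_from_robots_txt(robots_txt)
```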

class usp.fetch_parse.PlainTextSitemapParser

Bases: AbstractSitemapParser

Plain text sitemap parser.

sitemap() AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns:

an instance of the appropriate sitemap class
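
A plain text sitemap lists one URL per line. A minimal sketch of extracting those URLs follows; the function name urls_from_plain_text and the rule of keeping only HTTP(S) URLs are assumptions for this example.

```python
from typing import List
from urllib.parse import urlparse


def urls_from_plain_text(content: str) -> List[str]:
    """Keep the lines that look like absolute HTTP(S) URLs."""
    urls: List[str] = []
    for line in content.splitlines():
        candidate = line.strip()
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls


text = "https://example.com/\n\nnot a url\nhttp://example.com/about\n"
pages = urls_from_plain_text(text)
```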

class usp.fetch_parse.XMLSitemapParser

Bases: AbstractSitemapParser

Initial XML sitemap parser.

Instantiates an Expat parser and registers handler methods, which determine the specific format and instantiate a concrete parser (inheriting from AbstractXMLSitemapParser) to extract data.

__init__(url: str, content: str, recursion_level: int, web_client: AbstractWebClient, parent_urls: Set[str])
sitemap() AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns:

an instance of the appropriate sitemap class
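
The format detection described for this class can be sketched with the standard library's Expat bindings: the first element seen decides which concrete parser applies. The mapping from root element to parser class name and the function name detect_parser are assumptions made for illustration.

```python
import xml.parsers.expat
from typing import List, Optional

# Assumed mapping from root element to concrete parser class name.
ROOT_TO_PARSER = {
    "urlset": "PagesXMLSitemapParser",
    "sitemapindex": "IndexXMLSitemapParser",
    "rss": "PagesRSSSitemapParser",
    "feed": "PagesAtomSitemapParser",
}


def detect_parser(content: str) -> Optional[str]:
    """Return the parser name matching the document's root element."""
    chosen: List[Optional[str]] = []

    def on_start(name: str, attrs: dict) -> None:
        # Only the first (root) element decides the format.
        if not chosen:
            # With a namespace separator, names arrive as "<uri> <local>".
            local_name = name.rsplit(" ", 1)[-1]
            chosen.append(ROOT_TO_PARSER.get(local_name))

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = on_start
    parser.Parse(content, True)
    return chosen[0] if chosen else None


index_xml = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://example.com/a.xml</loc></sitemap>"
    "</sitemapindex>"
)
detected = detect_parser(index_xml)
```

Malformed XML makes Expat raise xml.parsers.expat.ExpatError, so a caller would need to decide how to report that case.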

class usp.fetch_parse.AbstractXMLSitemapParser

Abstract XML sitemap parser.

__init__(url: str)
abstract sitemap() AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns:

an instance of the appropriate sitemap class

xml_char_data(data: str) None

Concrete parser handler for character data.

Consecutive calls are concatenated until an XML element start or end is reached, because the parser may invoke this handler several times for a single string, e.g. ABC & DEF (written ABC &amp; DEF in the XML source).

See xmlparser.CharacterDataHandler

Parameters:

data – string data
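
The buffering behaviour can be reproduced with the standard library's Expat bindings. In this sketch the class CharDataCollector and its attributes are hypothetical; only the handler attribute names on the Expat parser come from the standard library.

```python
import xml.parsers.expat
from typing import List


class CharDataCollector:
    """Joins character data chunks until an element boundary is reached."""

    def __init__(self) -> None:
        self.calls = 0
        self._buffer: List[str] = []
        self.texts: List[str] = []

    def xml_element_start(self, name: str, attrs: dict) -> None:
        self._buffer = []

    def xml_char_data(self, data: str) -> None:
        # Expat may deliver one logical string in several pieces.
        self.calls += 1
        self._buffer.append(data)

    def xml_element_end(self, name: str) -> None:
        self.texts.append("".join(self._buffer))
        self._buffer = []


collector = CharDataCollector()
parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = collector.xml_element_start
parser.CharacterDataHandler = collector.xml_char_data
parser.EndElementHandler = collector.xml_element_end
parser.Parse("<title>ABC &amp; DEF</title>", True)
```

After parsing, collector.texts holds the joined string, while collector.calls records how many chunks Expat actually delivered.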

xml_element_end(name: str) None

Concrete parser handler when the end of an element is encountered.

See xmlparser.EndElementHandler

Parameters:

name – element name, potentially prefixed with namespace

xml_element_start(name: str, attrs: Dict[str, str]) None

Concrete parser handler when the start of an element is encountered.

See xmlparser.StartElementHandler

Parameters:
  • name – element name, potentially prefixed with namespace

  • attrs – element attributes
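
When an Expat parser is created with a namespace separator, element names reach the handlers as the namespace URI, the separator, and the local name. The sketch below shows one way to shorten such names to a prefix:local form; the NAMESPACE_PREFIXES table and the function normalize_name are assumptions for this example.

```python
import xml.parsers.expat
from typing import List

# Assumed namespace-to-prefix table for this sketch.
NAMESPACE_PREFIXES = {
    "http://www.sitemaps.org/schemas/sitemap/0.9": "sitemap",
    "http://www.google.com/schemas/sitemap-news/0.9": "news",
}


def normalize_name(name: str) -> str:
    """Turn Expat's "<uri> <local>" form into "<prefix>:<local>"."""
    if " " not in name:
        return name  # element has no namespace
    uri, local = name.rsplit(" ", 1)
    prefix = NAMESPACE_PREFIXES.get(uri)
    return f"{prefix}:{local}" if prefix else local


seen: List[str] = []
parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
parser.StartElementHandler = lambda name, attrs: seen.append(normalize_name(name))
parser.Parse(
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
    'xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">'
    "<url><loc>https://example.com/</loc><news:news/></url></urlset>",
    True,
)
```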

class usp.fetch_parse.IndexXMLSitemapParser

Bases: AbstractXMLSitemapParser

Index XML sitemap parser.

__init__(url: str, web_client: AbstractWebClient, recursion_level: int, parent_urls: Set[str])
sitemap() AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns:

an instance of the appropriate sitemap class

xml_element_end(name: str) None

Concrete parser handler when the end of an element is encountered.

See xmlparser.EndElementHandler

Parameters:

name – element name, potentially prefixed with namespace

class usp.fetch_parse.PagesXMLSitemapParser

Bases: AbstractXMLSitemapParser

Pages XML sitemap parser.

class Image

Bases: object

Data class for holding image data while parsing.

__init__()
class Page

Bases: object

Simple data class for holding various properties for a single <url> entry while parsing.

__init__()
page() SitemapPage | None

Return constructed sitemap page if one has been completed, otherwise None.

__init__(url: str)
sitemap() AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns:

an instance of the appropriate sitemap class

xml_element_end(name: str) None

Concrete parser handler when the end of an element is encountered.

See xmlparser.EndElementHandler

Parameters:

name – element name, potentially prefixed with namespace

xml_element_start(name: str, attrs: Dict[str, str]) None

Concrete parser handler when the start of an element is encountered.

See xmlparser.StartElementHandler

Parameters:
  • name – element name, potentially prefixed with namespace

  • attrs – element attributes
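
To show how a per-entry holder such as Page can be filled by the start, character data, and end handlers, here is a simplified sketch. PagesSketchParser, its attributes, and the rule of dropping entries without a <loc> are assumptions for illustration; namespaces are left out to keep the example short.

```python
import xml.parsers.expat
from typing import List, Optional


class Page:
    """Holds the properties of one <url> entry while it is being parsed."""

    def __init__(self) -> None:
        self.url: Optional[str] = None
        self.last_modified: Optional[str] = None


class PagesSketchParser:
    def __init__(self) -> None:
        self.pages: List[Page] = []
        self._current: Optional[Page] = None
        self._text: List[str] = []

    def xml_element_start(self, name: str, attrs: dict) -> None:
        self._text = []
        if name == "url":
            self._current = Page()

    def xml_char_data(self, data: str) -> None:
        self._text.append(data)

    def xml_element_end(self, name: str) -> None:
        text = "".join(self._text).strip()
        self._text = []
        if self._current is None:
            return
        if name == "loc":
            self._current.url = text
        elif name == "lastmod":
            self._current.last_modified = text
        elif name == "url":
            # Keep the entry only if it carries a URL.
            if self._current.url:
                self.pages.append(self._current)
            self._current = None


sketch = PagesSketchParser()
expat_parser = xml.parsers.expat.ParserCreate()
expat_parser.StartElementHandler = sketch.xml_element_start
expat_parser.CharacterDataHandler = sketch.xml_char_data
expat_parser.EndElementHandler = sketch.xml_element_end
expat_parser.Parse(
    "<urlset>"
    "<url><loc>https://example.com/</loc><lastmod>2024-01-01</lastmod></url>"
    "<url><lastmod>2024-01-02</lastmod></url>"
    "</urlset>",
    True,
)
```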

class usp.fetch_parse.PagesRSSSitemapParser

Bases: AbstractXMLSitemapParser

Pages RSS 2.0 sitemap parser.

https://validator.w3.org/feed/docs/rss2.html

class Page

Bases: object

Data class for holding various properties for a single RSS <item> while parsing.

__init__()
page() SitemapPage | None

Return constructed sitemap page if one has been completed, otherwise None.

__init__(url: str)
sitemap() AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns:

an instance of the appropriate sitemap class

xml_element_end(name: str) None

Concrete parser handler when the end of an element is encountered.

See xmlparser.EndElementHandler

Parameters:

name – element name, potentially prefixed with namespace

xml_element_start(name: str, attrs: Dict[str, str]) None

Concrete parser handler when the start of an element is encountered.

See xmlparser.StartElementHandler

Parameters:
  • name – element name, potentially prefixed with namespace

  • attrs – element attributes

class usp.fetch_parse.PagesAtomSitemapParser

Bases: AbstractXMLSitemapParser

Pages Atom 0.3 / 1.0 sitemap parser.

References:

class Page

Bases: object

Data class for holding various properties for a single Atom <entry> while parsing.

__init__()
page() SitemapPage | None

Return constructed sitemap page if one has been completed, otherwise None.

__init__(url: str)
sitemap() AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns:

an instance of the appropriate sitemap class

xml_element_end(name: str) None

Concrete parser handler when the end of an element is encountered.

See xmlparser.EndElementHandler

Parameters:

name – element name, potentially prefixed with namespace

xml_element_start(name: str, attrs: Dict[str, str]) None

Concrete parser handler when the start of an element is encountered.

See xmlparser.StartElementHandler

Parameters:
  • name – element name, potentially prefixed with namespace

  • attrs – element attributes
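
In Atom feeds the entry URL is carried in the href attribute of a <link> element rather than in character data, so the attrs argument of the start handler matters. The sketch below illustrates that idea; AtomSketchParser and its attributes are hypothetical, and namespaces are left out for brevity.

```python
import xml.parsers.expat
from typing import Dict, List, Optional


class AtomSketchParser:
    """Collects one link per <entry>, read from <link href="...">."""

    def __init__(self) -> None:
        self.links: List[str] = []
        self._in_entry = False
        self._entry_link: Optional[str] = None

    def xml_element_start(self, name: str, attrs: Dict[str, str]) -> None:
        if name == "entry":
            self._in_entry = True
            self._entry_link = None
        elif name == "link" and self._in_entry:
            # A link without a rel attribute is treated as rel="alternate".
            is_alternate = attrs.get("rel", "alternate") == "alternate"
            if is_alternate and self._entry_link is None:
                self._entry_link = attrs.get("href")

    def xml_element_end(self, name: str) -> None:
        if name == "entry":
            if self._entry_link:
                self.links.append(self._entry_link)
            self._in_entry = False


atom = AtomSketchParser()
atom_parser = xml.parsers.expat.ParserCreate()
atom_parser.StartElementHandler = atom.xml_element_start
atom_parser.EndElementHandler = atom.xml_element_end
atom_parser.Parse(
    "<feed>"
    '<link href="https://example.com/"/>'
    "<entry><title>A</title>"
    '<link rel="self" href="https://example.com/a.atom"/>'
    '<link href="https://example.com/a"/>'
    "</entry>"
    "</feed>",
    True,
)
```

The feed-level link is ignored because it appears outside any <entry>, and the rel="self" link is skipped in favour of the alternate one.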