usp.fetch_parse

Sitemap fetchers and parsers.

class usp.fetch_parse.SitemapFetcher

Fetches and parses the sitemap at a given URL, and any declared sub-sitemaps.

__init__(url: str, recursion_level: int, web_client: AbstractWebClient | None = None, parent_urls: set[str] | None = None, quiet_404: bool = False, recurse_callback: Callable[[str, int, set[str]], bool] | None = None, recurse_list_callback: Callable[[list[str], int, set[str]], list[str]] | None = None)

Parameters:

url – URL of the sitemap to fetch and parse.
recursion_level – Current recursion level of the parser.
web_client – Web client to use. If None, a RequestsWebClient will be used.
parent_urls – Set of parent URLs that led to this sitemap.
quiet_404 – Whether 404 errors are expected and should be logged at a reduced level. Useful for speculative fetching of known URLs.
recurse_callback – Optional callback to filter out a sub-sitemap. See RecurseCallbackType.
recurse_list_callback – Optional callback to filter the list of sub-sitemaps. See RecurseListCallbackType.
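
The two callback parameters can be illustrated with plain functions matching the type signatures above. This is a minimal sketch, not the library's own code: the function names, the depth cut-off, and the `/archive/` rule are hypothetical. It assumes the three arguments are the sub-sitemap URL (or list of URLs), the current recursion level, and the set of parent URLs, and that a `True` return means the sub-sitemap should be fetched. RecurseCallbackType and RecurseListCallbackType remain the authoritative description of the contract.

```python
from urllib.parse import urlsplit


def skip_archive_sitemaps(url: str, recursion_level: int, parent_urls: set[str]) -> bool:
    """Candidate ``recurse_callback``: decide on one sub-sitemap at a time.

    Assumption: returning True keeps the sub-sitemap, False filters it out.
    """
    if recursion_level >= 3:  # hypothetical depth cut-off
        return False
    return "/archive/" not in urlsplit(url).path


def same_host_only(urls: list[str], recursion_level: int, parent_urls: set[str]) -> list[str]:
    """Candidate ``recurse_list_callback``: filter the whole list at once."""
    parent_hosts = {urlsplit(parent).netloc for parent in parent_urls}
    if not parent_hosts:
        # No parents recorded, so there is nothing to compare against.
        return urls
    return [url for url in urls if urlsplit(url).netloc in parent_hosts]
```

Under those assumptions, either function could be passed as `recurse_callback=skip_archive_sitemaps` or `recurse_list_callback=same_host_only` when constructing a SitemapFetcher.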

Raises:

SitemapException – If the maximum recursion depth is exceeded.
SitemapException – If the URL is in the parent URLs set.
SitemapException – If the URL is not an HTTP(S) URL.
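
The three documented failure conditions can be pictured as guard checks that run before any fetching happens. The sketch below is an illustration only, under stated assumptions: `SketchSitemapError` is a stand-in for SitemapException so the example runs without the library, `MAX_RECURSION_LEVEL` is a hypothetical limit, and `check_can_fetch` is a hypothetical helper, not a function of usp.

```python
from urllib.parse import urlsplit

MAX_RECURSION_LEVEL = 10  # hypothetical limit, not taken from the library


class SketchSitemapError(Exception):
    """Stand-in for SitemapException so the sketch runs without usp."""


def check_can_fetch(url: str, recursion_level: int, parent_urls: set[str]) -> None:
    """Raise if any of the three documented conditions holds."""
    if recursion_level > MAX_RECURSION_LEVEL:
        raise SketchSitemapError(f"Recursion level exceeded {MAX_RECURSION_LEVEL} for {url}")
    if url in parent_urls:
        # Fetching a sitemap that is its own ancestor would loop forever.
        raise SketchSitemapError(f"Recursion detected: {url} is one of its own parents")
    if urlsplit(url).scheme.lower() not in ("http", "https"):
        raise SketchSitemapError(f"Not an HTTP(S) URL: {url}")
```

Checking `parent_urls` is what prevents a cycle of sitemaps that reference each other from being followed indefinitely.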

sitemap() → AbstractSitemap

Fetch and parse the sitemap.

Returns: The parsed sitemap, an instance of a subclass of AbstractSitemap. If an HTTP error is encountered or the sitemap cannot be parsed, this is an InvalidSitemap.

class usp.fetch_parse.SitemapStrParser

Bases: SitemapFetcher

Custom fetcher that parses a string instead of downloading from a URL.

This is a little bit hacky, but it allows us to support local content parsing without having to change too much.

__init__(static_content: str)

Initialise a new string parser.

Parameters: static_content – String containing the sitemap text to parse.

class usp.fetch_parse.AbstractSitemapParser

Abstract robots.txt / XML / plain text sitemap parser.

__init__(url: str, content: str, recursion_level: int, web_client: AbstractWebClient, parent_urls: set[str], recurse_callback: Callable[[str, int, set[str]], bool] | None = None, recurse_list_callback: Callable[[list[str], int, set[str]], list[str]] | None = None)

abstractmethod sitemap() → AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns: an instance of the appropriate sitemap class

class usp.fetch_parse.IndexRobotsTxtSitemapParser

Bases: AbstractSitemapParser

robots.txt index sitemap parser.

__init__(url: str, content: str, recursion_level: int, web_client: AbstractWebClient, parent_urls: set[str], recurse_callback: Callable[[str, int, set[str]], bool] | None = None, recurse_list_callback: Callable[[list[str], int, set[str]], list[str]] | None = None)

sitemap() → AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns: an instance of the appropriate sitemap class

class usp.fetch_parse.PlainTextSitemapParser

Bases: AbstractSitemapParser

Plain text sitemap parser.

sitemap() → AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns: an instance of the appropriate sitemap class

class usp.fetch_parse.XMLSitemapParser

Bases: AbstractSitemapParser

Initial XML sitemap parser.

Instantiates an Expat parser and registers handler methods, which determine the specific format and instantiate a concrete parser (inheriting from AbstractXMLSitemapParser) to extract data.
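
The format-detection step described above can be sketched with the standard library's Expat bindings. This is an illustration under stated assumptions rather than the library's implementation: `detect_sitemap_kind` and the returned labels are hypothetical, and the choice of a space as namespace separator is an assumption. It relies on the sitemap protocol's two root elements, `urlset` for page sitemaps and `sitemapindex` for index sitemaps.

```python
import xml.parsers.expat

NAMESPACE_SEPARATOR = " "  # assumption: joins namespace URI and local name


def detect_sitemap_kind(xml_text: str) -> str:
    """Classify a sitemap by the local name of its root element."""
    root_names: list[str] = []

    def on_start(name: str, attrs: dict[str, str]) -> None:
        # Only the first element seen (the document root) matters here.
        if not root_names:
            root_names.append(name)

    parser = xml.parsers.expat.ParserCreate(namespace_separator=NAMESPACE_SEPARATOR)
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)  # raises ExpatError on malformed input

    # With namespace processing on, names arrive as "<namespace URI> <local name>".
    local_name = root_names[0].rsplit(NAMESPACE_SEPARATOR, 1)[-1]
    if local_name == "sitemapindex":
        return "index"
    if local_name == "urlset":
        return "pages"
    return "unknown"
```

A real parser would hand the remaining events to a format-specific handler once the root element is known, instead of only returning a label.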

__init__(url: str, content: str, recursion_level: int, web_client: AbstractWebClient, parent_urls: set[str], recurse_callback: Callable[[str, int, set[str]], bool] | None = None, recurse_list_callback: Callable[[list[str], int, set[str]], list[str]] | None = None)

sitemap() → AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns: an instance of the appropriate sitemap class

class usp.fetch_parse.AbstractXMLSitemapParser

Abstract XML sitemap parser.

__init__(url: str, recurse_callback: Callable[[str, int, set[str]], bool] | None = None, recurse_list_callback: Callable[[list[str], int, set[str]], list[str]] | None = None)

abstractmethod sitemap() → AbstractSitemap

Create the parsed sitemap instance and perform any sub-parsing needed.

Returns: an instance of the appropriate sitemap class

xml_char_data(data: str) → None

Concrete parser handler for character data.

Consecutive calls are concatenated until an XML element start or end is reached, because the handler may be called multiple times for a single string, e.g. ABC &amp; DEF.

See xmlparser.CharacterDataHandler

Parameters: data – string data
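
The buffering behaviour described for xml_char_data can be reproduced with the standard library's Expat parser. The sketch below is a simplified illustration, not the library's code: `TextCollector` and `collect_text` are hypothetical names, and only the three handler method names mirror this class. Chunks are appended to a buffer and joined when the element ends.

```python
import xml.parsers.expat


class TextCollector:
    """Joins character data chunks into one string per element."""

    def __init__(self) -> None:
        self.handler_calls = 0
        self.texts: list[tuple[str, str]] = []
        self._chunks: list[str] = []

    def xml_element_start(self, name: str, attrs: dict[str, str]) -> None:
        self._chunks = []  # a new element starts with an empty buffer

    def xml_char_data(self, data: str) -> None:
        self.handler_calls += 1  # Expat may split one string into several calls
        self._chunks.append(data)

    def xml_element_end(self, name: str) -> None:
        self.texts.append((name, "".join(self._chunks)))
        self._chunks = []


def collect_text(xml_text: str) -> TextCollector:
    collector = TextCollector()
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = collector.xml_element_start
    parser.EndElementHandler = collector.xml_element_end
    parser.CharacterDataHandler = collector.xml_char_data
    parser.Parse(xml_text, True)
    return collector


result = collect_text("<loc>ABC &amp; DEF</loc>")
print(result.texts)  # [('loc', 'ABC & DEF')]
```

Inspecting `result.handler_calls` shows how many chunks Expat delivered for that single string; the exact count depends on the parser, which is why the chunks are joined rather than used individually.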

xml_element_end(name: str) → None

Concrete parser handler when the end of an element is encountered.

See xmlparser.EndElementHandler

Parameters: name – element name, potentially prefixed with namespace

xml_element_start(name: str, attrs: dict[str, str]) → None

Concrete parser handler when the start of an element is encountered.

See xmlparser.StartElementHandler

Parameters:

name – element name, potentially prefixed with namespace
attrs – element attributes

class usp.fetch_parse.IndexXMLSitemapParser

Bases: AbstractXMLSitemapParser

Index XML sitemap parser.

__init__(url: str, web_client: AbstractWebClient, recursion_level: int, parent_urls: set[str], recurse_callback: Callable[[str, int, set[str]], bool] | None = None, recurse_list_callback: Callable[[list[str], int, set[str]], list[str]] | None = None)