usp.fetch_parse¶
Sitemap fetchers and parsers.
- class usp.fetch_parse.SitemapFetcher¶
Fetches and parses the sitemap at a given URL, and any declared sub-sitemaps.
- __init__(url: str, recursion_level: int, web_client: AbstractWebClient | None = None, parent_urls: set[str] | None = None, quiet_404: bool = False, recurse_callback: Callable[[str, int, set[str]], bool] | None = None, recurse_list_callback: Callable[[list[str], int, set[str]], list[str]] | None = None)¶
- Parameters:
url – URL of the sitemap to fetch and parse.
recursion_level – current recursion level of parser
web_client – Web client to use. If None, a RequestsWebClient will be used.
parent_urls – Set of parent URLs that led to this sitemap.
quiet_404 – Whether 404 errors are expected and should be logged at a reduced level, useful for speculative fetching of known URLs.
recurse_callback – Optional callback to filter out a sub-sitemap. See RecurseCallbackType.
recurse_list_callback – Optional callback to filter the list of sub-sitemaps. See RecurseListCallbackType.
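The two callback signatures above can be illustrated with plain functions. This is only a sketch: the function names, the `/news/` filter, the depth cut-off, and the cap of three are hypothetical. It also assumes that the single-URL callback returns `True` to keep a sub-sitemap; RecurseCallbackType and RecurseListCallbackType remain the authoritative description of the contract.

```python
def skip_news_sitemaps(url: str, recursion_level: int, parent_urls: set[str]) -> bool:
    """Hypothetical recurse_callback (assumed: True keeps the sub-sitemap)."""
    # Drop anything under /news/ and stop descending past level 2.
    return "/news/" not in url and recursion_level <= 2


def first_three(urls: list[str], recursion_level: int, parent_urls: set[str]) -> list[str]:
    """Hypothetical recurse_list_callback: returns the sub-sitemaps to keep."""
    # Keep at most three sub-sitemaps, preserving their declared order.
    return urls[:3]
```

Callables shaped like these would be passed as `recurse_callback` and `recurse_list_callback` respectively.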
- Raises:
SitemapException – If the maximum recursion depth is exceeded.
SitemapException – If the URL is in the parent URLs set.
SitemapException – If the URL is not an HTTP(S) URL.
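The three failure conditions listed above amount to guard checks that run before anything is fetched. A minimal standalone sketch follows. The depth limit of 10, the `SitemapError` stand-in class, and the `check_can_fetch` name are assumptions for illustration and are not taken from the library.

```python
from urllib.parse import urlparse

MAX_RECURSION_LEVEL = 10  # assumed value, for illustration only


class SitemapError(Exception):
    """Stand-in for the library's SitemapException."""


def check_can_fetch(url: str, recursion_level: int, parent_urls: set[str]) -> None:
    """Raise SitemapError if the URL should not be fetched."""
    # Guard 1: stop runaway nesting of sub-sitemaps.
    if recursion_level > MAX_RECURSION_LEVEL:
        raise SitemapError(f"Recursion level exceeded {MAX_RECURSION_LEVEL} for {url}")
    # Guard 2: a sitemap that lists one of its own ancestors would loop forever.
    if url in parent_urls:
        raise SitemapError(f"{url} is already among its parent URLs")
    # Guard 3: only HTTP(S) URLs with a host can be fetched.
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise SitemapError(f"{url} is not an HTTP(S) URL")
```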
- sitemap() AbstractSitemap¶
Fetch and parse the sitemap.
- Returns:
The parsed sitemap, an instance of a subclass of AbstractSitemap. If an HTTP error is encountered, or the sitemap cannot be parsed, an InvalidSitemap is returned.
- class usp.fetch_parse.SitemapStrParser¶
Bases: SitemapFetcher
Custom fetcher to parse a string instead of downloading from a URL.
This is a little hacky, but it lets us support local content parsing without changing too much.
- class usp.fetch_parse.AbstractSitemapParser¶
Abstract robots.txt / XML / plain text sitemap parser.
- __init__(url: str, content: str, recursion_level: int, web_client: AbstractWebClient, parent_urls: set[str], recurse_callback: Callable[[str, int, set[str]], bool] | None = None, recurse_list_callback: Callable[[list[str], int, set[str]], list[str]] | None = None)¶
- abstractmethod sitemap() AbstractSitemap¶
Create the parsed sitemap instance and perform any sub-parsing needed.
- Returns:
an instance of the appropriate sitemap class
- class usp.fetch_parse.IndexRobotsTxtSitemapParser¶
Bases: AbstractSitemapParser
robots.txt index sitemap parser.
- __init__(url: str, content: str, recursion_level: int, web_client: AbstractWebClient, parent_urls: set[str], recurse_callback: Callable[[str, int, set[str]], bool] | None = None, recurse_list_callback: Callable[[list[str], int, set[str]], list[str]] | None = None)¶
- sitemap() AbstractSitemap¶
Create the parsed sitemap instance and perform any sub-parsing needed.
- Returns:
an instance of the appropriate sitemap class
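A robots.txt file acts as a sitemap index through its `Sitemap:` lines. The sketch below shows one way to pull those URLs out of the file content using only the standard library. The function name, the case-insensitive match, and the de-duplication are assumptions made for illustration, not a description of this class's internals.

```python
import re

# Matches lines such as "Sitemap: https://example.com/sitemap.xml".
_SITEMAP_LINE = re.compile(r"^\s*sitemap\s*:\s*(\S+)", re.IGNORECASE)


def sitemap_urls_from_robots_txt(content: str) -> list[str]:
    """Return the sitemap URLs declared in robots.txt, in order, without repeats."""
    urls: list[str] = []
    for line in content.splitlines():
        match = _SITEMAP_LINE.match(line)
        if match and match.group(1) not in urls:
            urls.append(match.group(1))
    return urls
```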
- class usp.fetch_parse.PlainTextSitemapParser¶
Bases: AbstractSitemapParser
Plain text sitemap parser.
- sitemap() AbstractSitemap¶
Create the parsed sitemap instance and perform any sub-parsing needed.
- Returns:
an instance of the appropriate sitemap class
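A plain text sitemap lists one page URL per line. As a rough illustration of what parsing such a file involves, the sketch below keeps only lines that look like HTTP(S) URLs. The function name and the exact filtering rules are assumptions and may differ from what this class does.

```python
from urllib.parse import urlparse


def page_urls_from_plain_text(content: str) -> list[str]:
    """Return the HTTP(S) URLs found one per line in a plain text sitemap."""
    urls: list[str] = []
    for line in content.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        try:
            parsed = urlparse(candidate)
        except ValueError:
            continue  # malformed URL, e.g. a broken IPv6 literal
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls
```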
- class usp.fetch_parse.XMLSitemapParser¶
Bases: AbstractSitemapParser
Initial XML sitemap parser.
Instantiates an Expat parser and registers handler methods, which determine the specific format and instantiate a concrete parser (inheriting from AbstractXMLSitemapParser) to extract data.
- __init__(url: str, content: str, recursion_level: int, web_client: AbstractWebClient, parent_urls: set[str], recurse_callback: Callable[[str, int, set[str]], bool] | None = None, recurse_list_callback: Callable[[list[str], int, set[str]], list[str]] | None = None)¶
- sitemap() AbstractSitemap¶
Create the parsed sitemap instance and perform any sub-parsing needed.
- Returns:
an instance of the appropriate sitemap class
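The format detection described for XMLSitemapParser can be approximated with the standard library's Expat bindings by inspecting the root element. This is a sketch under stated assumptions: the mapping table, the labels in it, and the `detect_sitemap_format` name are hypothetical, and the real class hands over to a concrete parser object rather than returning a label.

```python
import xml.parsers.expat
from typing import Optional

# Root element (local name) -> format label; labels are illustrative only.
ROOT_ELEMENT_FORMATS = {
    "urlset": "pages XML",
    "sitemapindex": "index XML",
    "rss": "pages RSS",
    "feed": "pages Atom",
}


def detect_sitemap_format(content: str) -> Optional[str]:
    """Return a format label based on the document's root element, or None."""
    seen_roots: list[str] = []

    def on_start(name: str, attrs: dict) -> None:
        # Only the first element (the root) matters. With a namespace
        # separator of " ", namespaced names arrive as "<uri> <local name>".
        if not seen_roots:
            seen_roots.append(name.split(" ")[-1].lower())

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = on_start
    try:
        parser.Parse(content, True)
    except xml.parsers.expat.ExpatError:
        pass  # malformed input: fall back to whatever was seen before the error
    if not seen_roots:
        return None
    return ROOT_ELEMENT_FORMATS.get(seen_roots[0])
```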
- class usp.fetch_parse.AbstractXMLSitemapParser¶
Abstract XML sitemap parser.
- __init__(url: str, recurse_callback: Callable[[str, int, set[str]], bool] | None = None, recurse_list_callback: Callable[[list[str], int, set[str]], list[str]] | None = None)¶
- abstractmethod sitemap() AbstractSitemap¶
Create the parsed sitemap instance and perform any sub-parsing needed.
- Returns:
an instance of the appropriate sitemap class
- xml_char_data(data: str) None¶
Concrete parser handler for character data.
Multiple consecutive calls are concatenated until an XML element start or end is reached, because the handler may be called several times for a single string, e.g. ABC & DEF.
See xmlparser.CharacterDataHandler
- Parameters:
data – string data
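The buffering behaviour described for xml_char_data can be reproduced with the standard library's Expat bindings. The sketch below is not this class's code: `CharDataCollector` and `collect_text` are hypothetical names. It accumulates chunks and joins them when an element ends, so that text Expat delivers in pieces (for example around an entity such as `&amp;`) comes out whole.

```python
import xml.parsers.expat


class CharDataCollector:
    """Joins character data chunks that belong to the same element."""

    def __init__(self) -> None:
        self._chunks: list[str] = []
        self.values: dict[str, str] = {}

    def element_start(self, name: str, attrs: dict) -> None:
        self._chunks = []  # a new element starts: discard pending text

    def char_data(self, data: str) -> None:
        self._chunks.append(data)  # may be called several times per string

    def element_end(self, name: str) -> None:
        text = "".join(self._chunks).strip()
        if text:
            self.values[name] = text
        self._chunks = []


def collect_text(document: str) -> dict[str, str]:
    """Parse the document and return the text found in each leaf element."""
    collector = CharDataCollector()
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = collector.element_start
    parser.CharacterDataHandler = collector.char_data
    parser.EndElementHandler = collector.element_end
    parser.Parse(document, True)
    return collector.values
```

Without the join in `element_end`, a value like `ABC &amp; DEF` could be seen only as separate fragments.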
- xml_element_end(name: str) None¶
Concrete parser handler when the end of an element is encountered.
See xmlparser.EndElementHandler
- Parameters:
name – element name, potentially prefixed with namespace
- class usp.fetch_parse.IndexXMLSitemapParser¶
Bases: AbstractXMLSitemapParser
Index XML sitemap parser.
- __init__(url: str, web_client: AbstractWebClient, recursion_level: int, parent_urls: set[str], recurse_callback: Callable[[str, int, set[str]], bool] | None = None, recurse_list_callback: Callable[[list[str], int, set[str]], list[str]] | None = None)¶
- sitemap() AbstractSitemap¶
Create the parsed sitemap instance and perform any sub-parsing needed.
- Returns:
an instance of the appropriate sitemap class
- xml_element_end(name: str) None¶
Concrete parser handler when the end of an element is encountered.
See xmlparser.EndElementHandler
- Parameters:
name – element name, potentially prefixed with namespace
- class usp.fetch_parse.PagesXMLSitemapParser¶
Bases: AbstractXMLSitemapParser
Pages XML sitemap parser.
- class Page¶
Bases: object
Simple data class for holding various properties for a single <url> entry while parsing.
- __init__()¶
- page() SitemapPage | None¶
Return constructed sitemap page if one has been completed, otherwise None.
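The "return a page only once it is complete" pattern can be sketched as a small draft object that is filled in while a `<url>` entry is parsed. Everything here is hypothetical: the `PageDraft` name, its fields, and the plain dictionary used as a stand-in for SitemapPage. The fallback priority of 0.5 follows the default given in the sitemaps.org protocol.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_PRIORITY = 0.5  # default priority in the sitemaps.org protocol


@dataclass
class PageDraft:
    """Holds raw string values collected for one <url> entry."""

    loc: Optional[str] = None
    lastmod: Optional[str] = None
    priority: Optional[str] = None

    def page(self) -> Optional[dict]:
        """Return a page record if a URL was collected, otherwise None."""
        url = (self.loc or "").strip()
        if not url:
            return None  # entry is incomplete without a <loc>
        try:
            priority = float(self.priority) if self.priority else DEFAULT_PRIORITY
        except ValueError:
            priority = DEFAULT_PRIORITY  # unparsable priority: use the default
        return {"url": url, "last_modified": self.lastmod, "priority": priority}
```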
- sitemap() AbstractSitemap¶
Create the parsed sitemap instance and perform any sub-parsing needed.
- Returns:
an instance of the appropriate sitemap class
- xml_element_end(name: str) None¶
Concrete parser handler when the end of an element is encountered.
See xmlparser.EndElementHandler
- Parameters:
name – element name, potentially prefixed with namespace
- class usp.fetch_parse.PagesRSSSitemapParser¶
Bases: AbstractXMLSitemapParser
Pages RSS 2.0 sitemap parser.
https://validator.w3.org/feed/docs/rss2.html
- class Page¶
Bases: object
Data class for holding various properties for a single RSS <item> while parsing.
- __init__()¶
- page() SitemapPage | None¶
Return constructed sitemap page if one has been completed, otherwise None.
- sitemap() AbstractSitemap¶
Create the parsed sitemap instance and perform any sub-parsing needed.
- Returns:
an instance of the appropriate sitemap class
- xml_element_end(name: str) None¶
Concrete parser handler when the end of an element is encountered.
See xmlparser.EndElementHandler
- Parameters:
name – element name, potentially prefixed with namespace
- class usp.fetch_parse.PagesAtomSitemapParser¶
Bases: AbstractXMLSitemapParser
Pages Atom 0.3 / 1.0 sitemap parser.
References:
- class Page¶
Bases: object
Data class for holding various properties for a single Atom <entry> while parsing.
- __init__()¶
- page() SitemapPage | None¶
Return constructed sitemap page if one has been completed, otherwise None.
- sitemap() AbstractSitemap¶
Create the parsed sitemap instance and perform any sub-parsing needed.
- Returns:
an instance of the appropriate sitemap class
- xml_element_end(name: str) None¶
Concrete parser handler when the end of an element is encountered.
See xmlparser.EndElementHandler
- Parameters:
name – element name, potentially prefixed with namespace