usp.helpers
Helper utilities.
- usp.helpers.get_url_retry_on_client_errors(url: str, web_client: AbstractWebClient, retry_count: int = 5, sleep_between_retries: int = 1, quiet_404: bool = False) → AbstractWebClientResponse
Fetch a URL, retrying when the error is retryable.
- Parameters:
url – URL to fetch.
web_client – Web client object to use for fetching.
retry_count – How many times to retry fetching the same URL.
sleep_between_retries – How long to sleep between retries, in seconds.
quiet_404 – Whether to log 404 errors at a lower level.
- Returns:
Web client response object.
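The retry behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `FakeResponse`, its `retryable` flag, and `fetch_with_retries` are hypothetical names, and the sketch assumes `retry_count` is the total number of attempts.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a web client response."""
    status: int
    retryable: bool = False


def fetch_with_retries(
    fetch: Callable[[str], FakeResponse],
    url: str,
    retry_count: int = 5,
    sleep_between_retries: float = 1,
) -> FakeResponse:
    """Call fetch(url) up to retry_count times (assumed: total attempts)."""
    if retry_count < 1:
        raise ValueError("retry_count must be at least 1")
    response = None
    for attempt in range(1, retry_count + 1):
        response = fetch(url)
        if not response.retryable:
            # Success or a non-retryable error: stop immediately.
            return response
        if attempt < retry_count:
            time.sleep(sleep_between_retries)
    # Every attempt failed with a retryable error; return the last response.
    return response
```

With a fetcher that fails twice and then succeeds, the loop stops on the third call and returns that response.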
- usp.helpers.gunzip(data: bytes) → bytes
Decompress gzipped data.
- Raises:
GunzipException – If the data cannot be decompressed.
- Parameters:
data – Gzipped data.
- Returns:
Gunzipped data.
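As a rough illustration of the contract above (bytes in, bytes out, a dedicated exception on failure), here is a sketch built on the standard `gzip` module. `GunzipError` and `gunzip_sketch` are hypothetical names, and the library's own implementation may differ.

```python
import gzip
import zlib


class GunzipError(Exception):
    """Hypothetical stand-in for the library's GunzipException."""


def gunzip_sketch(data: bytes) -> bytes:
    """Decompress gzipped bytes, wrapping low-level errors in GunzipError."""
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error) as ex:
        # gzip.BadGzipFile is a subclass of OSError; EOFError covers truncation.
        raise GunzipError(f"Unable to gunzip data: {ex}") from ex
```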
- usp.helpers.html_unescape_strip(string: str | None) → str | None
Decode HTML entities and strip the string; return None if the result is empty. A None input is passed through as None.
- Parameters:
string – String to decode HTML entities in.
- Returns:
Stripped string with HTML entities decoded; None if parameter string was empty or None.
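The behaviour described above could look roughly like the following, assuming entities are decoded before stripping. `html_unescape_strip_sketch` is a hypothetical name for this illustration.

```python
import html
from typing import Optional


def html_unescape_strip_sketch(string: Optional[str]) -> Optional[str]:
    """Decode entities, strip whitespace, and map empty results to None."""
    if string is None:
        return None
    cleaned = html.unescape(string).strip()
    # An empty string is falsy, so it becomes None here.
    return cleaned or None
```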
- usp.helpers.is_http_url(url: str) → bool
Return True if the URL uses the “http” or “https” scheme.
- Parameters:
url – URL to test.
- Returns:
True if the URL uses the “http” or “https” scheme.
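A scheme check of this kind can be sketched with `urllib.parse`. The extra requirement of a non-empty host is an assumption of this sketch, not something the description above states, and `is_http_url_sketch` is a hypothetical name.

```python
from urllib.parse import urlsplit


def is_http_url_sketch(url: str) -> bool:
    """Return True for http/https URLs that also carry a host (assumption)."""
    if not url:
        return False
    parts = urlsplit(url)
    # urlsplit lowercases the scheme, so "HTTP://..." also matches.
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```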
- usp.helpers.parse_iso8601_date(date_string: str) → datetime | None
Parse an ISO 8601 date (e.g. from a sitemap’s <publication_date>) into a datetime.datetime object.
- Parameters:
date_string – ISO 8601 date, e.g. “2018-01-12T21:57:27Z” or “1997-07-16T19:20:30+01:00”.
- Returns:
datetime.datetime object for the parsed date.
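To illustrate the two example formats above, here is a standard-library sketch. It is not the library's parser: `parse_iso8601_sketch` is a hypothetical name, and returning None for unparseable input is an assumption based only on the `datetime | None` annotation.

```python
from datetime import datetime
from typing import Optional


def parse_iso8601_sketch(date_string: str) -> Optional[datetime]:
    """Parse a common ISO 8601 form; return None on failure (assumption)."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    if date_string.endswith("Z"):
        date_string = date_string[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(date_string)
    except ValueError:
        return None
```

Both example strings yield timezone-aware objects: the first in UTC, the second with a +01:00 offset.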
- usp.helpers.parse_rfc2822_date(date_string: str) → datetime | None
Parse an RFC 2822 date (e.g. from Atom’s <issued>) into a datetime.datetime object.
- Parameters:
date_string – RFC 2822 date, e.g. “Tue, 10 Aug 2010 20:43:53 -0000”.
- Returns:
datetime.datetime object for the parsed date.
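RFC 2822 dates can be parsed with the standard library's `email.utils.parsedate_to_datetime`, as in the sketch below. `parse_rfc2822_sketch` is a hypothetical name, the library may use a different parser, and returning None on failure is an assumption drawn from the return annotation.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_rfc2822_sketch(date_string: str) -> Optional[datetime]:
    """Parse an RFC 2822 date; return None if it cannot be parsed (assumption)."""
    try:
        return parsedate_to_datetime(date_string)
    except (TypeError, ValueError):
        # Which exception is raised for bad input depends on the Python version.
        return None
```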
- usp.helpers.strip_url_to_homepage(url: str) → str
Strip URL to its homepage.
- Raises:
StripURLToHomepageException – If URL is empty or cannot be parsed.
- Parameters:
url – URL to strip, e.g. “http://www.example.com/page.html”.
- Returns:
Stripped homepage URL, e.g. “http://www.example.com/”.
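The example above (page URL in, homepage URL out) can be reproduced with `urllib.parse`. In this sketch, `StripURLError` and `strip_url_to_homepage_sketch` are hypothetical names, and treating a missing scheme or host as "cannot be parsed" is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


class StripURLError(Exception):
    """Hypothetical stand-in for StripURLToHomepageException."""


def strip_url_to_homepage_sketch(url: str) -> str:
    """Keep only the scheme and host, and set the path to "/"."""
    if not url:
        raise StripURLError("URL is empty.")
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise StripURLError(f"Unable to parse URL: {url}")
    # Drop the path, query string, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))
```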
- usp.helpers.ungzipped_response_content(url: str, response: AbstractWebClientSuccessResponse) → str
Return the HTTP response’s decoded content, gunzipping it first if necessary.
- Parameters:
url – URL the response was fetched from.
response – Response object.
- Returns:
Decoded and (if necessary) gunzipped response string.
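The text above does not say how the function decides that gunzipping is necessary. The sketch below uses one possible heuristic, the gzip magic bytes, and works on raw bytes rather than a response object. `decoded_content_sketch` is a hypothetical name, and both the heuristic and the UTF-8 decoding are assumptions.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # First two bytes of every gzip stream.


def decoded_content_sketch(raw: bytes) -> str:
    """Gunzip raw bytes if they look gzipped, then decode them as UTF-8."""
    if raw.startswith(GZIP_MAGIC):
        raw = gzip.decompress(raw)
    # Replace undecodable bytes instead of failing on malformed input.
    return raw.decode("utf-8", errors="replace")
```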
- usp.helpers.RecurseCallbackType
Type for the callback function used to decide whether to recurse into a sitemap.
A function that takes the sub-sitemap URL, the current recursion level, and the set of parent URLs as arguments, and returns a boolean indicating whether to recurse into the sub-sitemap.
- usp.helpers.RecurseListCallbackType
Type for the callback function used to filter the list of sitemaps to recurse into.
A function that takes the list of sub-sitemap URLs, the current recursion level, and the set of parent URLs as arguments, and returns a list of sub-sitemap URLs to recurse into.
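Callbacks matching the two descriptions above might look like the following. The aliases `RecurseCallback` and `RecurseListCallback` are hypothetical, written from the prose descriptions rather than copied from the library, and the filtering rules are arbitrary examples.

```python
from typing import Callable, List, Set

# Hypothetical aliases mirroring the argument lists described above.
RecurseCallback = Callable[[str, int, Set[str]], bool]
RecurseListCallback = Callable[[List[str], int, Set[str]], List[str]]


def skip_deep_or_news(url: str, recursion_level: int, parent_urls: Set[str]) -> bool:
    """Recurse only into shallow sub-sitemaps whose URL lacks "news"."""
    return recursion_level < 3 and "news" not in url


def keep_first_two(urls: List[str], recursion_level: int, parent_urls: Set[str]) -> List[str]:
    """Drop already-visited parents, then keep at most two sub-sitemaps."""
    return [u for u in urls if u not in parent_urls][:2]


should_recurse: RecurseCallback = skip_deep_or_news
filter_sitemaps: RecurseListCallback = keep_first_two
```

The first form decides one sub-sitemap at a time. The second sees the whole list, so it can also cap or reorder it.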