usp.helpers

Helper utilities.

usp.helpers.get_url_retry_on_client_errors(url: str, web_client: AbstractWebClient, retry_count: int = 5, sleep_between_retries: int = 1) → AbstractWebClientResponse

Fetch a URL, retrying on retryable errors.

Parameters:
  • url – URL to fetch.

  • web_client – Web client object to use for fetching.

  • retry_count – How many times to retry fetching the same URL.

  • sleep_between_retries – How long to sleep between retries, in seconds.

Returns:

Web client response object.
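
The retry behaviour described above can be illustrated with a generic loop. This is a minimal sketch, not the library's implementation: `fetch_with_retries`, the `fetch` callable, and the `is_retryable` predicate are hypothetical names, which responses count as retryable is left to the caller, and treating `retry_count` as the total number of attempts is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[str], T],
    is_retryable: Callable[[T], bool],
    url: str,
    retry_count: int = 5,
    sleep_between_retries: float = 1,
) -> T:
    """Call fetch(url) up to retry_count times, sleeping between attempts.

    Hypothetical helper: the response type and the retry predicate are
    placeholders, not part of usp.
    """
    if retry_count < 1:
        raise ValueError("retry_count must be at least 1")

    response = fetch(url)
    for _ in range(retry_count - 1):
        if not is_retryable(response):
            break
        time.sleep(sleep_between_retries)
        response = fetch(url)

    # Return the last response, whether or not it was still retryable.
    return response
```

Returning the last response instead of raising keeps the decision about how to handle a persistent failure with the caller.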

usp.helpers.gunzip(data: bytes) → bytes

Gunzip data.

Raises:

GunzipException – If the data cannot be decompressed.

Parameters:

data – Gzipped data.

Returns:

Gunzipped data.
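
As an illustration of the contract above (bytes in, bytes out, an exception on bad input), here is a standard-library sketch. It is not the library's code: `gunzip_sketch` and `GunzipSketchError` are hypothetical names standing in for `gunzip` and `GunzipException`.

```python
import gzip
import zlib


class GunzipSketchError(Exception):
    """Hypothetical stand-in for the library's GunzipException."""


def gunzip_sketch(data: bytes) -> bytes:
    """Decompress gzipped bytes, wrapping stdlib errors in one exception type."""
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error) as ex:
        # OSError covers "not a gzipped file"; EOFError covers truncated streams.
        raise GunzipSketchError(f"Unable to gunzip data: {ex}") from ex
```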

usp.helpers.html_unescape_strip(string: str | None) → str | None

Decode HTML entities and strip the string; return None if the result is empty or if the input is None.

Parameters:

string – String to decode HTML entities in.

Returns:

Stripped string with HTML entities decoded; None if parameter string was empty or None.
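
The described behaviour can be approximated with the standard library. This is a sketch under assumptions: `html_unescape_strip_sketch` is a hypothetical name, and decoding entities before stripping is one reasonable ordering, not necessarily the one the library uses.

```python
import html
from typing import Optional


def html_unescape_strip_sketch(string: Optional[str]) -> Optional[str]:
    """Decode HTML entities, strip whitespace, and map empty results to None."""
    if string is None:
        return None
    cleaned = html.unescape(string).strip()
    # An empty string is falsy, so it is normalised to None here.
    return cleaned or None
```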

usp.helpers.is_http_url(url: str) → bool

Return True if the URL uses the “http” or “https” scheme.

Parameters:

url – URL to test.

Returns:

True if the argument URL uses the “http” or “https” scheme.
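
A scheme check of this kind might look like the following standard-library sketch. `is_http_url_sketch` is a hypothetical name, and also requiring a non-empty host is an assumption of the sketch, not something the documentation above states.

```python
from urllib.parse import urlsplit


def is_http_url_sketch(url: str) -> bool:
    """Return True for URLs with an http/https scheme and a host part."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit() can reject malformed input, e.g. a bad IPv6 literal.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```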

usp.helpers.parse_iso8601_date(date_string: str) → datetime | None

Parse ISO 8601 date (e.g. from sitemap’s <publication_date>) into datetime.datetime object.

Parameters:

date_string – ISO 8601 date, e.g. “2018-01-12T21:57:27Z” or “1997-07-16T19:20:30+01:00”.

Returns:

datetime.datetime object of the parsed date; the return annotation also allows None.
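
The two example inputs above can be parsed with the standard library alone. This is a sketch, not the library's parser: `parse_iso8601_sketch` is a hypothetical name, it handles only the formats `datetime.fromisoformat()` accepts, and returning None for unparseable input is an assumption based on the return annotation.

```python
from datetime import datetime
from typing import Optional


def parse_iso8601_sketch(date_string: str) -> Optional[datetime]:
    """Parse a subset of ISO 8601 dates; return None if parsing fails."""
    normalized = date_string.strip()
    # Older Python versions of fromisoformat() reject a trailing "Z",
    # so rewrite it as an explicit UTC offset.
    if normalized.endswith(("Z", "z")):
        normalized = normalized[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(normalized)
    except ValueError:
        return None
```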

usp.helpers.parse_rfc2822_date(date_string: str) → datetime | None

Parse RFC 2822 date (e.g. from Atom’s <issued>) into datetime.datetime object.

Parameters:

date_string – RFC 2822 date, e.g. “Tue, 10 Aug 2010 20:43:53 -0000”.

Returns:

datetime.datetime object of the parsed date; the return annotation also allows None.
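
For comparison, the standard library can parse the example date above with `email.utils.parsedate_to_datetime`. This is a sketch only: `parse_rfc2822_sketch` is a hypothetical name, and returning None on failure is an assumption based on the return annotation.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_rfc2822_sketch(date_string: str) -> Optional[datetime]:
    """Parse an RFC 2822 date; return None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(date_string)
    except (TypeError, ValueError):
        # The exception type for invalid input differs between Python versions.
        return None
```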

usp.helpers.strip_url_to_homepage(url: str) → str

Strip URL to its homepage.

Raises:

StripURLToHomepageException – If URL is empty or cannot be parsed.

Parameters:

url – URL to strip, e.g. “http://www.example.com/page.html”.

Returns:

Stripped homepage URL, e.g. “http://www.example.com/”.
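
The transformation in the example above (drop the path, query, and fragment, keep scheme and host) can be sketched with `urllib.parse`. The names `strip_url_to_homepage_sketch` and `StripUrlSketchError` are hypothetical, and rejecting URLs without a scheme or host is an assumption about what "cannot be parsed" means.

```python
from urllib.parse import urlsplit, urlunsplit


class StripUrlSketchError(Exception):
    """Hypothetical stand-in for StripURLToHomepageException."""


def strip_url_to_homepage_sketch(url: str) -> str:
    """Reduce a URL to scheme://host/ and raise on empty or unparseable input."""
    if not url:
        raise StripUrlSketchError("URL is empty.")
    try:
        parts = urlsplit(url)
    except ValueError as ex:
        raise StripUrlSketchError(f"Unable to parse URL {url!r}: {ex}") from ex
    if not parts.scheme or not parts.netloc:
        raise StripUrlSketchError(f"URL {url!r} has no scheme or host.")
    # Keep scheme and host, reset the path to "/", drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))
```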

usp.helpers.ungzipped_response_content(url: str, response: AbstractWebClientSuccessResponse) → str

Return the HTTP response’s decoded content, gunzipping it first if necessary.

Parameters:
  • url – URL the response was fetched from.

  • response – Response object.

Returns:

Decoded and (if necessary) gunzipped response string.
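
The "gunzip if necessary, then decode" idea can be sketched on raw bytes. This is not the library's logic: the documentation above does not say how gzipped responses are detected or which encoding is used, so `decode_maybe_gzipped` (a hypothetical name) assumes detection by the gzip magic number and UTF-8 decoding, and it works on bytes instead of a response object.

```python
import gzip


def decode_maybe_gzipped(raw: bytes) -> str:
    """Gunzip raw bytes when they look gzipped, then decode them as text."""
    # Assumption: gzip streams are recognised by their two-byte magic number.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    # Assumption: content is UTF-8; undecodable bytes are replaced, not fatal.
    return raw.decode("utf-8", errors="replace")
```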