datopy.etl

Description

Tools for efficient web-based data retrieval, data processing, table creation, and populating empty metadata fields.

omit_string_patterns(
    input_string: str,
    patterns: List[str],
) -> str

Helper to prune multiple character patterns from a string at once.

Parameters:
  • input_string (str) – The string to be cleaned.

  • patterns (List[str]) – A list of patterns to omit from the string.

Returns:

The input string with the supplied patterns omitted.

Return type:

str

Examples

>>> from datopy.etl import omit_string_patterns
>>> input_string = "[[A \\\\ messy * string * with undesirable /patterns]]"
>>> print(input_string)
[[A \\ messy * string * with undesirable /patterns]]
>>> patterns_to_omit = ["[[", "]]", "* ", "\\\\ ", "/", "messy ", "un"]
>>> output_string = omit_string_patterns(input_string, patterns_to_omit)
>>> print(output_string)
A string with desirable patterns
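
The source of omit_string_patterns is not shown on this page, so the following is only a minimal sketch of how such a helper could behave, assuming each pattern is treated as a literal substring and removed in list order. The name omit_patterns_sketch is hypothetical and is not part of datopy.

```python
from typing import List


def omit_patterns_sketch(input_string: str, patterns: List[str]) -> str:
    """Remove each literal pattern in turn (hypothetical stand-in, not datopy's code)."""
    cleaned = input_string
    for pattern in patterns:
        # str.replace matches the pattern literally, so no regex escaping is needed.
        cleaned = cleaned.replace(pattern, "")
    return cleaned


messy = "[[A \\\\ messy * string * with undesirable /patterns]]"
patterns_to_omit = ["[[", "]]", "* ", "\\\\ ", "/", "messy ", "un"]
print(omit_patterns_sketch(messy, patterns_to_omit))
# A string with desirable patterns
```

Under this assumption the order of the list matters: removing one pattern can create or destroy a match for a later one, so longer or more specific patterns are safest placed first.
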
retrieve_wiki_topics(
    listing_page: str,
    verbose: bool = True,
) -> List[str]

Retrieve the hyperlinked topics listed on a Wikipedia listing page.

Notes

Only hyperlinked topics (those with a Wikipedia page) are retrieved. Search Wikipedia’s catalogue of listing pages here: https://en.wikipedia.org/wiki/List_of_lists_of_lists

Parameters:
  • listing_page (str) – The title of a Wikipedia article containing topics to be retrieved.

  • verbose (bool, default=True) – Option to enable/disable printouts.

Returns:

target_pages – A list of topics (by article name) extracted from the listing page.

Return type:

List[str]
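
How retrieve_wiki_topics fetches and parses the listing page is not documented here. As a rough illustration of the "only hyperlinked topics" rule from the Notes, the sketch below filters article links out of a fragment of page HTML using only the standard library. The names _ArticleLinkParser and extract_article_titles and the sample HTML are hypothetical. The sketch assumes that existing articles are linked under /wiki/ while links to missing pages point elsewhere, and that a colon in the title marks a non-article namespace.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import unquote


class _ArticleLinkParser(HTMLParser):
    """Collect titles of internal article links (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.titles: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: existing articles are linked under /wiki/.
        if not href.startswith("/wiki/"):
            return
        raw = href[len("/wiki/"):].split("#")[0]  # drop any section anchor
        title = unquote(raw).replace("_", " ")
        # Heuristic: a colon usually marks a namespace such as File: or Help:.
        if title and ":" not in title and title not in self.titles:
            self.titles.append(title)


def extract_article_titles(html: str) -> List[str]:
    """Return de-duplicated article titles in order of first appearance."""
    parser = _ArticleLinkParser()
    parser.feed(html)
    return parser.titles


sample_html = (
    "<ul>"
    '<li><a href="/wiki/Lists_of_academic_journals">Lists of academic journals</a></li>'
    '<li><a href="/w/index.php?title=Missing&amp;action=edit" class="new">Missing</a></li>'
    '<li><a href="/wiki/Help:Contents">Help</a></li>'
    "<li>Plain text topic</li>"
    "</ul>"
)
print(extract_article_titles(sample_html))
# ['Lists of academic journals']
```

The colon check is deliberately crude: it also drops genuine articles whose titles contain a colon, so a real implementation would likely need a more precise namespace test.
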

Functions

omit_string_patterns(input_string, patterns)

Helper to prune multiple character patterns from a string at once.

retrieve_wiki_topics(listing_page[, verbose])

Retrieve the hyperlinked topics listed on a Wikipedia listing page.