datopy.etl

Description

Tools for efficient web-based data retrieval, data processing, table creation, and population of empty metadata fields.

Note

This module is a work in progress.

Overview

Extract

Utilities for data retrieval.

Transform

Basic data processing and transformation of raw data.

omit_string_patterns

Prune multiple character patterns from a string.

Load

Utilities related to finding and loading data into a database.

retrieve_wiki_topics

Compile a list of related topics by scraping a Wikipedia page.

API

omit_string_patterns(input_string: str, patterns: list[str]) -> str

Prune multiple character patterns from a string.

Parameters:
  • input_string (str) – The string to clean.

  • patterns (list[str]) – A list of patterns to omit from the string.

Returns:

The input string with the supplied patterns omitted.

Return type:

str

Examples

>>> from datopy.etl import omit_string_patterns
>>> input_string = "[[A \\\\ messy * string * with undesirable /patterns]]"
>>> patterns_to_omit = ["[[", "]]", "* ", "\\\\ ", "/", "messy ", "un"]
>>> output_string = omit_string_patterns(input_string, patterns_to_omit)
>>> print(output_string)
A string with desirable patterns
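
The behaviour shown in the example above can be approximated with plain substring removal. The following is a minimal sketch, not the datopy implementation: it assumes each pattern is a literal substring (not a regular expression) and that patterns are removed in list order. The name `prune_patterns` is hypothetical and used only for illustration.

```python
def prune_patterns(text: str, patterns: list[str]) -> str:
    """Remove every occurrence of each pattern, processing patterns in list order.

    Hypothetical stand-in for illustration; the real function may differ.
    """
    for pattern in patterns:
        # str.replace treats the pattern as a literal substring, not a regex.
        text = text.replace(pattern, "")
    return text


messy = "[[A \\\\ messy * string * with undesirable /patterns]]"
to_omit = ["[[", "]]", "* ", "\\\\ ", "/", "messy ", "un"]
print(prune_patterns(messy, to_omit))  # A string with desirable patterns
```

Under these assumptions the order of the patterns can matter, because removing one pattern may join the surrounding text into a new match for a later pattern.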

retrieve_wiki_topics(listing_page: str, verbose: bool = True) -> list[str]

Compile a list of related topics by scraping a Wikipedia page.

Parameters:
  • listing_page (str) – The title of a Wikipedia article containing topics to be retrieved.

  • verbose (bool, default=True) – Option to enable/disable printouts.

Returns:

A list of topics (by article name) extracted from the listing page.

Return type:

list[str]

Notes

Only hyperlinked topics (those with a Wikipedia page) are retrieved. Wikipedia’s catalogue of listing pages is available at https://en.wikipedia.org/wiki/List_of_lists_of_lists
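
To illustrate the "hyperlinked topics only" rule in the note above, here is a rough sketch of link filtering over already-downloaded page HTML, using only the standard library. It is not the datopy implementation. The names `WikiLinkCollector` and `extract_topics` are hypothetical, and the sketch assumes that links to existing articles use an href beginning with `/wiki/` and carry a `title` attribute, while links to missing pages do not.

```python
from html.parser import HTMLParser


class WikiLinkCollector(HTMLParser):
    """Collect article titles from links that point to existing pages (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.topics: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        title = attr.get("title")
        # Assumption: existing articles are linked as /wiki/<Title>; links to
        # missing pages use a different path and are skipped by this check.
        if not href.startswith("/wiki/") or title is None:
            return
        # Simplification: treat any colon as a namespace prefix (Help:, Category:).
        # This also drops the few article titles that contain a colon.
        if ":" in href[len("/wiki/"):]:
            return
        if title not in self.topics:
            self.topics.append(title)


def extract_topics(html: str) -> list[str]:
    """Return de-duplicated article titles in order of first appearance."""
    collector = WikiLinkCollector()
    collector.feed(html)
    return collector.topics


# Made-up sample markup, for illustration only.
sample_html = """
<ul>
  <li><a href="/wiki/Lists_of_animals" title="Lists of animals">Lists of animals</a></li>
  <li><a href="/w/index.php?title=Lists_of_widgets&amp;redlink=1" class="new"
         title="Lists of widgets (page does not exist)">Lists of widgets</a></li>
  <li><a href="/wiki/Category:Lists" title="Category:Lists">Category</a></li>
  <li>Lists of unlinked things</li>
</ul>
"""
print(extract_topics(sample_html))  # ['Lists of animals']
```

The unlinked list item and the link to a missing page are both dropped, which mirrors the note: a topic is kept only when it links to an existing article.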


Functions

omit_string_patterns(input_string, patterns)

Prune multiple character patterns from a string.

retrieve_wiki_topics(listing_page[, verbose])

Compile a list of related topics by scraping a Wikipedia page.