datopy.etl

Description

Tools for efficient web-based data retrieval, data processing, table creation, and populating empty metadata fields.

Note

This module is a work in progress.

Overview

Extract

Utilities for data retrieval.

Transform

Basic data processing and transformation of raw data.

omit_string_patterns

Prune multiple character patterns from a string.

Load

Utilities related to finding and loading data into a database.

retrieve_wiki_topics

Compile a list of related topics by scraping a Wikipedia page.

API

omit_string_patterns(input_string: str, patterns: list[str]) → str

Prune multiple character patterns from a string.

Parameters:

input_string (str) – The string to clean.
patterns (list[str]) – A list of patterns to omit from the string.

Returns:

The input string with the supplied patterns omitted.

Return type:

str

Examples

>>> from datopy.etl import omit_string_patterns

>>> input_string = "[[A \\\\ messy * string * with undesirable /patterns]]"
>>> patterns_to_omit = ["[[", "]]", "* ", "\\\\ ", "/", "messy ", "un" ]
>>> output_string = omit_string_patterns(input_string, patterns_to_omit)
>>> print(output_string)
A string with desirable patterns
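
The behaviour shown in the example above can be approximated with sequential literal replacement. The following is a minimal sketch, not the library's implementation: the helper name `omit_patterns_sketch` is hypothetical, and it assumes that patterns are treated as literal substrings (not regular expressions) and removed in list order.

```python
def omit_patterns_sketch(input_string: str, patterns: list[str]) -> str:
    """Remove each literal pattern from the string, in list order.

    Hypothetical stand-in for omit_string_patterns; the real function
    may be implemented differently.
    """
    cleaned = input_string
    for pattern in patterns:
        # str.replace removes every occurrence of the literal substring.
        cleaned = cleaned.replace(pattern, "")
    return cleaned


messy = "[[A \\\\ messy * string * with undesirable /patterns]]"
to_omit = ["[[", "]]", "* ", "\\\\ ", "/", "messy ", "un"]
print(omit_patterns_sketch(messy, to_omit))
# A string with desirable patterns
```

Under this assumption the order of the list matters: removing one pattern can create or destroy a match for a later one.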

retrieve_wiki_topics(listing_page: str, verbose: bool = True) → list[str]

Compile a list of related topics by scraping a Wikipedia page.

Parameters:

listing_page (str) – The title of a Wikipedia article containing topics to be retrieved.
verbose (bool, default=True) – Option to enable/disable printouts.

Returns:

A list of topics (by article name) extracted from the listing page.

Return type:

list[str]

Notes

Only hyperlinked topics (those with a Wikipedia page) are retrieved. Search Wikipedia’s catalogue of listing pages here: https://en.wikipedia.org/wiki/List_of_lists_of_lists
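
The restriction to hyperlinked topics can be illustrated offline. The sketch below is not how `retrieve_wiki_topics` is implemented. It uses only the standard library and a hand-written HTML snippet. The names `WikiLinkCollector` and `extract_linked_topics` are hypothetical. It assumes that article links have an `href` starting with `/wiki/`, carry a `title` attribute, and that a colon in the `href` marks a non-article namespace.

```python
from html.parser import HTMLParser


class WikiLinkCollector(HTMLParser):
    """Collect article titles from anchors that point at /wiki/ pages."""

    def __init__(self) -> None:
        super().__init__()
        self.topics: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        title = attributes.get("title")
        # Assumption: article links start with /wiki/ and contain no
        # namespace prefix such as "Help:" or "Category:".
        if href.startswith("/wiki/") and ":" not in href and title:
            if title not in self.topics:
                self.topics.append(title)


def extract_linked_topics(html: str) -> list[str]:
    """Return the titles of linked articles, in order of appearance."""
    collector = WikiLinkCollector()
    collector.feed(html)
    return collector.topics


sample_html = """
<ul>
  <li><a href="/wiki/Linear_regression" title="Linear regression">Linear regression</a></li>
  <li>Unlinked topic</li>
  <li><a href="/wiki/Help:Contents" title="Help:Contents">Help</a></li>
  <li><a href="/wiki/Decision_tree" title="Decision tree">Decision tree</a></li>
</ul>
"""
print(extract_linked_topics(sample_html))
# ['Linear regression', 'Decision tree']
```

The unlinked list item is skipped because it has no anchor, which mirrors the note that only topics with their own Wikipedia page are returned.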

Functions

omit_string_patterns(input_string, patterns) – Prune multiple character patterns from a string.
retrieve_wiki_topics(listing_page[, verbose]) – Compile a list of related topics by scraping a Wikipedia page.