datopy.etl
Description
Tools for efficient web-based data retrieval, data processing, table creation, and populating empty metadata fields.
Note
WIP.
Overview
Extract
Utilities for data retrieval.
Transform
Basic data processing and transformation of raw data.
omit_string_patterns – Prune multiple character patterns from a string.
Load
Utilities related to finding and loading data into a database.
retrieve_wiki_topics – Compile a list of related topics by scraping a Wikipedia page.
API
- omit_string_patterns(input_string: str, patterns: list[str]) → str
Prune multiple character patterns from a string.
- Parameters:
input_string (str) – The string to be cleaned.
patterns (list[str]) – A list of patterns to omit from the string.
- Returns:
The input string with the supplied patterns omitted.
- Return type:
str
Examples
>>> from datopy.etl import omit_string_patterns
>>> input_string = "[[A \\\\ messy * string * with undesirable /patterns]]"
>>> patterns_to_omit = ["[[", "]]", "* ", "\\\\ ", "/", "messy ", "un"]
>>> output_string = omit_string_patterns(input_string, patterns_to_omit)
>>> print(output_string)
A string with desirable patterns
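The behaviour described above can be approximated with a short sketch. This is not the datopy implementation, which is not shown on this page. It assumes that patterns are literal substrings (not regular expressions) and that they are removed one after another in list order. The name omit_patterns_sketch is hypothetical.

```python
def omit_patterns_sketch(input_string, patterns):
    """Remove each literal pattern from the string, in list order.

    Illustrative stand-in only; the real omit_string_patterns may differ.
    """
    cleaned = input_string
    for pattern in patterns:
        # str.replace removes every occurrence of the literal substring.
        cleaned = cleaned.replace(pattern, "")
    return cleaned


print(omit_patterns_sketch("[[A messy * string]]", ["[[", "]]", "* ", "messy "]))
# prints: A string
```

Under this assumption the order of the patterns matters: removing one pattern can create or destroy a match for a later one, so longer or more specific patterns are usually best listed first.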
- retrieve_wiki_topics(listing_page: str, verbose: bool = True) → list[str]
Compile a list of related topics by scraping a Wikipedia page.
- Parameters:
listing_page (str) – The title of a Wikipedia article containing topics to be retrieved.
verbose (bool, default=True) – Option to enable/disable printouts.
- Returns:
A list of topics (by article name) extracted from the listing page.
- Return type:
list[str]
Notes
Only hyperlinked topics (those with a Wikipedia page) are retrieved. Search Wikipedia’s catalogue of listing pages here: https://en.wikipedia.org/wiki/List_of_lists_of_lists
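To illustrate why only hyperlinked topics are returned, the sketch below extracts article names from the internal links of an HTML fragment using only the standard library. It is an assumption about how such scraping could work, not the datopy implementation; the names _WikiLinkParser and extract_linked_topics are hypothetical, and fetching the page itself is left out.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class _WikiLinkParser(HTMLParser):
    """Collect article names from internal /wiki/ links (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.topics = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep only internal article links; namespaced pages such as
        # "Help:" or "File:" contain a colon and are skipped.
        if not href.startswith("/wiki/") or ":" in href:
            return
        # Drop any section anchor, decode percent-escapes, restore spaces.
        slug = href[len("/wiki/"):].split("#")[0]
        name = unquote(slug).replace("_", " ")
        if name and name not in self.topics:
            self.topics.append(name)


def extract_linked_topics(html):
    """Return linked article names in order of first appearance."""
    parser = _WikiLinkParser()
    parser.feed(html)
    return parser.topics


sample_html = """
<ul>
  <li><a href="/wiki/Graph_theory">Graph theory</a></li>
  <li>Unlinked topic</li>
  <li><a href="/wiki/Set_theory#History">Set theory</a></li>
  <li><a href="/wiki/Help:Contents">Help</a></li>
</ul>
"""
print(extract_linked_topics(sample_html))
# prints: ['Graph theory', 'Set theory']
```

The unlinked list item has no anchor tag, so a link-based approach like this one never sees it, which matches the note above.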
Functions
omit_string_patterns – Prune multiple character patterns from a string.
retrieve_wiki_topics – Compile a list of related topics by scraping a Wikipedia page.