datopy.modeling

Description

Tools for data modeling, validation, and raw data processing, including auto-generated data models and a flexible framework for ETL workflows.

class BaseProcessor(model: BaseModel, query: NamedTuple)

Bases: object

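Base class for processors that retrieve data for a query from the API of a supplied model, process it, and load it into a dataframe.

Parameters:
  • model (BaseModel) – The model whose API is queried.

  • query (NamedTuple) – The query for which data is retrieved.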

Methods

process()

Process (extract/clean) retrieved data.

retrieve()

Retrieve data for the query from the API of the supplied model.

to_df()

Load the data into a dataframe for further processing or analysis.

process()

Process (extract/clean) retrieved data.

Raises:

NotImplementedError – Raised by the base class; subclasses are expected to override this method.

retrieve()

Retrieve data for the query from the API of the supplied model.

Raises:

NotImplementedError – Raised by the base class; subclasses are expected to override this method.

to_df()

Load the data into a dataframe for further processing or analysis.
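
The retrieve/process pattern described above can be sketched as follows. This is a minimal illustration, not the datopy implementation: SketchProcessor is a local stand-in that only mirrors the documented interface, and TitleQuery, InMemoryProcessor, and the SOURCE data are hypothetical names invented for the example. The to_df() step is left out so that the sketch needs nothing beyond the standard library.

```python
from typing import Any, NamedTuple


class SketchProcessor:
    """Local stand-in mirroring the documented interface (an assumption)."""

    def __init__(self, model: Any, query: NamedTuple) -> None:
        self.model = model
        self.query = query
        self.data: Any = None

    def retrieve(self) -> None:
        # The base class only defines the hook.
        raise NotImplementedError("Subclasses must implement retrieve().")

    def process(self) -> None:
        raise NotImplementedError("Subclasses must implement process().")


class TitleQuery(NamedTuple):
    """Hypothetical query type used only for this example."""

    title: str


class InMemoryProcessor(SketchProcessor):
    """Hypothetical subclass that reads from a dict instead of a real API."""

    SOURCE = {"Alien": {"Title": " Alien ", "Year": "1979"}}

    def retrieve(self) -> None:
        # Look up the raw record for the query.
        self.data = dict(self.SOURCE[self.query.title])

    def process(self) -> None:
        # Clean the record: lowercase the keys, strip stray whitespace.
        self.data = {key.lower(): value.strip() for key, value in self.data.items()}


processor = InMemoryProcessor(model=None, query=TitleQuery(title="Alien"))
processor.retrieve()
processor.process()
print(processor.data)  # {'title': 'Alien', 'year': '1979'}
```

Calling retrieve() or process() on the stand-in base class itself raises NotImplementedError, which matches the Raises entries documented above.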

class CustomTypes

Bases: object

Reusable custom field types. Whitespace around commas should be stripped before analysis.

Attributes

CSVnumsent

CSVnumstr

CSVstr

CSVnumsent

alias of Annotated[str]

CSVnumstr

alias of Annotated[str]

CSVstr

alias of Annotated[str]
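
The note about stripping whitespace around commas can be illustrated with a short sketch. CSVText and strip_comma_whitespace are hypothetical names introduced here, and the metadata attached to the alias is illustrative only; the page does not show how the documented aliases are annotated.

```python
import re
from typing import Annotated

# Hypothetical alias in the spirit of the documented CSV types (an assumption).
CSVText = Annotated[str, "comma-separated values"]


def strip_comma_whitespace(value: CSVText) -> str:
    """Remove whitespace on either side of each comma and at both ends."""
    return re.sub(r"\s*,\s*", ",", value.strip())


print(strip_comma_whitespace("Ridley Scott ,  James Cameron, David Fincher "))
# Ridley Scott,James Cameron,David Fincher
```

Whitespace inside an individual item, such as the space in a name, is left untouched; only the padding around the separators is removed.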

apply_recursive(func: Callable[[...], Any], obj) -> dict[str | int, Any] | Any

Convert a nested data structure (with explicit or implied key/value pairs) into a tree-like dictionary, applying a given function to terminal values.

Parameters:
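  • func (Callable[…, Any]) – The function to apply to each terminal (non-container) value.

  • obj – The nested data structure to convert.

Returns:

The tree-like dictionary, with func applied to its terminal values.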

Return type:

dict

Examples

>>> from datopy.modeling import apply_recursive

Define the data

>>> nested_data =  {'type': 'album', 'url': 'link.com', 'audio_features': [
...     {'loudness': -11.4, 'duration_ms': 251},
...     {'loudness': -15.5, 'duration_ms': 284}]}
>>> print(nested_data)
{'type': 'album', 'url': 'link.com', 'audio_features': [{'loudness': -11.4, 'duration_ms': 251}, {'loudness': -15.5, 'duration_ms': 284}]}

Convert to a JSON-friendly representation

>>> serialized = apply_recursive(str, nested_data)
>>> print(serialized)
{'type': 'album', 'url': 'link.com', 'audio_features': {1: {'loudness': '-11.4', 'duration_ms': '251'}, 2: {'loudness': '-15.5', 'duration_ms': '284'}}}

Convert to field/type pairs

>>> schema = apply_recursive(lambda x: type(x).__name__, nested_data)
>>> print(schema)
{'type': 'str', 'url': 'str', 'audio_features': {1: {'loudness': 'float', 'duration_ms': 'int'}, 2: {'loudness': 'float', 'duration_ms': 'int'}}}
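
For readers who want to see the recursion spelled out, here is a minimal sketch that reproduces the outputs shown in the examples above. It is not the library's implementation: apply_recursive_sketch is a hypothetical name, and handling only dicts, lists, and tuples is an assumption made for brevity.

```python
from typing import Any, Callable


def apply_recursive_sketch(func: Callable[[Any], Any], obj: Any) -> Any:
    """Apply func to terminal values; lists and tuples become 1-based dicts."""
    if isinstance(obj, dict):
        # Explicit key/value pairs: keep the keys, recurse into the values.
        return {key: apply_recursive_sketch(func, value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Implied key/value pairs: use the 1-based position as the key.
        return {i: apply_recursive_sketch(func, item) for i, item in enumerate(obj, start=1)}
    # Terminal value: apply the function.
    return func(obj)


nested_data = {'type': 'album', 'audio_features': [
    {'loudness': -11.4, 'duration_ms': 251},
    {'loudness': -15.5, 'duration_ms': 284}]}
print(apply_recursive_sketch(lambda x: type(x).__name__, nested_data))
```

The 1-based keys follow the documented output, where the first list element appears under key 1.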

compare_dict_keys(dict1: dict[object, object] | object, dict2: dict[object, object] | object) -> dict[object, object] | str | None

Recursively compare two dictionaries and identify missing keys.

Parameters:
  • dict1 (dict) – The reference dictionary.

  • dict2 (dict) – The comparison dictionary to be checked against dict1.

Returns:

result – The nested dictionary of fields missing from dict2 relative to dict1, or None if no keys are missing.

Return type:

dict | List[str] | None

Examples

Setup

>>> from datopy.modeling import compare_dict_keys
>>> import copy
>>> dict1 = {'a1': 1, 'a2': 'two', 'a3': [3],
...          'b1': {'b11': 1, 'b12': 'two', 'b13': [3]},
...          'c1': {'c11': {'c111': 1, 'c112': 'two', 'c113': [3]}}
... }

Identical dictionaries

>>> dict2 = copy.deepcopy(dict1)
>>> compare_dict_keys(dict1, dict2)

Missing nesting level 0 key

>>> del dict2['a1']
>>> compare_dict_keys(dict1, dict2)
{'missing_keys': ['a1']}

Missing nesting level 1 key

>>> dict2 = copy.deepcopy(dict1)
>>> del dict2['b1']['b12']
>>> compare_dict_keys(dict1, dict2)
{'nested_diff': {'b1': {'missing_keys': ['b12']}}}

Missing nesting level 2 key

>>> dict2 = copy.deepcopy(dict1)
>>> del dict2['c1']['c11']['c113']
>>> compare_dict_keys(dict1, dict2)
{'nested_diff': {'c1': {'nested_diff': {'c11': {'missing_keys': ['c113']}}}}}
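
The result structure in these examples ('missing_keys' at the current level, 'nested_diff' for deeper levels) can be reproduced with a short recursive sketch. compare_keys_sketch is a hypothetical name, and this is one possible way to get the documented outputs rather than the library's code; the str return case mentioned in the signature is not covered.

```python
from typing import Any, Optional


def compare_keys_sketch(reference: Any, candidate: Any) -> Optional[dict]:
    """Report keys of reference that are absent from candidate, level by level."""
    if not (isinstance(reference, dict) and isinstance(candidate, dict)):
        # Terminal values carry no keys to compare.
        return None
    result: dict = {}
    missing = [key for key in reference if key not in candidate]
    if missing:
        result["missing_keys"] = missing
    nested = {}
    for key, value in reference.items():
        if key in candidate:
            diff = compare_keys_sketch(value, candidate[key])
            if diff:
                nested[key] = diff
    if nested:
        result["nested_diff"] = nested
    # An empty result means nothing is missing.
    return result or None


reference = {'a1': 1, 'b1': {'b11': 1, 'b12': 'two'}}
print(compare_keys_sketch(reference, {'a1': 1, 'b1': {'b11': 1}}))
# {'nested_diff': {'b1': {'missing_keys': ['b12']}}}
```

Keys that exist only in the second dictionary are ignored, consistent with the description of dict1 as the reference.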

list_to_dict(obj: list[object] | tuple[object] | set[object], max_items: int | None = None) -> dict[int, object]

Provide a dictionary representation of a list or other iterable that is neither a dictionary nor string-like, using 1-based indices as keys.

Parameters:
  • obj (list) – A list to convert to a dictionary representation.

  • max_items (int, default=None) – Option to impose a limit on the number of elements to iterate over. Intended use: constructing pattern-based data models from a sample.

Returns:

res – The supplied list’s dictionary representation.

Return type:

dict

Examples

>>> from datopy.modeling import list_to_dict
>>> my_list = [1, 'two', [3], {'four': 5}]
>>> list_to_dict(my_list)
{1: 1, 2: 'two', 3: [3], 4: {'four': 5}}
>>> my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> list_to_dict(my_list, max_items=5)
{1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
>>> my_dict = dict(a=1, b='two')
>>> list_to_dict(my_dict)
Not running conversion since obj is already a dictionary.
{'a': 1, 'b': 'two'}
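
The behaviour shown in these examples (1-based keys, an optional item limit, and dictionaries passed through with a message) can be approximated in a few lines. list_to_dict_sketch is a hypothetical name, and this is a sketch under those observed behaviours, not the library's implementation.

```python
from itertools import islice
from typing import Any, Optional


def list_to_dict_sketch(obj: Any, max_items: Optional[int] = None) -> dict:
    """Map each element to its 1-based position, optionally truncating."""
    if isinstance(obj, dict):
        # Mirror the documented message and return the input unchanged.
        print("Not running conversion since obj is already a dictionary.")
        return obj
    # islice(obj, None) yields every element; islice(obj, n) yields the first n.
    return dict(enumerate(islice(obj, max_items), start=1))


print(list_to_dict_sketch(['a', 'b', 'c', 'd'], max_items=2))
# {1: 'a', 2: 'b'}
```

Using islice keeps the limit lazy, so only the first max_items elements are read, which suits the documented use case of building a model from a sample.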

schema_jsonify(obj: dict[object, object]) -> dict[object, object]

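Convert a field/type schema dictionary into a JSON Schema-style dictionary.

Parameters:

obj (dict) – The schema dictionary mapping field names to type names, possibly nested.

Returns:

The JSON Schema-style representation of the supplied schema.

Return type:

dict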

Examples

>>> import pprint
>>> from datopy.modeling import schema_jsonify
>>> original_schema = {'name': 'str', 'quantity': 'int', 'features': {1: {'volume': 'str', 'duration': 'float'}, 2: {'volume': 'str', 'duration': 'float'}}, 'creator': {'person': {'name': 'str'}, 'company': {'name': 'str', 'location': 'str'}}}
>>> schema = schema_jsonify(original_schema)
>>> schema = {**{"title": "title", "description": "description"}, **schema}
>>> pprint.pp(schema, compact=True, depth=3)
{'title': 'title',
 'description': 'description',
 'type': 'object',
 'properties': {'name': {'type': 'string'},
                'quantity': {'type': 'number'},
                'features': {'type': 'array',
                             'minItems': 1,
                             'maxItems': 2,
                             'uniqueItems': True,
                             'items': {...}},
                'creator': {'type': 'object',
                            'properties': {...},
                            'required': [...]}},
 'required': ['name', 'quantity', 'features', 'creator']}

Classes

BaseProcessor(model, query)

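Base class for processors that retrieve data for a query from the API of a supplied model, process it, and load it into a dataframe.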

CustomTypes()

Reusable custom field types.

Functions

apply_recursive(func, obj)

Convert a nested data structure (with explicit or implied key/value pairs) into a tree-like dictionary, applying a given function to terminal values.

compare_dict_keys(dict1, dict2)

Recursively compare two dictionaries and identify missing keys.

list_to_dict(obj[, max_items])

Provide a dictionary representation of a list or other iterable that is neither a dictionary nor string-like, using 1-based indices as keys.

schema_jsonify(obj)

Convert a field/type schema dictionary into a JSON Schema-style dictionary.