datopy.modeling

Description

Tools for data modeling, validation, and raw data processing, including auto-generated data models and a flexible framework for ETL workflows.

class BaseProcessor(model: BaseModel, query: NamedTuple)

Bases: object

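Base class for processors that retrieve data for a query from a model's API, process it, and load it into a dataframe.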

Parameters:

model (BaseModel) – The model whose API the data is retrieved from.
query (NamedTuple) – The query to retrieve data for.

Methods

`process`()	Process (extract/clean) retrieved data.
`retrieve`()	Retrieve data for the query from the API of the supplied model.
`to_df`()	Load the data into a dataframe for further processing or analysis.

process()

Process (extract/clean) retrieved data.

Raises:

NotImplementedError – If a subclass does not override this method.

retrieve()

Retrieve data for the query from the API of the supplied model.

Raises:

NotImplementedError – If a subclass does not override this method.

to_df()

Load the data into a dataframe for further processing or analysis.
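
The retrieve/process workflow described above can be illustrated with a minimal, self-contained sketch. It does not import datopy: `SketchProcessor` is a hypothetical stand-in that only mirrors the documented interface, and `TitleQuery`, `InMemoryProcessor`, the `data` attribute, and the sample record are all assumptions made for illustration. The idea that subclasses override `retrieve` and `process` is inferred from the documented `NotImplementedError`, not confirmed by the source.

```python
from typing import Any, NamedTuple


class SketchProcessor:
    """Hypothetical stand-in mirroring the documented BaseProcessor interface."""

    def __init__(self, model: Any, query: NamedTuple) -> None:
        self.model = model
        self.query = query
        self.data: Any = None  # assumed storage attribute, not from the source

    def retrieve(self) -> None:
        raise NotImplementedError("Subclasses must implement retrieve().")

    def process(self) -> None:
        raise NotImplementedError("Subclasses must implement process().")


class TitleQuery(NamedTuple):
    """Hypothetical query type with a single field."""

    title: str


class InMemoryProcessor(SketchProcessor):
    """Hypothetical subclass that reads from a dict instead of a real API."""

    SOURCE = {"blade runner": {"Title": " Blade Runner ", "Year": "1982"}}

    def retrieve(self) -> None:
        # Look up the raw record for the query.
        self.data = self.SOURCE[self.query.title.lower()]

    def process(self) -> None:
        # Clean the raw record: lowercase the keys, strip stray whitespace.
        self.data = {key.lower(): value.strip() for key, value in self.data.items()}


processor = InMemoryProcessor(model=dict, query=TitleQuery("Blade Runner"))
processor.retrieve()
processor.process()
print(processor.data)  # {'title': 'Blade Runner', 'year': '1982'}
```

Keeping retrieval and cleaning in separate methods lets a subclass replace either step without touching the other.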

class CustomTypes

Bases: object

Reusable custom field types. Whitespace around commas should be stripped before analysis.

Attributes

CSVnumsent
CSVnumstr
CSVstr

CSVnumsent

alias of Annotated[str]

CSVnumstr

alias of Annotated[str]

CSVstr

alias of Annotated[str]
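
The note about stripping whitespace around commas can be sketched with the standard library alone. The helper name `strip_csv_whitespace` is hypothetical and is not part of datopy; how the library itself applies this normalization to its annotated types is not shown in this page.

```python
import re


def strip_csv_whitespace(value: str) -> str:
    """Remove whitespace around commas and at both ends of the string."""
    return re.sub(r"\s*,\s*", ",", value.strip())


print(strip_csv_whitespace("Action , Sci-Fi,  Thriller "))  # Action,Sci-Fi,Thriller
```

After this normalization, `value.split(",")` yields clean items without a per-item `strip()`.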

apply_recursive(func: Callable[[...], Any], obj) → dict[str | int, Any] | Any

Convert a nested data structure (with explicit or implied key/value pairs) into a tree-like dictionary, applying a given function to terminal values.

Parameters:

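func (Callable[…, Any]) – The function to apply to each terminal value.
obj – The nested data structure to convert.

Returns:

The tree-like dictionary, with func applied to each terminal value.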

Return type:

dict

Examples

>>> from datopy.modeling import apply_recursive

Define the data

>>> nested_data =  {'type': 'album', 'url': 'link.com', 'audio_features': [
...     {'loudness': -11.4, 'duration_ms': 251},
...     {'loudness': -15.5, 'duration_ms': 284}]}
>>> print(nested_data)
{'type': 'album', 'url': 'link.com', 'audio_features': [{'loudness': -11.4, 'duration_ms': 251}, {'loudness': -15.5, 'duration_ms': 284}]}

Convert to a JSON-friendly representation

>>> serialized = apply_recursive(str, nested_data)
>>> print(serialized)
{'type': 'album', 'url': 'link.com', 'audio_features': {1: {'loudness': '-11.4', 'duration_ms': '251'}, 2: {'loudness': '-15.5', 'duration_ms': '284'}}}

Convert to field/type pairs

>>> schema = apply_recursive(lambda x: type(x).__name__, nested_data)
>>> print(schema)
{'type': 'str', 'url': 'str', 'audio_features': {1: {'loudness': 'float', 'duration_ms': 'int'}, 2: {'loudness': 'float', 'duration_ms': 'int'}}}
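
For readers who want to see the recursion spelled out, here is a minimal sketch that reproduces the outputs shown above. It is not the library's implementation: `apply_recursive_sketch` is a hypothetical name, and treating only lists and tuples as index-keyed containers is an assumption.

```python
from typing import Any, Callable


def apply_recursive_sketch(func: Callable[[Any], Any], obj: Any) -> Any:
    """Apply func to terminal values; turn lists and tuples into 1-based dicts."""
    if isinstance(obj, dict):
        return {key: apply_recursive_sketch(func, value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Indices start at 1, matching the documented output.
        return {i: apply_recursive_sketch(func, item) for i, item in enumerate(obj, start=1)}
    return func(obj)  # terminal value


nested_data = {'type': 'album', 'url': 'link.com', 'audio_features': [
    {'loudness': -11.4, 'duration_ms': 251},
    {'loudness': -15.5, 'duration_ms': 284}]}

schema = apply_recursive_sketch(lambda x: type(x).__name__, nested_data)
print(schema)
```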

compare_dict_keys(dict1: dict[object, object] | object, dict2: dict[object, object] | object) → dict[object, object] | str | None

Recursively compare two dictionaries and identify missing keys.

Parameters:

dict1 (dict) – The reference dictionary.
dict2 (dict) – The comparison dictionary to be checked against dict1.

Returns:

result – The nested dictionary of fields missing from dict2 relative to dict1, or None if no keys are missing.

Return type:

dict | List[str] | None

Examples

Setup

>>> from datopy.modeling import compare_dict_keys
>>> import copy
>>> dict1 = {'a1': 1, 'a2': 'two', 'a3': [3],
...          'b1': {'b11': 1, 'b12': 'two', 'b13': [3]},
...          'c1': {'c11': {'c111': 1, 'c112': 'two', 'c113': [3]}}
... }

Identical dictionaries

>>> dict2 = copy.deepcopy(dict1)
>>> compare_dict_keys(dict1, dict2)

Missing nesting level 0 key

>>> del dict2['a1']
>>> compare_dict_keys(dict1, dict2)
{'missing_keys': ['a1']}

Missing nesting level 1 key

>>> dict2 = copy.deepcopy(dict1)
>>> del dict2['b1']['b12']
>>> compare_dict_keys(dict1, dict2)
{'nested_diff': {'b1': {'missing_keys': ['b12']}}}

Missing nesting level 2 key

>>> dict2 = copy.deepcopy(dict1)
>>> del dict2['c1']['c11']['c113']
>>> compare_dict_keys(dict1, dict2)
{'nested_diff': {'c1': {'nested_diff': {'c11': {'missing_keys': ['c113']}}}}}
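
The nested result can be awkward to read at deeper levels. The sketch below flattens it into dotted key paths, working only from the `missing_keys` / `nested_diff` layout shown in the examples above. The helper `missing_key_paths` is hypothetical, and it assumes the result is either a dictionary in that layout or None.

```python
from typing import Optional


def missing_key_paths(diff: Optional[dict], prefix: str = "") -> list:
    """Flatten a compare_dict_keys-style result into dotted key paths."""
    if not diff:
        return []  # None or empty: nothing is missing
    paths = [f"{prefix}{key}" for key in diff.get("missing_keys", [])]
    for key, sub_diff in diff.get("nested_diff", {}).items():
        paths.extend(missing_key_paths(sub_diff, prefix=f"{prefix}{key}."))
    return paths


# Result copied from the "Missing nesting level 2 key" example.
diff = {'nested_diff': {'c1': {'nested_diff': {'c11': {'missing_keys': ['c113']}}}}}
print(missing_key_paths(diff))  # ['c1.c11.c113']
```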

list_to_dict(obj: list[object] | tuple[object] | set[object], max_items: int | None = None) → dict[int, object]

Provide a dictionary representation of a list or other iterable that is neither a dictionary nor string-like, using 1-based indices as keys.

Parameters:

obj (list | tuple | set) – The list, tuple, or set to convert to a dictionary representation.
max_items (int, default=None) – Option to impose a limit on the number of elements to iterate over. Intended use: constructing pattern-based data models from a sample.

Returns:

res – The supplied list’s dictionary representation.

Return type:

dict

Examples

>>> from datopy.modeling import list_to_dict

>>> my_list = [1, 'two', [3], {'four': 5}]
>>> list_to_dict(my_list)
{1: 1, 2: 'two', 3: [3], 4: {'four': 5}}

>>> my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> list_to_dict(my_list, max_items=5)
{1: 1, 2: 2, 3: 3, 4: 4, 5: 5}

>>> my_dict = dict(a=1, b='two')
>>> list_to_dict(my_dict)
Not running conversion since obj is already a dictionary.
{'a': 1, 'b': 'two'}
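
The `max_items` behaviour can be approximated with `itertools.islice`, as in the sketch below. This is an illustration consistent with the examples above, not the library's code: `list_to_dict_sketch` is a hypothetical name, and returning a dictionary unchanged without printing the notice is a simplification.

```python
from itertools import islice
from typing import Any, Iterable, Optional


def list_to_dict_sketch(obj: Iterable[Any], max_items: Optional[int] = None) -> dict:
    """Key the first max_items elements of obj by their 1-based position."""
    if isinstance(obj, dict):
        return obj  # already a dictionary: nothing to convert
    # islice(obj, None) iterates over everything, so None means "no limit".
    return dict(enumerate(islice(obj, max_items), start=1))


records = [{'id': n} for n in range(1, 1001)]
sample = list_to_dict_sketch(records, max_items=3)
print(sample)  # {1: {'id': 1}, 2: {'id': 2}, 3: {'id': 3}}
```

Sampling a few leading records like this keeps schema inference cheap when a response contains many similarly shaped items.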

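schema_jsonify(obj: dict[object, object]) → dict[object, object]

Convert a nested dictionary of field/type pairs into a JSON Schema-style dictionary.

Parameters:

obj (dict) – The schema to convert, given as nested field/type pairs.

Returns:

The JSON Schema-style representation of the supplied schema.

Return type:

dict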

Examples

>>> import pprint
>>> from datopy.modeling import schema_jsonify

>>> original_schema = {'name': 'str', 'quantity': 'int', 'features': {1: {'volume': 'str', 'duration': 'float'}, 2: {'volume': 'str', 'duration': 'float'}}, 'creator': {'person': {'name': 'str'}, 'company': {'name': 'str', 'location': 'str'}}}
>>> schema = schema_jsonify(original_schema)
>>> schema = {**{"title": "title", "description": "description"}, **schema}
>>> pprint.pp(schema, compact=True, depth=3)
{'title': 'title',
 'description': 'description',
 'type': 'object',
 'properties': {'name': {'type': 'string'},
                'quantity': {'type': 'number'},
                'features': {'type': 'array',
                             'minItems': 1,
                             'maxItems': 2,
                             'uniqueItems': True,
                             'items': {...}},
                'creator': {'type': 'object',
                            'properties': {...},
                            'required': [...]}},
 'required': ['name', 'quantity', 'features', 'creator']}
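
The object-handling part of this conversion can be sketched as follows. The sketch is partial and hypothetical: `jsonify_objects_sketch` is not a datopy function, the `'float'` mapping is an assumption (the output above only shows `'str'` and `'int'`), listing every key under `required` is inferred from the top level of that output, and index-keyed dictionaries (the `array` case) are deliberately left out because their `items` are elided above.

```python
# 'str' and 'int' mappings are taken from the output above; 'float' is assumed.
JSON_TYPES = {"str": "string", "int": "number", "float": "number"}


def jsonify_objects_sketch(schema: dict) -> dict:
    """Convert nested field/type pairs into JSON Schema-style objects."""
    properties = {}
    for field, value in schema.items():
        if isinstance(value, dict):
            properties[field] = jsonify_objects_sketch(value)  # nested object
        else:
            properties[field] = {"type": JSON_TYPES[value]}
    return {"type": "object", "properties": properties, "required": list(schema)}


result = jsonify_objects_sketch(
    {"name": "str", "quantity": "int", "creator": {"person": {"name": "str"}}}
)
print(result["properties"]["quantity"])  # {'type': 'number'}
print(result["required"])  # ['name', 'quantity', 'creator']
```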

Classes

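`BaseProcessor`(model, query)	Base class for processors that retrieve data for a query from a model's API, process it, and load it into a dataframe.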
`CustomTypes`()	Reusable custom field types.

Functions

`apply_recursive`(func, obj)	Convert a nested data structure (with explicit or implied key/value pairs) into a tree-like dictionary, applying a given function to terminal values.
`compare_dict_keys`(dict1, dict2)	Recursively compare two dictionaries and identify missing keys.
`list_to_dict`(obj[, max_items])	Provide a dictionary representation of a list or other iterable that is neither a dictionary nor string-like, using 1-based indices as keys.
`schema_jsonify`(obj)	Convert a nested dictionary of field/type pairs into a JSON Schema-style dictionary.