datopy.modeling

Description

Tools for data modeling, validation, and raw data processing.

Note

Work in progress.

Overview

Auto-generated data models

Tools for automated generation of data models from data.

`list_to_dict`	Provide a dictionary representation of a list, using indices as keys.
`compare_dict_keys`	Compare two dictionaries recursively and identify missing keys.
`apply_recursive`	Apply `func` to each terminal value in a nested data structure.
`schema_jsonify`	Convert a nested field/type schema into a JSON Schema-style dictionary.

A flexible framework for ETL workflows

BaseProcessor

The fundamental data processing structure.

API

list_to_dict(obj: list[object] | tuple[object] | set[object], max_items: int | None = None) → dict[int, object]

Provide a dictionary representation of a list, using indices as keys.

Also works with other iterables that are neither dictionaries nor string-like.

Parameters:

obj (list | tuple | set) – The iterable to convert to a dictionary representation.
max_items (int | None, default=None) – Optional limit on the number of elements to iterate over. Intended use: constructing pattern-based data models from a sample.

Returns:

The supplied list’s dictionary representation.

Return type:

dict

Examples

>>> from datopy.modeling import list_to_dict

>>> my_list = [1, 'two', [3], {'four': 5}]
>>> list_to_dict(my_list)
{1: 1, 2: 'two', 3: [3], 4: {'four': 5}}

>>> my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> list_to_dict(my_list, max_items=5)
{1: 1, 2: 2, 3: 3, 4: 4, 5: 5}

>>> my_dict = dict(a=1, b='two')
>>> list_to_dict(my_dict)
Not running conversion since obj is already a dictionary.
{'a': 1, 'b': 'two'}
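
The behavior shown in the examples above (1-based keys, an optional item limit, and dictionaries passed through unchanged) can be reproduced with a few lines of standard-library Python. The following is a minimal sketch under those assumptions, not datopy's actual implementation; `list_to_dict_sketch` is a hypothetical name.

```python
from itertools import islice


def list_to_dict_sketch(obj, max_items=None):
    """Illustrative stand-in for list_to_dict; not datopy's actual code."""
    if isinstance(obj, dict):
        # Mirror the documented behavior: dictionaries pass through unchanged.
        print("Not running conversion since obj is already a dictionary.")
        return obj
    # islice stops after max_items elements (None means no limit);
    # enumerate(..., start=1) yields the 1-based keys seen in the examples.
    return dict(enumerate(islice(obj, max_items), start=1))


print(list_to_dict_sketch([1, 'two', [3], {'four': 5}]))
print(list_to_dict_sketch(range(1, 11), max_items=5))
```

Using `islice` keeps the sketch lazy, so only the first `max_items` elements of a large iterable are ever visited.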

compare_dict_keys(dict1: dict[object, object] | object, dict2: dict[object, object] | object) → dict[object, object] | str | None

Compare two dictionaries recursively and identify missing keys.

Parameters:

dict1 (dict) – The reference dictionary.
dict2 (dict) – The comparison dictionary to be checked against dict1.

Returns:

The nested dictionary of fields missing from dict2 relative to dict1.

Return type:

dict | list[str] | None

Examples

Setup

>>> from datopy.modeling import compare_dict_keys
>>> import copy
>>> dict1 = {'a1': 1, 'a2': 'two', 'a3': [3],
...          'b1': {'b11': 1, 'b12': 'two', 'b13': [3]},
...          'c1': {'c11': {'c111': 1, 'c112': 'two', 'c113': [3]}}
... }

Identical dictionaries

>>> dict2 = copy.deepcopy(dict1)
>>> compare_dict_keys(dict1, dict2)

Missing nesting level 0 key

>>> del dict2['a1']
>>> compare_dict_keys(dict1, dict2)
{'missing_keys': ['a1']}

Missing nesting level 1 key

>>> dict2 = copy.deepcopy(dict1)
>>> del dict2['b1']['b12']
>>> compare_dict_keys(dict1, dict2)
{'nested_diff': {'b1': {'missing_keys': ['b12']}}}

Missing nesting level 2 key

>>> dict2 = copy.deepcopy(dict1)
>>> del dict2['c1']['c11']['c113']
>>> compare_dict_keys(dict1, dict2)
{'nested_diff': {'c1': {'nested_diff': {'c11': {'missing_keys': ['c113']}}}}}
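
The result shape in the examples above (`missing_keys` at the current level, `nested_diff` for deeper levels, and `None` when nothing is missing) suggests a straightforward recursion. The sketch below reproduces that shape under stated assumptions; `compare_keys_sketch` is a hypothetical name, and the handling of non-dictionary values is a guess rather than datopy's confirmed behavior.

```python
def compare_keys_sketch(dict1, dict2):
    """Illustrative stand-in for compare_dict_keys; not datopy's actual code."""
    if not (isinstance(dict1, dict) and isinstance(dict2, dict)):
        # Assumption: terminal (non-dict) values carry no keys to compare.
        return None
    result = {}
    # Keys present in the reference but absent from the comparison dict.
    missing = [key for key in dict1 if key not in dict2]
    if missing:
        result['missing_keys'] = missing
    # Recurse into the keys both dictionaries share.
    nested = {}
    for key in dict1:
        if key in dict2:
            diff = compare_keys_sketch(dict1[key], dict2[key])
            if diff:
                nested[key] = diff
    if nested:
        result['nested_diff'] = nested
    return result or None


reference = {'a': 1, 'b': {'c': 2, 'd': 3}}
candidate = {'b': {'c': 2}}
print(compare_keys_sketch(reference, candidate))
# {'missing_keys': ['a'], 'nested_diff': {'b': {'missing_keys': ['d']}}}
```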

apply_recursive(func: Callable[[...], Any], obj) → dict[str | int, Any] | Any

Apply func to each terminal value in a nested data structure.

Valid nested data structures include those with explicit or implied key/value pairs.

Parameters:

func (Callable[..., Any]) – The function to apply to each terminal value.
obj – The nested data structure to transform.

Returns:

A tree-like dictionary representation of the transformed obj.

Return type:

dict

Examples

>>> from datopy.modeling import apply_recursive
>>> import pprint

Define the data

>>> nested_data =  {
...     'type': 'album', 'url': 'link.com', 'audio_features': [
...         {'loudness': -11.4, 'duration_ms': 251},
...         {'loudness': -15.5, 'duration_ms': 284}
...     ]
... }
>>> pprint.pp(nested_data)
{'type': 'album',
 'url': 'link.com',
 'audio_features': [{'loudness': -11.4, 'duration_ms': 251},
                    {'loudness': -15.5, 'duration_ms': 284}]}

Convert to JSON-friendly representation

>>> serialized = apply_recursive(str, nested_data)
>>> pprint.pp(serialized)
{'type': 'album',
 'url': 'link.com',
 'audio_features': {1: {'loudness': '-11.4', 'duration_ms': '251'},
                    2: {'loudness': '-15.5', 'duration_ms': '284'}}}

Convert to field/type pairs

>>> schema = apply_recursive(lambda x: type(x).__name__, nested_data)
>>> pprint.pp(schema)
{'type': 'str',
 'url': 'str',
 'audio_features': {1: {'loudness': 'float', 'duration_ms': 'int'},
                    2: {'loudness': 'float', 'duration_ms': 'int'}}}
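
To make the traversal concrete, the following sketch reproduces the outputs above: dictionaries keep their keys, list-like containers are re-keyed with 1-based indices, and `func` is applied only to terminal values. It is an illustration under those assumptions, not datopy's actual code; `apply_recursive_sketch` is a hypothetical name.

```python
def apply_recursive_sketch(func, obj):
    """Illustrative stand-in for apply_recursive; not datopy's actual code."""
    if isinstance(obj, dict):
        # Explicit key/value pairs: keep the keys, recurse into the values.
        return {key: apply_recursive_sketch(func, value)
                for key, value in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        # Implied key/value pairs: use 1-based positions as keys.
        return {index: apply_recursive_sketch(func, value)
                for index, value in enumerate(obj, start=1)}
    # Terminal value: apply the function.
    return func(obj)


nested_data = {
    'type': 'album', 'url': 'link.com', 'audio_features': [
        {'loudness': -11.4, 'duration_ms': 251},
        {'loudness': -15.5, 'duration_ms': 284},
    ],
}
print(apply_recursive_sketch(lambda x: type(x).__name__, nested_data))
```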

schema_jsonify(obj: dict[object, object]) → dict[object, object]

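Convert a nested field/type schema into a JSON Schema-style dictionary.

Parameters:

obj (dict) – The field/type schema to convert.

Returns:

The JSON Schema-style representation of obj.

Return type:

dict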

Examples

>>> import pprint
>>> from datopy.modeling import schema_jsonify

>>> original_schema = {
...     'name': 'str', 'quantity': 'int',
...     'features': {
...         1: {'volume': 'str', 'duration': 'float'},
...         2: {'volume': 'str', 'duration': 'float'}
...     },
...     'creator': {
...         'person': {'name': 'str'},
...         'company': {'name': 'str', 'location': 'str'}
...     }
... }
>>> schema = schema_jsonify(original_schema)
>>> schema = {**{"title": "title", "description": "description"}, **schema}
>>> pprint.pp(schema, compact=True, depth=3)
{'title': 'title',
 'description': 'description',
 'type': 'object',
 'properties': {'name': {'type': 'string'},
                'quantity': {'type': 'number'},
                'features': {'type': 'array',
                             'minItems': 1,
                             'maxItems': 2,
                             'uniqueItems': True,
                             'items': {...}},
                'creator': {'type': 'object',
                            'properties': {...},
                            'required': [...]}},
 'required': ['name', 'quantity', 'features', 'creator']}
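
The object portion of the output above can be approximated with a short recursion. The sketch below covers string-keyed dictionaries only; it deliberately skips the integer-keyed (array) case because the `items` content is elided in the example. Only the `'str'` to `'string'` and `'int'` to `'number'` mappings are visible above; the `'float'` and `'bool'` entries are assumptions, and `jsonify_sketch` is a hypothetical name rather than datopy's implementation.

```python
# Assumed mapping from Python type names to JSON Schema type names.
# Only 'str' and 'int' are confirmed by the documented output.
TYPE_NAMES = {'str': 'string', 'int': 'number',
              'float': 'number', 'bool': 'boolean'}


def jsonify_sketch(schema):
    """Illustrative stand-in covering string-keyed objects only."""
    properties = {}
    for field, spec in schema.items():
        if isinstance(spec, dict):
            # Nested object: describe it recursively.
            properties[field] = jsonify_sketch(spec)
        else:
            # Terminal type name: translate it, or keep it if unknown.
            properties[field] = {'type': TYPE_NAMES.get(spec, spec)}
    # Assumption: every field present in the schema is marked as required.
    return {'type': 'object', 'properties': properties,
            'required': list(schema)}


print(jsonify_sketch({'name': 'str', 'quantity': 'int',
                      'creator': {'person': {'name': 'str'}}}))
```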

class CustomTypes

Bases: object

Define reusable custom field types.

Notes

Whitespace around commas should be stripped before analysis. For additional info on Pydantic custom types, see: https://docs.pydantic.dev/latest/concepts/types/.

Methods

CSVnumsent
CSVnumstr
CSVstr

CSVstr

Lowercase comma-separated string. Excludes numerics and special characters.

alias of Annotated[str, FieldInfo(annotation=NoneType, required=True, description='CustomTypes : CSVstr', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])]

CSVnumstr

Lowercase comma-separated string. Allows numerics; excludes special characters.

alias of Annotated[str, FieldInfo(annotation=NoneType, required=True, description='CustomTypes : CSVnumstr', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])]

CSVnumsent

alias of Annotated[str, FieldInfo(annotation=NoneType, required=True, description='CustomTypes : CSVnumsent', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])]
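
To see what the patterns above accept, they can be exercised directly with the standard `re` module, independently of Pydantic. This is a small sketch: the two patterns are copied from the aliases above, while `normalize_csv` is a hypothetical helper illustrating the whitespace stripping mentioned in the Notes (the lowercasing step is an added assumption).

```python
import re

# Patterns copied from the CSVstr and CSVnumstr aliases.
CSV_STR = re.compile(r'^[a-z, ]+$')
CSV_NUM_STR = re.compile(r'^[a-z0-9,.! ]+$')


def normalize_csv(text):
    """Lowercase the text and strip whitespace around commas."""
    return ','.join(part.strip() for part in text.lower().split(','))


value = normalize_csv('Rock , Jazz,  Blues')
print(value)                               # rock,jazz,blues
print(bool(CSV_STR.match(value)))          # True
print(bool(CSV_STR.match('top 40')))       # False: digits are not allowed
print(bool(CSV_NUM_STR.match('top 40')))   # True
```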

class BaseProcessor(model: BaseModel, query: NamedTuple)

Bases: object

The fundamental data processing structure.

Parameters:

model (BaseModel) – The model whose API supplies the data.
query (NamedTuple) – The query for which data is retrieved.

Methods

`process`()	Prepare (extract/clean) the retrieved data.
`retrieve`()	Extract data for the query from the API of the supplied model.
`to_df`()	Load the data into a dataframe for further processing or analysis.

retrieve()

Extract data for the query from the API of the supplied model.

Raises:

NotImplementedError – If a subclass does not override this method.

process()

Prepare (extract/clean) the retrieved data.

Raises:

NotImplementedError – If a subclass does not override this method.

to_df() → DataFrame

Load the data into a dataframe for further processing or analysis.

Returns:

The processed entry as a data frame.

Return type:

pd.DataFrame
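
The retrieve/process workflow described above can be illustrated with a toy subclass. Everything in this sketch is hypothetical: `SketchProcessor` is a stand-in that only mirrors the documented interface (it is not datopy's `BaseProcessor`), a plain dictionary takes the place of a model with an API, and returning `self` for chaining is a design choice of the sketch. `to_df` is omitted to keep the example free of a pandas dependency.

```python
from typing import NamedTuple


class SketchProcessor:
    """Hypothetical stand-in mirroring the documented interface."""

    def __init__(self, model, query):
        self.model = model
        self.query = query

    def retrieve(self):
        raise NotImplementedError

    def process(self):
        raise NotImplementedError


class TitleQuery(NamedTuple):
    title: str


class InMemoryProcessor(SketchProcessor):
    """Toy subclass that reads from a dictionary instead of an API."""

    def retrieve(self):
        # Look up the raw record for the queried title.
        self.raw = self.model[self.query.title]
        return self

    def process(self):
        # Clean each value: stringify, trim whitespace, lowercase.
        self.data = {key: str(value).strip().lower()
                     for key, value in self.raw.items()}
        return self


catalog = {'abbey road': {'artist': ' The Beatles ', 'year': 1969}}
processor = InMemoryProcessor(catalog, TitleQuery('abbey road'))
print(processor.retrieve().process().data)
# {'artist': 'the beatles', 'year': '1969'}
```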

Classes

`BaseProcessor`(model, query)	The fundamental data processing structure.
`CustomTypes`()	Define reusable custom field types.

Functions

`apply_recursive`(func, obj)	Apply `func` to each terminal value in a nested data structure.
`compare_dict_keys`(dict1, dict2)	Compare two dictionaries recursively and identify missing keys.
`list_to_dict`(obj[, max_items])	Provide a dictionary representation of a list, using indices as keys.
`schema_jsonify`(obj)	Convert a nested field/type schema into a JSON Schema-style dictionary.