datopy.modeling

Description

Tools for data modeling, validation, and raw data processing.

Note

WIP.

Overview

Auto-generated data models

Tools for automated generation of data models from data.

list_to_dict

Provide a dictionary representation of a list, using indices as keys.

compare_dict_keys

Compare two dictionaries recursively and identify missing keys.

apply_recursive

Apply func to each terminal value in a nested data structure.

schema_jsonify

Convert a nested field/type schema into a JSON Schema-style dictionary.

A flexible framework for ETL workflows

BaseProcessor

The fundamental data processing structure.

API

list_to_dict(
    obj: list[object] | tuple[object] | set[object],
    max_items: int | None = None,
) -> dict[int, object]

Provide a dictionary representation of a list, using indices as keys.

Also compatible with other iterables, such as tuples and sets, that are neither dictionaries nor string-like. In the examples below, keys start at 1.

Parameters:
  • obj (list | tuple | set) – The list, or other compatible iterable, to convert to a dictionary representation.

  • max_items (int, default=None) – Option to impose a limit on the number of elements to iterate over. Intended use: constructing pattern-based data models from a sample.

Returns:

The supplied list’s dictionary representation.

Return type:

dict

Examples

>>> from datopy.modeling import list_to_dict
>>> my_list = [1, 'two', [3], {'four': 5}]
>>> list_to_dict(my_list)
{1: 1, 2: 'two', 3: [3], 4: {'four': 5}}
>>> my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> list_to_dict(my_list, max_items=5)
{1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
>>> my_dict = dict(a=1, b='two')
>>> list_to_dict(my_dict)
Not running conversion since obj is already a dictionary.
{'a': 1, 'b': 'two'}
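
The behavior shown in the examples above can be approximated in a few lines. The following is a minimal sketch, not the datopy source: the name list_to_dict_sketch is hypothetical, and the 1-based keys and the pass-through of dictionaries are inferred from the doctest output rather than from the implementation.

```python
from itertools import islice


def list_to_dict_sketch(obj, max_items=None):
    """Hypothetical stand-in for list_to_dict; not the datopy source."""
    if isinstance(obj, dict):
        # Mirror the documented behavior: dictionaries pass through unchanged.
        return obj
    # islice stops after max_items elements; None means no limit.
    limited = islice(obj, max_items)
    # Keys start at 1 to match the doctest output shown above.
    return dict(enumerate(limited, start=1))


print(list_to_dict_sketch(("a", "b", "c")))         # {1: 'a', 2: 'b', 3: 'c'}
print(list_to_dict_sketch(range(10), max_items=3))  # {1: 0, 2: 1, 3: 2}
```

Using islice rather than slicing means the limit also works for iterables that do not support indexing, which suits the sampling use case described for max_items.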

compare_dict_keys(
    dict1: dict[object, object] | object,
    dict2: dict[object, object] | object,
) -> dict[object, object] | str | None

Compare two dictionaries recursively and identify missing keys.

Parameters:
  • dict1 (dict) – The reference dictionary.

  • dict2 (dict) – The comparison dictionary to be checked against dict1.

Returns:

The nested dictionary of keys missing from dict2 relative to dict1, or None if no keys are missing.

Return type:

dict | list[str] | None

Examples

Setup

>>> from datopy.modeling import compare_dict_keys
>>> import copy
>>> dict1 = {'a1': 1, 'a2': 'two', 'a3': [3],
...          'b1': {'b11': 1, 'b12': 'two', 'b13': [3]},
...          'c1': {'c11': {'c111': 1, 'c112': 'two', 'c113': [3]}}
... }

Identical dictionaries

>>> dict2 = copy.deepcopy(dict1)
>>> compare_dict_keys(dict1, dict2)

Missing nesting level 0 key

>>> del dict2['a1']
>>> compare_dict_keys(dict1, dict2)
{'missing_keys': ['a1']}

Missing nesting level 1 key

>>> dict2 = copy.deepcopy(dict1)
>>> del dict2['b1']['b12']
>>> compare_dict_keys(dict1, dict2)
{'nested_diff': {'b1': {'missing_keys': ['b12']}}}

Missing nesting level 2 key

>>> dict2 = copy.deepcopy(dict1)
>>> del dict2['c1']['c11']['c113']
>>> compare_dict_keys(dict1, dict2)
{'nested_diff': {'c1': {'nested_diff': {'c11': {'missing_keys': ['c113']}}}}}
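
The recursive comparison illustrated above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the datopy implementation: the name compare_keys_sketch is hypothetical, and the 'missing_keys' and 'nested_diff' layout is copied from the doctest output.

```python
def compare_keys_sketch(reference, candidate):
    """Hypothetical stand-in for compare_dict_keys; not the datopy source."""
    if not (isinstance(reference, dict) and isinstance(candidate, dict)):
        # Non-dictionary values have no keys to compare.
        return None
    result = {}
    # Keys present in the reference but absent from the candidate.
    missing = [key for key in reference if key not in candidate]
    if missing:
        result["missing_keys"] = missing
    # Recurse into values that exist on both sides.
    nested = {}
    for key, value in reference.items():
        if key in candidate:
            diff = compare_keys_sketch(value, candidate[key])
            if diff:
                nested[key] = diff
    if nested:
        result["nested_diff"] = nested
    # An empty result means nothing is missing, reported as None.
    return result or None


reference = {"a": 1, "b": {"c": 2, "d": 3}}
print(compare_keys_sketch(reference, {"a": 1, "b": {"c": 2}}))
# {'nested_diff': {'b': {'missing_keys': ['d']}}}
```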

apply_recursive(
    func: Callable[..., Any],
    obj,
) -> dict[str | int, Any] | Any

Apply func to each terminal value in a nested data structure.

Valid nested data structures include those with explicit or implied key/value pairs.

Parameters:
  • func (Callable[..., Any]) – The function to apply to each terminal value.

  • obj – The nested data structure to traverse.

Returns:

A tree-like dictionary representation of the transformed obj.

Return type:

dict

Examples

>>> from datopy.modeling import apply_recursive
>>> import pprint

Define the data

>>> nested_data =  {
...     'type': 'album', 'url': 'link.com', 'audio_features': [
...         {'loudness': -11.4, 'duration_ms': 251},
...         {'loudness': -15.5, 'duration_ms': 284}
...     ]
... }
>>> pprint.pp(nested_data)
{'type': 'album',
 'url': 'link.com',
 'audio_features': [{'loudness': -11.4, 'duration_ms': 251},
                    {'loudness': -15.5, 'duration_ms': 284}]}

Convert to a JSON-friendly representation

>>> serialized = apply_recursive(str, nested_data)
>>> pprint.pp(serialized)
{'type': 'album',
 'url': 'link.com',
 'audio_features': {1: {'loudness': '-11.4', 'duration_ms': '251'},
                    2: {'loudness': '-15.5', 'duration_ms': '284'}}}

Convert to field/type pairs

>>> schema = apply_recursive(lambda x: type(x).__name__, nested_data)
>>> pprint.pp(schema)
{'type': 'str',
 'url': 'str',
 'audio_features': {1: {'loudness': 'float', 'duration_ms': 'int'},
                    2: {'loudness': 'float', 'duration_ms': 'int'}}}
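
The traversal behind these examples can be sketched in a few lines. This is an assumption-laden approximation rather than the datopy source: apply_recursive_sketch is a hypothetical name, and treating lists, tuples, and sets as 1-based dictionaries is inferred from the output shown above.

```python
def apply_recursive_sketch(func, obj):
    """Hypothetical stand-in for apply_recursive; not the datopy source."""
    if isinstance(obj, dict):
        # Explicit key/value pairs: keep the keys, recurse into the values.
        return {key: apply_recursive_sketch(func, value)
                for key, value in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        # Implied key/value pairs: positions (starting at 1) become keys.
        return {index: apply_recursive_sketch(func, value)
                for index, value in enumerate(obj, start=1)}
    # Anything else is treated as a terminal value.
    return func(obj)


data = {"url": "link.com", "features": [{"loudness": -11.4}, {"loudness": -15.5}]}
print(apply_recursive_sketch(lambda x: type(x).__name__, data))
# {'url': 'str', 'features': {1: {'loudness': 'float'}, 2: {'loudness': 'float'}}}
```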

schema_jsonify(
    obj: dict[object, object],
) -> dict[object, object]

Convert a nested field/type schema into a JSON Schema-style dictionary.

Parameters:

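obj (dict) – A nested schema of field/type-name pairs, as in the example below.

Returns:

A JSON Schema-style dictionary with 'type', 'properties', and 'required' entries.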

Return type:

dict

Examples

>>> import pprint
>>> from datopy.modeling import schema_jsonify
>>> original_schema = {
...     'name': 'str', 'quantity': 'int',
...     'features': {
...         1: {'volume': 'str', 'duration': 'float'},
...         2: {'volume': 'str', 'duration': 'float'}
...     },
...     'creator': {'person': {'name': 'str'},
...                 'company': {'name': 'str', 'location': 'str'}}
... }
>>> schema = schema_jsonify(original_schema)
>>> schema = {**{"title": "title", "description": "description"}, **schema}
>>> pprint.pp(schema, compact=True, depth=3)
{'title': 'title',
 'description': 'description',
 'type': 'object',
 'properties': {'name': {'type': 'string'},
                'quantity': {'type': 'number'},
                'features': {'type': 'array',
                             'minItems': 1,
                             'maxItems': 2,
                             'uniqueItems': True,
                             'items': {...}},
                'creator': {'type': 'object',
                            'properties': {...},
                            'required': [...]}},
 'required': ['name', 'quantity', 'features', 'creator']}
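
The object-handling part of this conversion can be sketched as below. This is a partial, hypothetical illustration and not the datopy implementation: jsonify_objects_sketch and TYPE_MAP are invented names, only the 'str' and 'int' mappings are confirmed by the output above, listing every field under 'required' is an assumption for nested objects, and array handling is left out because its 'items' content is elided in the example.

```python
# Assumed mapping from Python type names to JSON Schema types. Only the
# 'str' and 'int' rows are confirmed by the example output; the rest are guesses.
TYPE_MAP = {"str": "string", "int": "number", "float": "number", "bool": "boolean"}


def jsonify_objects_sketch(schema):
    """Hypothetical, partial stand-in for schema_jsonify (objects only)."""
    properties = {}
    for field, value in schema.items():
        if isinstance(value, dict):
            # Nested dictionaries become nested JSON Schema objects.
            properties[field] = jsonify_objects_sketch(value)
        else:
            # Leaf values are type names such as 'str' or 'int'.
            properties[field] = {"type": TYPE_MAP[value]}
    # Assumption: every field present in the schema is marked as required.
    return {"type": "object", "properties": properties, "required": list(schema)}


print(jsonify_objects_sketch({"name": "str", "creator": {"person": {"name": "str"}}})["required"])
# ['name', 'creator']
```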

class CustomTypes

Bases: object

Define reusable custom field types.

Notes

Whitespace around commas should be stripped before analysis. For additional info on Pydantic custom types, see: https://docs.pydantic.dev/latest/concepts/types/.

Methods

CSVnumsent

CSVnumstr

CSVstr

CSVstr

Lowercase comma-separated string. Excludes numerics and special characters.

alias of Annotated[str, FieldInfo(annotation=NoneType, required=True, description='CustomTypes : CSVstr', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])]

CSVnumstr

Lowercase comma-separated string. Allows numerics; excludes special characters.

alias of Annotated[str, FieldInfo(annotation=NoneType, required=True, description='CustomTypes : CSVnumstr', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])]

CSVnumsent

alias of Annotated[str, FieldInfo(annotation=NoneType, required=True, description='CustomTypes : CSVnumsent', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])]
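
To see which strings the patterns listed above accept, they can be tried directly with the standard re module. This sketch does not use Pydantic or datopy; normalize_csv is a hypothetical helper that illustrates the note about stripping whitespace around commas, and the two regular expressions are copied from the alias definitions.

```python
import re

# Patterns copied from the alias definitions above.
CSV_STR = re.compile(r"^[a-z, ]+$")
CSV_NUM = re.compile(r"^[a-z0-9,.! ]+$")


def normalize_csv(text):
    """Strip whitespace around commas (hypothetical helper)."""
    return ",".join(part.strip() for part in text.split(","))


sample = normalize_csv("rock , indie,  folk")
print(sample)                           # rock,indie,folk
print(bool(CSV_STR.match(sample)))      # True
print(bool(CSV_STR.match("rock,80s")))  # False: digits are not allowed
print(bool(CSV_NUM.match("rock,80s")))  # True
print(bool(CSV_STR.match("Rock")))      # False: uppercase is not allowed
```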

class BaseProcessor(
    model: BaseModel,
    query: NamedTuple,
)

Bases: object

The fundamental data processing structure.

Parameters:
  • model (BaseModel) – The model whose API supplies the data.

  • query (NamedTuple) – The query specifying the data to retrieve.

Methods

process()

Prepare (extract/clean) the retrieved data.

retrieve()

Extract data for the query from the API of the supplied model.

to_df()

Load the data into a dataframe for further processing or analysis.

retrieve()

Extract data for the query from the API of the supplied model.

Raises:

NotImplementedError – If the method is not overridden by a subclass.

process()

Prepare (extract/clean) the retrieved data.

Raises:

NotImplementedError – If the method is not overridden by a subclass.

to_df() -> DataFrame

Load the data into a dataframe for further processing or analysis.

Returns:

The processed entry as a data frame.

Return type:

pd.DataFrame
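
The retrieve/process workflow described above might be used along the following lines. Everything here is a hypothetical stand-in rather than datopy code: ProcessorSketch only mirrors the documented method names, a plain dictionary plays the role of the model, returning self for chaining is an assumption, and to_df is omitted so the sketch stays free of third-party dependencies.

```python
from typing import NamedTuple


class ProcessorSketch:
    """Hypothetical stand-in mirroring the documented interface."""

    def __init__(self, model, query):
        self.model = model
        self.query = query
        self.data = None

    def retrieve(self):
        raise NotImplementedError("Subclasses define how data is fetched.")

    def process(self):
        raise NotImplementedError("Subclasses define how data is cleaned.")


class TitleQuery(NamedTuple):
    """Hypothetical query type with a single field."""
    title: str


class InMemoryProcessor(ProcessorSketch):
    def retrieve(self):
        # The "model" is a plain dict standing in for an API client.
        self.data = self.model.get(self.query.title.lower(), {})
        return self

    def process(self):
        # Clean the field names: trim whitespace and lowercase them.
        self.data = {key.strip().lower(): value for key, value in self.data.items()}
        return self


catalog = {"example album": {" Year ": 1999, "Artist": "example artist"}}
result = InMemoryProcessor(catalog, TitleQuery("Example Album")).retrieve().process()
print(result.data)  # {'year': 1999, 'artist': 'example artist'}
```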


Classes

BaseProcessor(model, query)

The fundamental data processing structure.

CustomTypes()

Define reusable custom field types.

Functions

apply_recursive(func, obj)

Apply func to each terminal value in a nested data structure.

compare_dict_keys(dict1, dict2)

Compare two dictionaries recursively and identify missing keys.

list_to_dict(obj[, max_items])

Provide a dictionary representation of a list, using indices as keys.

schema_jsonify(obj)

Convert a nested field/type schema into a JSON Schema-style dictionary.