datopy.modeling
Description
Tools for data modeling, validation, and raw data processing.
Note
WIP.
Overview
Auto-generated data models
Tools for automated generation of data models from data.
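list_to_dict – Provide a dictionary representation of a list, using indices as keys.
compare_dict_keys – Compare two dictionaries recursively and identify missing keys.
apply_recursive – Apply func to each terminal value in a nested data structure.
schema_jsonify – _summary_.
A flexible framework for ETL workflows
BaseProcessor – The fundamental data processing structure.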
API
- list_to_dict(obj, max_items=None) → dict[int, object]
Provide a dictionary representation of a list, using indices as keys.
Also works with other iterables that are neither dictionaries nor string-like.
- Parameters:
obj (list) – A list to convert to a dictionary representation.
max_items (int, default=None) – Option to impose a limit on the number of elements to iterate over. Intended use: constructing pattern-based data models from a sample.
- Returns:
The supplied list’s dictionary representation.
- Return type:
dict[int, object]
Examples
>>> from datopy.modeling import list_to_dict
>>> my_list = [1, 'two', [3], {'four': 5}]
>>> list_to_dict(my_list)
{1: 1, 2: 'two', 3: [3], 4: {'four': 5}}

>>> my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> list_to_dict(my_list, max_items=5)
{1: 1, 2: 2, 3: 3, 4: 4, 5: 5}

>>> my_dict = dict(a=1, b='two')
>>> list_to_dict(my_dict)
Not running conversion since obj is already a dictionary.
{'a': 1, 'b': 'two'}
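The behaviour shown in these examples can be approximated in a few lines. The following is only a minimal sketch under stated assumptions, not datopy's implementation: the name list_to_dict_sketch is hypothetical, keys start at 1 because the documented output does, and the informational message printed for dictionaries is omitted.

```python
from itertools import islice


def list_to_dict_sketch(obj, max_items=None):
    """Illustrative stand-in for list_to_dict (hypothetical name)."""
    # Dictionaries are returned unchanged, as in the third example above.
    if isinstance(obj, dict):
        return obj
    # islice stops after max_items elements; None means no limit.
    limited = islice(obj, max_items)
    # Keys start at 1 to match the documented examples.
    return dict(enumerate(limited, start=1))


print(list_to_dict_sketch([1, 'two', [3], {'four': 5}]))
print(list_to_dict_sketch(range(1, 11), max_items=5))
```

Using islice keeps the sampling lazy, so a large iterable is not fully consumed when only a small sample is needed for a data model.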
- compare_dict_keys(dict1, dict2) → dict[object, object] | str | None
Compare two dictionaries recursively and identify missing keys.
- Parameters:
dict1 (dict) – The reference dictionary.
dict2 (dict) – The comparison dictionary to be checked against dict1.
- Returns:
The nested dictionary of fields missing from dict2 relative to dict1.
- Return type:
dict[object, object] | str | None
Examples
Setup
>>> from datopy.modeling import compare_dict_keys
>>> import copy
>>> dict1 = {'a1': 1, 'a2': 'two', 'a3': [3],
...          'b1': {'b11': 1, 'b12': 'two', 'b13': [3]},
...          'c1': {'c11': {'c111': 1, 'c112': 'two', 'c113': [3]}}
...          }
Identical dictionaries
>>> dict2 = copy.deepcopy(dict1)
>>> compare_dict_keys(dict1, dict2)
Missing nesting level 0 key
>>> del dict2['a1']
>>> compare_dict_keys(dict1, dict2)
{'missing_keys': ['a1']}
Missing nesting level 1 key
>>> dict2 = copy.deepcopy(dict1)
>>> del dict2['b1']['b12']
>>> compare_dict_keys(dict1, dict2)
{'nested_diff': {'b1': {'missing_keys': ['b12']}}}
Missing nesting level 2 key
>>> dict2 = copy.deepcopy(dict1)
>>> del dict2['c1']['c11']['c113']
>>> compare_dict_keys(dict1, dict2)
{'nested_diff': {'c1': {'nested_diff': {'c11': {'missing_keys': ['c113']}}}}}
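The result structure in the examples above ('missing_keys' at the current level, 'nested_diff' for deeper levels) suggests a straightforward recursion. The sketch below reproduces those outputs but is an assumption about the approach, not the library's code; compare_keys_sketch is a hypothetical name, and the str return variant listed in the signature is not modelled.

```python
import copy


def compare_keys_sketch(dict1, dict2):
    """Illustrative stand-in for compare_dict_keys (hypothetical name)."""
    result = {}
    # Keys present in the reference but absent from the comparison.
    missing = [key for key in dict1 if key not in dict2]
    if missing:
        result['missing_keys'] = missing
    # Recurse only where both sides hold a dictionary under the same key.
    nested = {}
    for key, value in dict1.items():
        if key in dict2 and isinstance(value, dict) and isinstance(dict2[key], dict):
            diff = compare_keys_sketch(value, dict2[key])
            if diff:
                nested[key] = diff
    if nested:
        result['nested_diff'] = nested
    # None signals that no keys are missing at any level.
    return result or None


reference = {'a1': 1, 'b1': {'b11': 1, 'b12': 'two'}}
candidate = copy.deepcopy(reference)
del candidate['b1']['b12']
print(compare_keys_sketch(reference, candidate))
```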
- apply_recursive(func, obj) → dict[str | int, Any] | Any
Apply func to each terminal value in a nested data structure.
Valid nested data structures include those with explicit or implied key/value pairs.
- Parameters:
func (Callable[…, Any]) – The function to apply to each terminal value.
obj – The nested data structure to transform.
- Returns:
A tree-like dictionary representation of the transformed obj.
- Return type:
dict[str | int, Any] | Any
Examples
>>> from datopy.modeling import apply_recursive
>>> import pprint
Define the data
>>> nested_data = {
...     'type': 'album', 'url': 'link.com', 'audio_features': [
...         {'loudness': -11.4, 'duration_ms': 251},
...         {'loudness': -15.5, 'duration_ms': 284}
...         ]
...     }
>>> pprint.pp(nested_data)
{'type': 'album',
 'url': 'link.com',
 'audio_features': [{'loudness': -11.4, 'duration_ms': 251},
                    {'loudness': -15.5, 'duration_ms': 284}]}
Convert to json-friendly representation
>>> serialized = apply_recursive(str, nested_data)
>>> pprint.pp(serialized)
{'type': 'album',
 'url': 'link.com',
 'audio_features': {1: {'loudness': '-11.4', 'duration_ms': '251'},
                    2: {'loudness': '-15.5', 'duration_ms': '284'}}}
Convert to field/type pairs
>>> schema = apply_recursive(lambda x: type(x).__name__, nested_data)
>>> pprint.pp(schema)
{'type': 'str',
 'url': 'str',
 'audio_features': {1: {'loudness': 'float', 'duration_ms': 'int'},
                    2: {'loudness': 'float', 'duration_ms': 'int'}}}
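To make the traversal concrete, here is a minimal sketch that reproduces the outputs above. It is an assumption about how such a function could work rather than datopy's implementation: apply_recursive_sketch is a hypothetical name, and only dictionaries, lists, and tuples are treated as containers.

```python
def apply_recursive_sketch(func, obj):
    """Illustrative stand-in for apply_recursive (hypothetical name)."""
    # Dictionaries: keep the keys and transform each value.
    if isinstance(obj, dict):
        return {key: apply_recursive_sketch(func, value)
                for key, value in obj.items()}
    # Lists and tuples: use 1-based indices as the implied keys, then recurse.
    if isinstance(obj, (list, tuple)):
        return {index: apply_recursive_sketch(func, item)
                for index, item in enumerate(obj, start=1)}
    # Anything else, including strings, is treated as a terminal value.
    return func(obj)


data = {'type': 'album', 'audio_features': [{'loudness': -11.4}, {'loudness': -15.5}]}
print(apply_recursive_sketch(lambda x: type(x).__name__, data))
```

Treating strings as terminal values is what keeps 'album' from being split into characters, which matches the documented output.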
- schema_jsonify(obj) → dict[object, object]
_summary_.
- Parameters:
obj (dict) – _description_.
- Returns:
_description_.
- Return type:
dict[object, object]
Examples
>>> import pprint
>>> from datopy.modeling import schema_jsonify

>>> original_schema = {
...     'name': 'str', 'quantity': 'int',
...     'features': {
...         1: {'volume': 'str', 'duration': 'float'},
...         2: {'volume': 'str', 'duration': 'float'}
...         },
...     'creator': {'person': {'name': 'str'},
...                 'company': {'name': 'str', 'location': 'str'}}
...     }
>>> schema = schema_jsonify(original_schema)
>>> schema = {**{"title": "title", "description": "description"}, **schema}
>>> pprint.pp(schema, compact=True, depth=3)
{'title': 'title',
 'description': 'description',
 'type': 'object',
 'properties': {'name': {'type': 'string'},
                'quantity': {'type': 'number'},
                'features': {'type': 'array',
                             'minItems': 1,
                             'maxItems': 2,
                             'uniqueItems': True,
                             'items': {...}},
                'creator': {'type': 'object',
                            'properties': {...},
                            'required': [...]}},
 'required': ['name', 'quantity', 'features', 'creator']}
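The object portion of the output above (a 'type', 'properties', and 'required' entry at each nesting level) can be sketched as follows. This is a partial illustration under assumptions, not the library's code: jsonify_sketch and TYPE_NAMES are hypothetical names, only the 'str' and 'int' mappings are taken from the example, and array handling (the integer-keyed 'features' entry) is deliberately left out.

```python
# Assumed mapping from Python type names to JSON Schema type names.
# 'str' -> 'string' and 'int' -> 'number' follow the example output;
# the remaining entries are guesses made for this sketch.
TYPE_NAMES = {'str': 'string', 'int': 'number', 'float': 'number', 'bool': 'boolean'}


def jsonify_sketch(schema):
    """Illustrative stand-in; handles nested objects only, not arrays."""
    properties = {}
    for field, spec in schema.items():
        if isinstance(spec, dict):
            # Nested dictionaries become nested object schemas.
            properties[field] = jsonify_sketch(spec)
        else:
            # Unknown type names are passed through unchanged.
            properties[field] = {'type': TYPE_NAMES.get(spec, spec)}
    # Every field found in the sample is listed as required.
    return {'type': 'object', 'properties': properties, 'required': list(schema)}


print(jsonify_sketch({'name': 'str', 'creator': {'person': {'name': 'str'}}}))
```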
- class CustomTypes
Bases: object
Define reusable custom field types.
Notes
Whitespace around commas should be stripped before analysis. For additional info on Pydantic custom types, see: https://docs.pydantic.dev/latest/concepts/types/.
Methods
CSVnumsent
CSVnumstr
CSVstr
- CSVstr
Lowercase comma-separated string. Excludes numerics and special characters.
alias of Annotated[str, FieldInfo(annotation=NoneType, required=True, description='CustomTypes:CSVstr', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])]
- CSVnumstr
Lowercase comma-separated string. Allows numerics; excludes special characters.
alias of Annotated[str, FieldInfo(annotation=NoneType, required=True, description='CustomTypes:CSVnumstr', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])]
- CSVnumsent
alias of Annotated[str, FieldInfo(annotation=NoneType, required=True, description='CustomTypes:CSVnumsent', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])]
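To see what the patterns in these aliases accept, they can be tried directly with the standard re module. This is only a rough check: the patterns are copied from the alias definitions above, while the constant names and the split_csv helper are hypothetical, and Pydantic's own pattern matching may differ in edge cases.

```python
import re

# Patterns copied from the alias definitions above.
CSV_STR = re.compile(r'^[a-z, ]+$')
CSV_NUM = re.compile(r'^[a-z0-9,.! ]+$')


def split_csv(value):
    # Strip whitespace around commas before analysing the items,
    # as the class notes recommend (hypothetical helper).
    return [item.strip() for item in value.split(',')]


print(bool(CSV_STR.match('rock, pop, jazz')))  # True: letters, commas, spaces
print(bool(CSV_STR.match('rock, pop2')))       # False: digits are rejected
print(bool(CSV_NUM.match('rock, pop2')))       # True: digits are allowed
print(split_csv('rock , pop,  jazz'))          # ['rock', 'pop', 'jazz']
```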
- class BaseProcessor(model: BaseModel, query: NamedTuple)
Bases: object
The fundamental data processing structure.
- Parameters:
model (BaseModel) – _description_.
query (NamedTuple) – _description_.
Methods
process() – Prepare (extract/clean) the retrieved data.
retrieve() – Extract data for the query from the API of the supplied model.
to_df() – Load the data into a dataframe for further processing or analysis.
- retrieve()
Extract data for the query from the API of the supplied model.
- Raises:
NotImplementedError – _description_.
- process()
Prepare (extract/clean) the retrieved data.
- Raises:
NotImplementedError – _description_.
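Because retrieve() and process() raise NotImplementedError, the class appears intended to be subclassed. The sketch below illustrates that pattern with self-contained stand-ins; it is an assumption about intended use, not datopy's code. SketchProcessor, TitleQuery, and InMemoryProcessor are hypothetical names, a plain dict stands in for the model, and to_df() is left out to avoid a dataframe dependency.

```python
from typing import NamedTuple


class SketchProcessor:
    """Stand-in mirroring the documented interface (hypothetical class)."""

    def __init__(self, model, query):
        self.model = model
        self.query = query

    def retrieve(self):
        raise NotImplementedError("Subclasses define how data is retrieved.")

    def process(self):
        raise NotImplementedError("Subclasses define how data is cleaned.")


class TitleQuery(NamedTuple):
    # Hypothetical query type; the field name is for illustration only.
    title: str


class InMemoryProcessor(SketchProcessor):
    """Hypothetical subclass that reads from a dict instead of a real API."""

    def retrieve(self):
        # The "model" here is a plain dict standing in for an API client.
        self.raw = self.model.get(self.query.title, {})
        return self.raw

    def process(self):
        # Keep only the fields that carry a non-empty value.
        self.data = {k: v for k, v in self.raw.items() if v not in (None, '')}
        return self.data


catalog = {'demo': {'artist': 'example artist', 'year': 2000, 'label': ''}}
processor = InMemoryProcessor(model=catalog, query=TitleQuery(title='demo'))
processor.retrieve()
print(processor.process())
```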
Classes
BaseProcessor – The fundamental data processing structure.
CustomTypes – Define reusable custom field types.
Functions
apply_recursive – Apply func to each terminal value in a nested data structure.
compare_dict_keys – Compare two dictionaries recursively and identify missing keys.
list_to_dict – Provide a dictionary representation of a list, using indices as keys.
schema_jsonify – _summary_.