datopy.models.media

Description

Data models, validators, and ETL tools for scraped media data.

Includes support for film reviews (via IMDb), music albums (via Spotify), and related information (via Wikipedia).

Note

This module is a work in progress.

Overview

Data models

`IMDbFilm`	Data model for processed IMDb metadata.
`SpotifyAlbum`	Data model for processed Spotify metadata.

API

class MediaQuery(title: str, artist: str | None = None)[source]

Bases: NamedTuple

Query object types for media metadata retrieval.

title: str

Alias for field number 0

artist: str | None

Alias for field number 1

class Film(title: str, artist: str | None = None)

Bases: MediaQuery

class Album(title: str, artist: str | None = None)

Bases: MediaQuery

class Book(title: str, artist: str | None = None)

Bases: MediaQuery
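Because the query classes derive from `NamedTuple`, they behave like small immutable records. The sketch below redefines stand-in classes locally instead of importing them from `datopy`, and uses `Optional[str]` in place of `str | None` so it also runs on older Python versions. The album title and artist are arbitrary sample values.

```python
from typing import NamedTuple, Optional


class MediaQuery(NamedTuple):
    """Stand-in for datopy's MediaQuery, redefined here for illustration."""
    title: str
    artist: Optional[str] = None


class Film(MediaQuery):
    """Stand-in for the Film query type."""


class Album(MediaQuery):
    """Stand-in for the Album query type."""


film = Film("spirited away")
album = Album("kid a", artist="radiohead")

# Fields are reachable by name or by position (field numbers 0 and 1).
film_title = film.title
album_artist = album[1]

# Subclasses share one shape but let callers tell query kinds apart.
is_film = isinstance(film, Film) and not isinstance(film, Album)
```

Keeping one tuple shape for all media kinds means a retrieval function can dispatch on the query's class while reading the same two fields.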

pydantic model IMDbFilm[source]#

Bases: BaseModel

Data model for processed IMDb metadata.

Examples

>>> from pydantic import ValidationError
>>> from datopy.models.media import IMDbFilm
>>> from datopy._examples import imdb_film_retrieve

Valid film

>>> valid_film = IMDbFilm(
...     title='name 10!', imdb_id='tt1234567', kind='movie',
...     year=1990, rating=7.2, votes=122,
...     genres='romantic comedy, thriller', cast='mrs smith,mr smith',
...     plot='alas! once upon a time, ...',
...     budget_mil=1123929)

Invalid film

>>> invalid_film = dict(
...     title='name', imdb_id='tt12', year=1975, votes=-2, rating=5.0)
>>> try:
...     IMDbFilm(**invalid_film)
... except ValidationError as e:
...     print(e)          # use pprint.pp(e.errors()) for easy-to-read list
3 validation errors for IMDbFilm
imdb_id
  String should match pattern '^tt.*\d{7}$' [type=string_pattern_mismatch, input_value='tt12', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/string_pattern_mismatch
kind
  Field required [type=missing, input_value={'title': 'name', 'imdb_i...tes': -2, 'rating': 5.0}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/missing
votes
  Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-2, input_type=int]
    For further information visit https://errors.pydantic.dev/2.8/v/greater_than_equal

Survey available fields and types

>>> import pprint
>>> from datopy.models.media import Film
>>> from datopy._examples import imdb_film_retrieve
>>> from datopy.modeling import apply_recursive
>>> film = imdb_film_retrieve(Film('spirited away'))
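The docstring's commented-out lines suggest the survey is done by mapping `type(x).__name__` over the retrieved record. The following is a minimal sketch of that idea. `apply_recursive_sketch` is a hypothetical stand-in written for this example, not the implementation of `datopy.modeling.apply_recursive`, and the `film` dictionary is a made-up record rather than real retrieval output.

```python
import pprint
from typing import Any, Callable


def apply_recursive_sketch(func: Callable[[Any], Any], obj: Any) -> Any:
    """Apply ``func`` to every leaf of a nested dict/list structure."""
    if isinstance(obj, dict):
        return {key: apply_recursive_sketch(func, value)
                for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [apply_recursive_sketch(func, item) for item in obj]
    return func(obj)


# Made-up record standing in for a retrieved film (assumption).
film = {"title": "spirited away", "year": 2001,
        "genres": ["animation", "fantasy"]}

# Replace each leaf value with the name of its type.
type_survey = apply_recursive_sketch(lambda x: type(x).__name__, film)
pprint.pp(type_survey, depth=3)
```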

Show JSON schema

{
   "title": "IMDbFilm",
   "description": "Data model for processed imdb metadata.\n\nExamples\n--------\n>>> from pydantic import ValidationError\n>>> from datopy.models.media import IMDbFilm\n>>> from datopy._examples import imdb_film_retrieve\n\nValid film\n\n>>> valid_film = IMDbFilm(\n...     title='name 10!', imdb_id='tt1234567', kind='movie',\n...     year=1990, rating=7.2, votes=122,\n...     genres='romantic comedy, thriller', cast='mrs smith,mr smith',\n...     plot='alas! once upon a time, ...',\n...     budget_mil=1123929)\n\nInvalid film\n\n>>> invalid_film = dict(\n...     title='name', imdb_id='tt12', year=1975, votes=-2, rating=5.0)\n>>> try:\n...     IMDbFilm(**invalid_film)\n... except ValidationError as e:\n...     print(e)          # use pprint.pp(e.errors()) for easy-to-read list\n3 validation errors for IMDbFilm\nimdb_id\n  String should match pattern '^tt.*\\d{7}$' [type=string_pattern_mismatch, input_value='tt12', input_type=str]\n    For further information visit https://errors.pydantic.dev/2.8/v/string_pattern_mismatch\nkind\n  Field required [type=missing, input_value={'title': 'name', 'imdb_i...tes': -2, 'rating': 5.0}, input_type=dict]\n    For further information visit https://errors.pydantic.dev/2.8/v/missing\nvotes\n  Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-2, input_type=int]\n    For further information visit https://errors.pydantic.dev/2.8/v/greater_than_equal\n\nSurvey available fields and types\n\n>>> import pprint\n>>> from datopy.models.media import Film\n>>> from datopy._examples import imdb_film_retrieve\n>>> from datopy.modeling import apply_recursive\n>>> film = imdb_film_retrieve(Film('spirited away'))\n\n..\n    # >>> film.keys()\n    # >>> pprint.pp(apply_recursive(lambda x: type(x).__name__, film), depth=3)",
   "type": "object",
   "properties": {
      "title": {
         "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVnumstr``",
         "pattern": "^[a-z0-9,.! ]+$",
         "title": "Title",
         "type": "string"
      },
      "imdb_id": {
         "description": "Unique 7-digit IMDb tt identifier",
         "pattern": "^tt.*\\d{7}$",
         "title": "Imdb Id",
         "type": "string"
      },
      "kind": {
         "description": "Retrieved from: `type`",
         "examples": [
            "movie",
            "tv series"
         ],
         "pattern": "^[a-z0-9,.! ]+$",
         "title": "Kind",
         "type": "string"
      },
      "year": {
         "maximum": 3000,
         "minimum": 1880,
         "title": "Year",
         "type": "integer"
      },
      "rating": {
         "maximum": 10.0,
         "minimum": 0.0,
         "title": "Rating",
         "type": "number"
      },
      "votes": {
         "minimum": 0,
         "title": "Votes",
         "type": "integer"
      },
      "runtime_mins": {
         "anyOf": [
            {
               "exclusiveMinimum": 0.0,
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Runtime Mins"
      },
      "genres": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVstr``",
               "pattern": "^[a-z, ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Genres"
      },
      "countries": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVstr``",
               "pattern": "^[a-z, ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Countries"
      },
      "director": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVstr``",
               "pattern": "^[a-z, ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Director"
      },
      "writer": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVstr``",
               "pattern": "^[a-z, ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Writer"
      },
      "composer": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVstr``",
               "pattern": "^[a-z, ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Composer"
      },
      "cast": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVstr``",
               "pattern": "^[a-z, ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Cast"
      },
      "plot": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVnumsent``",
               "pattern": "^[a-z0-9,.! ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Plot"
      },
      "synopsis": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVnumsent``",
               "pattern": "^[a-z0-9,.! ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Synopsis"
      },
      "plot_outline": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVnumsent``",
               "pattern": "^[a-z0-9,.! ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Plot Outline"
      },
      "budget_mil": {
         "anyOf": [
            {
               "minimum": 0.0,
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Strip $/, & text after first space",
         "title": "Budget Mil"
      },
      "opening_weekend_gross_mil": {
         "anyOf": [
            {
               "minimum": 0.0,
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Opening Weekend Gross Mil"
      },
      "cumulative_worldwide_gross_mil": {
         "anyOf": [
            {
               "minimum": 0.0,
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Cumulative Worldwide Gross Mil"
      }
   },
   "required": [
      "title",
      "imdb_id",
      "kind",
      "year",
      "rating",
      "votes"
   ]
}
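To make the schema's rules concrete, here is a rough standard-library sketch that checks a record against the six required fields listed above. It covers only `required`, `pattern`, `minimum`, and `maximum`, does no type checking, and is not how pydantic validates. `find_violations` and its reason strings are names invented for this example.

```python
import re

# Subset of the IMDbFilm schema shown above (required fields only).
SCHEMA = {
    "properties": {
        "title": {"type": "string", "pattern": r"^[a-z0-9,.! ]+$"},
        "imdb_id": {"type": "string", "pattern": r"^tt.*\d{7}$"},
        "kind": {"type": "string", "pattern": r"^[a-z0-9,.! ]+$"},
        "year": {"type": "integer", "minimum": 1880, "maximum": 3000},
        "rating": {"type": "number", "minimum": 0.0, "maximum": 10.0},
        "votes": {"type": "integer", "minimum": 0},
    },
    "required": ["title", "imdb_id", "kind", "year", "rating", "votes"],
}


def find_violations(record: dict, schema: dict = SCHEMA) -> dict:
    """Return {field: reason} for each rule of the schema subset that fails."""
    problems = {}
    for name in schema["required"]:
        if name not in record:
            problems[name] = "missing"
    for name, value in record.items():
        rules = schema["properties"].get(name, {})
        if "pattern" in rules and not re.search(rules["pattern"], value):
            problems[name] = "pattern mismatch"
        if "minimum" in rules and value < rules["minimum"]:
            problems[name] = "below minimum"
        if "maximum" in rules and value > rules["maximum"]:
            problems[name] = "above maximum"
    return problems


# Same invalid record as in the IMDbFilm docstring example.
invalid_film = dict(title="name", imdb_id="tt12", year=1975, votes=-2, rating=5.0)
violations = find_violations(invalid_film)
```

The three fields flagged here (`imdb_id`, `kind`, `votes`) are the same three reported in the `ValidationError` output of the docstring example.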

field title: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVnumstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])] [Required]#

CustomTypes : CSVnumstr

Constraints:

pattern = ^[a-z0-9,.! ]+$
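The character-class patterns can be tried out with Python's `re` module. This sketch assumes the two patterns behave the same under `re` as under pydantic's regex engine for these simple inputs; the sample strings are made up.

```python
import re

# Pattern shown for the CSVstr fields (genres, cast, ...).
CSV_STR = re.compile(r"^[a-z, ]+$")
# Pattern shown for title, kind, and the plot fields.
CSV_NUM = re.compile(r"^[a-z0-9,.! ]+$")

samples = ["spirited away", "Spirited Away", "name 10!", "wall-e", ""]

# Map each sample to whether it satisfies the pattern.
csv_num_ok = {s: bool(CSV_NUM.search(s)) for s in samples}
csv_str_ok = {s: bool(CSV_STR.search(s)) for s in samples}
```

Uppercase letters, hyphens, and empty strings fail both patterns, so raw scraped text presumably needs lowercasing and punctuation cleanup before it reaches the model.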

field imdb_id: str [Required]#

Unique 7-digit IMDb tt identifier

Constraints:

pattern = ^tt.*\d{7}$

Validated by:

check_alphanumeric
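A quick check of what the `imdb_id` pattern `^tt.*\d{7}$` accepts on its own, using the standard `re` module. The candidate strings are invented, and whether the `check_alphanumeric` validator rejects any of them afterwards is not shown in this page.

```python
import re

IMDB_ID = re.compile(r"^tt.*\d{7}$")

candidates = ["tt1234567", "tt12", "tt12345678", "ttabc1234567", "1234567"]

# Map each candidate to whether the pattern alone accepts it.
matches = {c: bool(IMDB_ID.search(c)) for c in candidates}
```

Because of the `.*`, the pattern only requires that the string start with `tt` and end with seven digits, so `tt12345678` and `ttabc1234567` also pass the pattern check.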

field kind: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVnumstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])] [Required]#

Retrieved from: type

Constraints:

pattern = ^[a-z0-9,.! ]+$

Validated by:

check_alphanumeric

field year: int [Required]#

Constraints:

ge = 1880
le = 3000

field rating: float [Required]#

Constraints:

ge = 0
le = 10

field votes: int [Required]#

Constraints:

ge = 0

field runtime_mins: float | None = None#

Constraints:

gt = 0

field genres: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])] | None = None#

field countries: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])] | None = None#

field director: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])] | None = None#

field writer: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])] | None = None#

field composer: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])] | None = None#

field cast: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])] | None = None#

field plot: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVnumsent``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])] | None = None#

field synopsis: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVnumsent``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])] | None = None#

field plot_outline: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVnumsent``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])] | None = None#

field budget_mil: float | None = None#

Strip $/, & text after first space

Constraints:

ge = 0
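One possible reading of "Strip $/, & text after first space" is sketched below. `parse_money` is a hypothetical helper, the raw input format is an assumed example, and whether `datopy` additionally rescales the value to millions is not stated here, so the sketch returns the plain number.

```python
def parse_money(raw: str) -> float:
    """Drop text after the first space, then strip '$' and ',' and parse."""
    amount = raw.strip().split(" ", 1)[0]          # keep text before first space
    amount = amount.replace("$", "").replace(",", "")
    return float(amount)


# Assumed shape of a raw budget string (illustrative only).
dollars = parse_money("$19,000,000 (estimated)")

# The field constraint ge = 0 would then apply to the parsed number.
is_valid = dollars >= 0
```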

field opening_weekend_gross_mil: float | None = None#

Constraints:

ge = 0

field cumulative_worldwide_gross_mil: float | None = None#

Constraints:

ge = 0

validator check_alphanumeric » kind, imdb_id[source]#

pydantic model SpotifyAlbum[source]#

Bases: BaseModel

Data model for processed Spotify metadata.

Raw data schema reference: ‘datopy/output/spotify_album_schema.json’.

Show JSON schema

{
   "title": "SpotifyAlbum",
   "description": "Data model for processed Spotify metadata.\n\nRaw data schema reference: 'datopy/output/spotify_album_schema.json'.",
   "type": "object",
   "properties": {
      "title": {
         "title": "Title",
         "type": "string"
      },
      "album_type": {
         "title": "Album Type",
         "type": "string"
      }
   },
   "required": [
      "title",
      "album_type"
   ]
}

Fields:

title (str)
album_type (str)

field title: str [Required]#

field album_type: str [Required]#

pydantic model WikiBook[source]#

Bases: BaseModel

Data model for processed Wikipedia novel metadata.

Raw data schema reference: ‘output/wiki_book_schema.json’.

Show JSON schema

{
   "title": "WikiBook",
   "description": "Data model for processed Wikipedia novel metadata.\n\nRaw data schema reference: 'output/wiki_book_schema.json'.",
   "type": "object",
   "properties": {
      "title": {
         "title": "Title",
         "type": "string"
      }
   },
   "required": [
      "title"
   ]
}

Fields:

title (str)

field title: str [Required]#

pydantic model WikiFilm[source]#

Bases: BaseModel

Data model for processed Wikipedia film metadata.

Raw data schema reference: ‘datopy/output/wiki_film_schema.json’.

Show JSON schema

{
   "title": "WikiFilm",
   "description": "Data model for processed Wikipedia film metadata.\n\nRaw data schema reference: 'datopy/output/wiki_film_schema.json'.",
   "type": "object",
   "properties": {
      "title": {
         "title": "Title",
         "type": "string"
      }
   },
   "required": [
      "title"
   ]
}

Fields:

title (str)

field title: str [Required]#

pydantic model WikiAlbum[source]#

Bases: BaseModel

Data model for processed Wikipedia album metadata.

Raw data schema reference: ‘datopy/output/wiki_album_schema.json’.

Show JSON schema

{
   "title": "WikiAlbum",
   "description": "Data model for processed Wikipedia album metadata.\n\nRaw data schema reference: 'datopy/output/wiki_album_schema.json'.",
   "type": "object",
   "properties": {
      "title": {
         "title": "Title",
         "type": "string"
      }
   },
   "required": [
      "title"
   ]
}

Fields:

title (str)

field title: str [Required]#

class IMDbFilmProcessor(model: BaseModel, query: NamedTuple)[source]

Bases: BaseProcessor

Processor that retrieves and prepares IMDb film data for a query.

Methods

`process`()	Prepare (extract/clean) the retrieved data.
`retrieve`()	Extract data for the query from the API of the supplied model.

retrieve()[source]#

Extract data for the query from the API of the supplied model.

Raises: NotImplementedError

process()[source]#

Prepare (extract/clean) the retrieved data.

Raises: NotImplementedError
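The retrieve-then-process flow can be sketched as follows. Every name here (`Query`, `ProcessorSketch`, `EchoProcessor`) is hypothetical and written for this example. The actual `BaseProcessor` interface is not shown in this page, so this only mirrors the documented constructor arguments and the two method names.

```python
from typing import Any, NamedTuple, Optional


class Query(NamedTuple):
    """Hypothetical query tuple with the same fields as MediaQuery."""
    title: str
    artist: Optional[str] = None


class ProcessorSketch:
    """Hypothetical base: stores model and query, leaves both steps unimplemented."""

    def __init__(self, model: Any, query: NamedTuple) -> None:
        self.model = model
        self.query = query

    def retrieve(self) -> Any:
        raise NotImplementedError("retrieve() must be provided by a subclass")

    def process(self) -> Any:
        raise NotImplementedError("process() must be provided by a subclass")


class EchoProcessor(ProcessorSketch):
    """Toy subclass: 'retrieves' a canned record and lowercases its title."""

    def retrieve(self) -> dict:
        return {"title": self.query.title.upper()}

    def process(self) -> dict:
        raw = self.retrieve()
        return {"title": raw["title"].lower()}


result = EchoProcessor(model=dict, query=Query("spirited away")).process()

# Calling an unimplemented step on the base raises NotImplementedError.
try:
    ProcessorSketch(model=dict, query=Query("x")).retrieve()
    base_raises = False
except NotImplementedError:
    base_raises = True
```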

Classes

`Album`(title[, artist])
`Book`(title[, artist])
`Film`(title[, artist])
`IMDbFilmProcessor`(model, query)	Processor that retrieves and prepares IMDb film data for a query.
`MediaQuery`(title[, artist])	Query object types for media metadata retrieval.