datopy.models.media

Description

Data models, validators, and ETL tools for scraped media data.

Includes support for film reviews (via IMDb), music albums (via Spotify), and related information (via Wikipedia).

Note

This module is a work in progress.

Overview

Data models

IMDbFilm

Data model for processed IMDb metadata.

SpotifyAlbum

Data model for processed Spotify metadata.

API

class MediaQuery(
title: str,
artist: str | None = None,
)[source]#

Bases: NamedTuple

Query object types for media metadata retrieval.

title: str#

Alias for field number 0

artist: str | None#

Alias for field number 1

class Film(
title: str,
artist: str | None = None,
)#

Bases: MediaQuery

class Album(
title: str,
artist: str | None = None,
)#

Bases: MediaQuery

class Book(
title: str,
artist: str | None = None,
)#

Bases: MediaQuery
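
The query types above share one two-field signature. A minimal sketch of how such NamedTuple-based queries behave, using local stand-in classes that mirror the documented signatures rather than importing datopy (the album title and artist are illustrative values only):

```python
from typing import NamedTuple, Optional


class MediaQuery(NamedTuple):
    """Local stand-in mirroring the documented MediaQuery signature."""
    title: str
    artist: Optional[str] = None


class Film(MediaQuery):
    """Stand-in for the documented Film query type."""


class Album(MediaQuery):
    """Stand-in for the documented Album query type."""


film_query = Film("spirited away")
album_query = Album("kind of blue", artist="miles davis")

# Fields are reachable by name or by position (the "field number" aliases).
print(film_query.title, film_query[0])
print(album_query.artist, album_query[1])

# The subclass is preserved, so a caller could dispatch on the query type.
print(isinstance(film_query, Film), isinstance(film_query, Album))
```

Because each query is still a tuple, it can be unpacked as `title, artist = film_query` or used as a dictionary key.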

pydantic model IMDbFilm[source]#

Bases: BaseModel

Data model for processed IMDb metadata.

Examples

>>> from pydantic import ValidationError
>>> from datopy.models.media import IMDbFilm
>>> from datopy._examples import imdb_film_retrieve

Valid film

>>> valid_film = IMDbFilm(
...     title='name 10!', imdb_id='tt1234567', kind='movie',
...     year=1990, rating=7.2, votes=122,
...     genres='romantic comedy, thriller', cast='mrs smith,mr smith',
...     plot='alas! once upon a time, ...',
...     budget_mil=1123929)

Invalid film

>>> invalid_film = dict(
...     title='name', imdb_id='tt12', year=1975, votes=-2, rating=5.0)
>>> try:
...     IMDbFilm(**invalid_film)
... except ValidationError as e:
...     print(e)          # use pprint.pp(e.errors()) for easy-to-read list
3 validation errors for IMDbFilm
imdb_id
  String should match pattern '^tt.*\d{7}$' [type=string_pattern_mismatch, input_value='tt12', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/string_pattern_mismatch
kind
  Field required [type=missing, input_value={'title': 'name', 'imdb_i...tes': -2, 'rating': 5.0}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/missing
votes
  Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-2, input_type=int]
    For further information visit https://errors.pydantic.dev/2.8/v/greater_than_equal
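
To see why 'tt12' is rejected, the string patterns used by this model can be tried directly. The sketch below uses Python's `re` module as an approximation of how pydantic evaluates a pattern; the helper `matches` is a hypothetical name introduced for illustration. The patterns are copied from the error message above and from the model's JSON schema.

```python
import re

IMDB_ID_PATTERN = r"^tt.*\d{7}$"   # imdb_id
TITLE_PATTERN = r"^[a-z0-9,.! ]+$"  # title, kind, plot, ...
CSV_PATTERN = r"^[a-z, ]+$"         # genres, cast, ...


def matches(pattern: str, value: str) -> bool:
    """Return True when value satisfies pattern under Python `re` semantics."""
    return re.search(pattern, value) is not None


print(matches(IMDB_ID_PATTERN, "tt1234567"))    # True
print(matches(IMDB_ID_PATTERN, "tt12"))         # False: fewer than 7 digits
print(matches(IMDB_ID_PATTERN, "tt123456789"))  # True: `.*` absorbs extra characters
print(matches(TITLE_PATTERN, "name 10!"))       # True
print(matches(TITLE_PATTERN, "Name 10!"))       # False: uppercase is not allowed
print(matches(CSV_PATTERN, "romantic comedy, thriller"))  # True
print(matches(CSV_PATTERN, "sci-fi"))           # False: hyphen is not allowed
```

Note that the `.*` in the imdb_id pattern means the value must end in seven digits but is not limited to exactly seven.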

Survey available fields and types

>>> import pprint
>>> from datopy.models.media import Film
>>> from datopy._examples import imdb_film_retrieve
>>> from datopy.modeling import apply_recursive
>>> film = imdb_film_retrieve(Film('spirited away'))
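
The behavior of `datopy.modeling.apply_recursive` is not documented on this page. As a rough illustration of the idea of surveying types in a nested record, here is a sketch with a hypothetical stand-in, `apply_recursive_sketch`, applied to a made-up record; the real function may handle containers differently.

```python
import pprint
from typing import Any, Callable


def apply_recursive_sketch(func: Callable[[Any], Any], obj: Any) -> Any:
    """Apply func to every leaf of a nested dict/list/tuple structure."""
    if isinstance(obj, dict):
        return {key: apply_recursive_sketch(func, value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Sequences are returned as lists in this sketch.
        return [apply_recursive_sketch(func, item) for item in obj]
    return func(obj)


# Made-up record shaped like scraped film metadata (not real retrieved output).
record = {
    "title": "spirited away",
    "year": 2001,
    "genres": ["animation", "adventure"],
    "box office": {"budget": None},
}

type_names = apply_recursive_sketch(lambda x: type(x).__name__, record)
pprint.pp(type_names, depth=3)
```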

Show JSON schema
{
   "title": "IMDbFilm",
   "description": "Data model for processed imdb metadata.\n\nExamples\n--------\n>>> from pydantic import ValidationError\n>>> from datopy.models.media import IMDbFilm\n>>> from datopy._examples import imdb_film_retrieve\n\nValid film\n\n>>> valid_film = IMDbFilm(\n...     title='name 10!', imdb_id='tt1234567', kind='movie',\n...     year=1990, rating=7.2, votes=122,\n...     genres='romantic comedy, thriller', cast='mrs smith,mr smith',\n...     plot='alas! once upon a time, ...',\n...     budget_mil=1123929)\n\nInvalid film\n\n>>> invalid_film = dict(\n...     title='name', imdb_id='tt12', year=1975, votes=-2, rating=5.0)\n>>> try:\n...     IMDbFilm(**invalid_film)\n... except ValidationError as e:\n...     print(e)          # use pprint.pp(e.errors()) for easy-to-read list\n3 validation errors for IMDbFilm\nimdb_id\n  String should match pattern '^tt.*\\d{7}$' [type=string_pattern_mismatch, input_value='tt12', input_type=str]\n    For further information visit https://errors.pydantic.dev/2.8/v/string_pattern_mismatch\nkind\n  Field required [type=missing, input_value={'title': 'name', 'imdb_i...tes': -2, 'rating': 5.0}, input_type=dict]\n    For further information visit https://errors.pydantic.dev/2.8/v/missing\nvotes\n  Input should be greater than or equal to 0 [type=greater_than_equal, input_value=-2, input_type=int]\n    For further information visit https://errors.pydantic.dev/2.8/v/greater_than_equal\n\nSurvey available fields and types\n\n>>> import pprint\n>>> from datopy.models.media import Film\n>>> from datopy._examples import imdb_film_retrieve\n>>> from datopy.modeling import apply_recursive\n>>> film = imdb_film_retrieve(Film('spirited away'))\n\n..\n    # >>> film.keys()\n    # >>> pprint.pp(apply_recursive(lambda x: type(x).__name__, film), depth=3)",
   "type": "object",
   "properties": {
      "title": {
         "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVnumstr``",
         "pattern": "^[a-z0-9,.! ]+$",
         "title": "Title",
         "type": "string"
      },
      "imdb_id": {
         "description": "Unique 7-digit IMDb tt identifier",
         "pattern": "^tt.*\\d{7}$",
         "title": "Imdb Id",
         "type": "string"
      },
      "kind": {
         "description": "Retrieved from: `type`",
         "examples": [
            "movie",
            "tv series"
         ],
         "pattern": "^[a-z0-9,.! ]+$",
         "title": "Kind",
         "type": "string"
      },
      "year": {
         "maximum": 3000,
         "minimum": 1880,
         "title": "Year",
         "type": "integer"
      },
      "rating": {
         "maximum": 10.0,
         "minimum": 0.0,
         "title": "Rating",
         "type": "number"
      },
      "votes": {
         "minimum": 0,
         "title": "Votes",
         "type": "integer"
      },
      "runtime_mins": {
         "anyOf": [
            {
               "exclusiveMinimum": 0.0,
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Runtime Mins"
      },
      "genres": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVstr``",
               "pattern": "^[a-z, ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Genres"
      },
      "countries": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVstr``",
               "pattern": "^[a-z, ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Countries"
      },
      "director": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVstr``",
               "pattern": "^[a-z, ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Director"
      },
      "writer": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVstr``",
               "pattern": "^[a-z, ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Writer"
      },
      "composer": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVstr``",
               "pattern": "^[a-z, ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Composer"
      },
      "cast": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVstr``",
               "pattern": "^[a-z, ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Cast"
      },
      "plot": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVnumsent``",
               "pattern": "^[a-z0-9,.! ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Plot"
      },
      "synopsis": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVnumsent``",
               "pattern": "^[a-z0-9,.! ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Synopsis"
      },
      "plot_outline": {
         "anyOf": [
            {
               "description": ":attr:`~datopy.modeling.CustomTypes` : ``CSVnumsent``",
               "pattern": "^[a-z0-9,.! ]+$",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Plot Outline"
      },
      "budget_mil": {
         "anyOf": [
            {
               "minimum": 0.0,
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Strip $/, & text after first space",
         "title": "Budget Mil"
      },
      "opening_weekend_gross_mil": {
         "anyOf": [
            {
               "minimum": 0.0,
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Opening Weekend Gross Mil"
      },
      "cumulative_worldwide_gross_mil": {
         "anyOf": [
            {
               "minimum": 0.0,
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Cumulative Worldwide Gross Mil"
      }
   },
   "required": [
      "title",
      "imdb_id",
      "kind",
      "year",
      "rating",
      "votes"
   ]
}
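
The "required" list in the schema explains the `kind` error in the invalid-film example. A small plain-Python sketch (not pydantic itself) that checks a candidate record against that list:

```python
# Required field names copied from the "required" list in the schema.
REQUIRED = ["title", "imdb_id", "kind", "year", "rating", "votes"]

# Same record as the invalid-film example in the docstring.
invalid_film = dict(title="name", imdb_id="tt12", year=1975, votes=-2, rating=5.0)

missing = [name for name in REQUIRED if name not in invalid_film]
print(missing)  # ['kind']
```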

Fields:
Validators:
field title: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVnumstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])] [Required]#

CustomTypes : CSVnumstr

Constraints:
  • pattern = ^[a-z0-9,.! ]+$

field imdb_id: str [Required]#

Unique 7-digit IMDb tt identifier

Constraints:
  • pattern = ^tt.*\d{7}$

Validated by:
  • check_alphanumeric

field kind: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVnumstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])] [Required]#

Retrieved from: type

Constraints:
  • pattern = ^[a-z0-9,.! ]+$

Validated by:
  • check_alphanumeric

field year: int [Required]#
Constraints:
  • ge = 1880

  • le = 3000

field rating: float [Required]#
Constraints:
  • ge = 0

  • le = 10

field votes: int [Required]#
Constraints:
  • ge = 0

field runtime_mins: float | None = None#
Constraints:
  • gt = 0

field genres: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])] | None = None#
field countries: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])] | None = None#
field director: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])] | None = None#
field writer: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])] | None = None#
field composer: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])] | None = None#
field cast: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVstr``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z, ]+$')])] | None = None#
field plot: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVnumsent``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])] | None = None#
field synopsis: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVnumsent``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])] | None = None#
field plot_outline: Annotated[str, FieldInfo(annotation=NoneType, required=True, description=':attr:`~datopy.modeling.CustomTypes` : ``CSVnumsent``', metadata=[_PydanticGeneralMetadata(pattern='^[a-z0-9,.! ]+$')])] | None = None#
field budget_mil: float | None = None#

Strip $/, & text after first space

Constraints:
  • ge = 0
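
The cleaning rule in the description can be sketched as follows. The raw input format and the helper name `clean_budget` are assumptions made for illustration; whether the library also rescales the value to millions is not stated here, so this sketch only performs the stripping step.

```python
def clean_budget(raw: str) -> float:
    """Drop text after the first space, then strip '$' and ',' characters."""
    amount = raw.strip().split(" ", 1)[0]
    amount = amount.replace("$", "").replace(",", "")
    return float(amount)


# Assumed raw format; actual scraped values may look different.
print(clean_budget("$19,000,000 (estimated)"))  # 19000000.0
```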

field opening_weekend_gross_mil: float | None = None#
Constraints:
  • ge = 0

field cumulative_worldwide_gross_mil: float | None = None#
Constraints:
  • ge = 0

validator check_alphanumeric  »  kind, imdb_id[source]#
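
The numeric constraints listed for the fields above (ge, le, gt) can be read as simple comparisons. The following is a plain-Python stand-in for illustration, not how pydantic implements them; `bound_violations` is a hypothetical helper.

```python
import operator

# (field, comparison, bound) triples copied from the Constraints entries.
BOUNDS = [
    ("year", "ge", 1880), ("year", "le", 3000),
    ("rating", "ge", 0), ("rating", "le", 10),
    ("votes", "ge", 0),
    ("runtime_mins", "gt", 0),
    ("budget_mil", "ge", 0),
    ("opening_weekend_gross_mil", "ge", 0),
    ("cumulative_worldwide_gross_mil", "ge", 0),
]


def bound_violations(data: dict) -> list[str]:
    """Return one message per violated bound."""
    problems = []
    for field, op_name, bound in BOUNDS:
        value = data.get(field)
        if value is None:
            # Absent or None values are skipped; requiredness is not checked here.
            continue
        if not getattr(operator, op_name)(value, bound):
            problems.append(f"{field} must be {op_name} {bound}, got {value!r}")
    return problems


print(bound_violations({"year": 1975, "votes": -2, "rating": 5.0}))
print(bound_violations({"year": 1990, "rating": 7.2, "votes": 122, "runtime_mins": 0}))
```

The first call reports only the negative `votes` value, matching the `greater_than_equal` error in the invalid-film example.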
pydantic model SpotifyAlbum[source]#

Bases: BaseModel

Data model for processed Spotify metadata.

Raw data schema reference: ‘datopy/output/spotify_album_schema.json’.

Show JSON schema
{
   "title": "SpotifyAlbum",
   "description": "Data model for processed Spotify metadata.\n\nRaw data schema reference: 'datopy/output/spotify_album_schema.json'.",
   "type": "object",
   "properties": {
      "title": {
         "title": "Title",
         "type": "string"
      },
      "album_type": {
         "title": "Album Type",
         "type": "string"
      }
   },
   "required": [
      "title",
      "album_type"
   ]
}

Fields:
field title: str [Required]#
field album_type: str [Required]#
pydantic model WikiBook[source]#

Bases: BaseModel

Data model for processed Wikipedia novel metadata.

Raw data schema reference: ‘output/wiki_book_schema.json’.

Show JSON schema
{
   "title": "WikiBook",
   "description": "Data model for processed Wikipedia novel metadata.\n\nRaw data schema reference: 'output/wiki_book_schema.json'.",
   "type": "object",
   "properties": {
      "title": {
         "title": "Title",
         "type": "string"
      }
   },
   "required": [
      "title"
   ]
}

Fields:
field title: str [Required]#
pydantic model WikiFilm[source]#

Bases: BaseModel

Data model for processed Wikipedia film metadata.

Raw data schema reference: ‘datopy/output/wiki_film_schema.json’.

Show JSON schema
{
   "title": "WikiFilm",
   "description": "Data model for processed Wikipedia film metadata.\n\nRaw data schema reference: 'datopy/output/wiki_film_schema.json'.",
   "type": "object",
   "properties": {
      "title": {
         "title": "Title",
         "type": "string"
      }
   },
   "required": [
      "title"
   ]
}

Fields:
field title: str [Required]#
pydantic model WikiAlbum[source]#

Bases: BaseModel

Data model for processed Wikipedia album metadata.

Raw data schema reference: ‘datopy/output/wiki_album_schema.json’.

Show JSON schema
{
   "title": "WikiAlbum",
   "description": "Data model for processed Wikipedia album metadata.\n\nRaw data schema reference: 'datopy/output/wiki_album_schema.json'.",
   "type": "object",
   "properties": {
      "title": {
         "title": "Title",
         "type": "string"
      }
   },
   "required": [
      "title"
   ]
}

Fields:
field title: str [Required]#
class IMDbFilmProcessor(
model: BaseModel,
query: NamedTuple,
)[source]#

Bases: BaseProcessor

Retrieve and process IMDb film metadata for a query.

Methods

process()

Prepare (extract/clean) the retrieved data.

retrieve()

Extract data for the query from the API of the supplied model.

retrieve()[source]#

Extract data for the query from the API of the supplied model.

Raises:

NotImplementedError – _description_.

process()[source]#

Prepare (extract/clean) the retrieved data.

Raises:

NotImplementedError – _description_.
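
Both methods are documented as raising NotImplementedError, which is consistent with a base-class pattern in which subclasses supply the behavior. The sketch below illustrates that pattern with hypothetical stand-ins (`BaseProcessorSketch`, `DemoFilmProcessor`); it does not reproduce datopy's `BaseProcessor`, and the retrieval step is faked rather than calling any API.

```python
from typing import Any, NamedTuple, Optional


class BaseProcessorSketch:
    """Hypothetical stand-in for BaseProcessor; the real class may differ."""

    def __init__(self, model: Any, query: NamedTuple) -> None:
        self.model = model
        self.query = query

    def retrieve(self) -> Any:
        raise NotImplementedError("subclasses must implement retrieve()")

    def process(self) -> Any:
        raise NotImplementedError("subclasses must implement process()")


class Film(NamedTuple):
    """Local stand-in mirroring the documented Film query signature."""
    title: str
    artist: Optional[str] = None


class DemoFilmProcessor(BaseProcessorSketch):
    """Toy subclass that fakes retrieval instead of querying a service."""

    def retrieve(self) -> dict:
        # Placeholder payload; a real processor would query an external source.
        self.raw = {"title": self.query.title.upper(), "type": "movie"}
        return self.raw

    def process(self) -> dict:
        # Lowercase the title and rename `type` to `kind`, since the field
        # documentation says `kind` is retrieved from `type`.
        return {"title": self.raw["title"].lower(), "kind": self.raw["type"]}


processor = DemoFilmProcessor(model=dict, query=Film("spirited away"))
processor.retrieve()
print(processor.process())  # {'title': 'spirited away', 'kind': 'movie'}
```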


Classes

Album(title[, artist])

Book(title[, artist])

Film(title[, artist])

IMDbFilmProcessor(model, query)

Retrieve and process IMDb film metadata for a query.

MediaQuery(title[, artist])

Query object types for media metadata retrieval.