
Typing with dataclasses



Recently, I’ve been adding type hints and refactoring an established untyped codebase. I wrote about adopting type hints in my own projects in a previous post.

Instead of returning dictionaries, my data access layer now returns instances of my schema models. In turn, all dictionary get statements on that data have been replaced with dot notation, which makes the code so much easier to read.
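For illustration, a hypothetical before/after, where book is a record returned by the data access layer:

## before: raw dictionary access
title = book["title"]
pages = book.get("pages", 0)

## after: dot notation on a typed schema model
title = book.title
pages = book.pages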

Yet some functions still contained dictionary getters; they wrangled dictionaries into API responses by combining different entities into data transfer objects.

Although these did work perfectly fine, I couldn’t easily typecheck them. Over time these wrangle functions had also become harder to reason about, so I figured they were ripe to refactor into dataclasses.

Turning them into dataclasses would cleanly expose all the fields and make the data shape explicit: easier to typecheck and easier to reason about. A sketch of one such wrangle function follows below.
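For context, a rough sketch of what one of these wrangle functions looked like (the exact fields here are hypothetical):

def wrangle_book_response(book: Dict, profile: Dict) -> Dict:
    ## combine different entities into one untyped response dict
    return {
        "id": book.get("id"),
        "title": book.get("title"),
        "profile": {
            "id": profile.get("id"),
            "name": profile.get("profile_name"),
        },
    }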

Library API Response

I have a service hosting a library of books, and it contains an API endpoint that serves information about each book. The original response looks something like this:

{
  ## flat fields from the Book entity
  "id": 1266,
  "title": "Animal Farm",
  "author": "George Orwell",
  "publish_year": 1944,
  "pages": 144,
  ## nested fields from the ContentProfile entity
  "profile": {
    "avatar": "/avatars/005.png",
    "id": 3,
    "name": "Classics"
  },
  ## computed thumbnail_url path
  "thumbnail_url": "/thumbnails/88a5f8/thumbnail.webp"
}

The response combines specific parts of the Book entity with a nested ContentProfile entity.

It also contains a computed field thumbnail_url.

Defining Dataclasses

Let’s start by defining the fields we want to expose from our Book entity:

from dataclasses import dataclass

@dataclass
class BookInfo:
    id: int
    title: str
    author: str
    publish_year: int
    pages: int

Next, to cleanly define the nested structure of the ContentProfile entity, we will split it into a separate dataclass of its own.

@dataclass
class ProfileInfo:
    id: int
    name: str
    avatar: str

This makes it very clear which fields we are going to expose.

Loading and Filtering Data

In my current mental model for dataclasses, I think of them sort of like a sieve: I dump a bunch of data into them, and they automatically load only the fields I have specified in the signature.

To load data from a model instance into our dataclass, we can create a classmethod that ingests the model instance. The most direct way is to write out and map each field by hand.

@dataclass
class ProfileInfo:
    id: int
    name: str
    avatar: str

    @classmethod
    def from_model(cls, profile: ContentProfile) -> 'ProfileInfo':
        return cls(
            id=profile.id,
            name=profile.profile_name,
            avatar=profile.avatar
        )

Something that bothers me about this setup is the duplication of the fields.

Typechecker Limitations

Worse yet, when testing the classmethod, I tried supplying the name field with a bool instead of a str.

...
@classmethod
def from_model(cls, profile: ContentProfile) -> 'ProfileInfo':
    return cls(
        id=profile.id,
        name=True,
        avatar=profile.avatar
    )

Then I ran the type checker:

All checks passed!

I was expecting it to throw an error, but apparently this is a known limitation of type checkers.

type checkers do not type-check cls(…) inside class methods
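For contrast, the same bad argument in a direct constructor call is flagged, because the checker then knows the concrete class it is constructing. A minimal repro (the exact diagnostic will vary by checker):

## flagged: expected "str" for "name", got "bool"
profile_info = ProfileInfo(
    id=3,
    name=True,
    avatar='/avatars/005.png'
)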

So the suggested approach is to move the logic into a top-level function and call that from inside the classmethod instead:

def book_info_from_model(model: Book) -> 'BookInfo':
    return BookInfo(
        id=model.id,
        title=model.title,
        author=model.author,
        publish_year=model.publish_year,
        pages=model.pages
    )

@dataclass(kw_only=True)
class BookInfo:
    id: int
    title: str
    author: str
    publish_year: int
    pages: int

    @classmethod
    def from_model(cls, model: Book) -> 'BookInfo':
        return book_info_from_model(model)

Great, this solves it and does catch all the type inconsistencies, but I still don’t like doubling up the fields.

We could automate this by extracting all the keyword arguments with a dictionary comprehension and passing them in like this:

def book_info_from_model(model: Book) -> 'BookInfo':
    field_names = list(BookInfo.__annotations__)
    init_data = {f: getattr(model, f) for f in field_names
                 if hasattr(model, f)}
    return BookInfo(**init_data)

This removes a lot of boilerplate and code duplication, but now we run into another of the typechecker’s limitations.

type checkers cannot fully validate kwargs based construction
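The underlying reason: getattr returns Any, so the comprehension produces a Dict[str, Any], and the checker cannot match the unpacked values against the constructor signature. A minimal illustration:

init_data: Dict[str, Any] = {'id': 'not-an-int', 'title': 42}  ## wrong value types
book_info = BookInfo(**init_data)  ## the type checker can't see inside the dict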

Runtime Validation

To avoid these limitations, we can convert part of the static type checking into runtime type validation. One way is to add validation logic to the dataclass’s __post_init__ method:

from dataclasses import dataclass, fields

@dataclass
class BookInfo:
    id: int
    title: str
    author: str
    publish_year: int
    pages: int

    def __post_init__(self):
        bypass_fields = ['id', 'created_at', 'updated_at']
        for f in fields(self):
            if f.name in bypass_fields:
                continue
            expected = f.type
            value = getattr(self, f.name)
            if not isinstance(value, expected):  # type: ignore
                raise TypeError(
                    f"Field '{f.name}' expected {expected}, got {type(value)}"
                )

Now let’s say we instantiate the BookInfo and incorrectly pass a bool to the title field:

book_info = BookInfo(
    id=1,
    title=True,  ## <--- incorrect type, should be a string
    author='F. Scott Fitzgerald',
    publish_year=1925,
    pages=180
)

Our typechecker remains blissfully oblivious:

All checks passed!

But at least now we get a TypeError when running the code!

TypeError: Field 'title' expected <class 'str'>, got <class 'bool'>

To clean up, we wrap all this logic into a base class BaseDTO.

from dataclasses import dataclass, fields
from typing import Any, Type, TypeVar

T = TypeVar('T', bound='BaseDTO')

@dataclass
class BaseDTO:
    @classmethod
    def from_model(cls: Type[T], model: Any) -> T:
        ## grab the field names from this class
        field_names = list(cls.__annotations__)
        ## extract into a kwargs dict and pass to the constructor
        init_data = {f: getattr(model, f) for f in field_names if hasattr(model, f)}
        return cls(**init_data)

    def __post_init__(self):
        bypass_fields = ['id', 'created_at', 'updated_at']
        for f in fields(self):
            if f.name in bypass_fields:
                continue
            expected = f.type
            value = getattr(self, f.name)
            if not isinstance(value, expected):  # type: ignore
                raise TypeError(
                    f"Field '{f.name}' expected {expected}, got {type(value)}"
                )

Then we simply inherit all this functionality in our BookInfo:

@dataclass
class BookInfo(BaseDTO):
    id: int
    title: str
    author: str
    publish_year: int
    pages: int

## and instantiate like:
book = db.find_one(Book)
book_info = BookInfo.from_model(book)

Remapping Fields

Next, I wanted an easy way to remap / rename certain fields. The thinking: as the data enters the sieve, some column names that make sense in the context of the database might not be the best names for whoever is consuming the API.

In my case, the ContentProfile model contains a profile_name column that should be exposed to the API as a name field instead.

To make this work, we can add a field_map to our base class.

from typing import ClassVar, Dict

@dataclass
class BaseDTO:
    field_map: ClassVar[Dict[str, str]] = {}  ## <-- store a field map

Then on our ProfileInfo we can add the fields to be renamed like so:

@dataclass
class ProfileInfo(BaseDTO):
    field_map = {'profile_name': 'name'}

Then, instead of the basic dictionary comprehension in the from_model method, we can iterate over all the fields and rename any that match the field_map items.

@dataclass
class BaseDTO:
    field_map: ClassVar[Dict[str, str]] = {}

    @classmethod
    def from_model(cls: Type[T], model: Any) -> T:
        field_names = list(cls.__annotations__)
        init_data = {}
        for field_name in field_names:
            if field_name == 'field_map':
                continue
            ## check if any incoming fields map to this dataclass field
            in_field_name = None
            for in_field, out_field in cls.field_map.items():
                if out_field == field_name:
                    in_field_name = in_field
                    break
            ## add and map incoming field to output field
            if in_field_name and hasattr(model, in_field_name):
                init_data[field_name] = getattr(model, in_field_name)
            ## add direct field name match
            elif hasattr(model, field_name):
                init_data[field_name] = getattr(model, field_name)
        return cls(**init_data)
...

This way the profile_name field from the ContentProfile will always be mapped onto the name field of the ProfileInfo dataclass.

query_filters = [('profile_name', 'is', 'Classics')]
profile = db.find_one(ContentProfile, query_filters)
profile_info = ProfileInfo.from_model(profile)

Now the profile_name column is correctly mapped onto the name field.

print('Profile Name:', profile_info.name)
Profile Name: Classics

Response Object

Finally let’s see how we can build the response object.

Remember, the response object was a combination of the BookInfo and the ProfileInfo stored on the profile field, and we also need to add the computed thumbnail_url.

In my case, since most of the response fields match the BookInfo, I’ll inherit it and create a new dataclass called BookInfoResponse.

@dataclass
class BookInfoResponse(BookInfo):
    profile: ProfileInfo
    thumbnail_url: str

To instantiate it, we can create a small function:

from dataclasses import asdict

def build_book_info_response(
    book: Book,
    profile: ContentProfile
) -> BookInfoResponse:
    """Builds the book info response object."""
    ## first dump and sieve all the Book data
    book_info = BookInfo.from_model(book)
    book_info_dict = asdict(book_info)
    ## next dump, sieve and transform all the ContentProfile data
    profile_info = ProfileInfo.from_model(profile)
    ## generate the thumbnail url from the Book data
    thumbnail_url = get_thumbnail_url(book)
    ## return the BookInfoResponse
    return BookInfoResponse(
        **book_info_dict,
        profile=profile_info,
        thumbnail_url=thumbnail_url
    )

Although converting the book_info into a dictionary and then dumping it as keyword args into the BookInfoResponse might look a bit dodgy, at this point in the code the data types will already have been validated through our initial dataclasses.

Lastly, since our API layer doesn’t know how to parse a dataclass into JSON, I want to add a convenience method on the BookInfoResponse that wraps the asdict function.

@dataclass
class BookInfoResponse(BookInfo):
    profile: ProfileInfo
    thumbnail_url: str

    def to_dict(self) -> Dict:
        """Utility method to convert the instance into a dict."""
        return asdict(self)

Then we can use this method in our API layer:

from typing import Dict, Tuple, TypeAlias

API_Response: TypeAlias = Tuple[Dict | None, str | None, int]

class BooksMetadataService:
    def get_book_info(self, book_id: int) -> API_Response:
        try:
            ...
            ## call our response builder
            response = logic.build_book_info_response(
                book=book,
                profile=profile,
            )
            ## return dictionary response
            return response.to_dict(), None, 200
        except exceptions.BookNotFoundError as e:
            return None, str(e), 404
        except Exception as e:
            self.logger.error(
                f"Error getting book info: {e}", exc_info=True)
            return None, str(e), 500

I’m pretty happy using dataclasses this way: it minimizes boilerplate but retains type checks, whether static or at runtime.

Best of all, it’s done away with the obfuscated wrangle functions and has made the code easier to reason about.

Pydantic?

While racking my brain trying to type check these dataclasses, it seemed that wherever I looked, all roads led to Pydantic, which offers runtime type validation out of the box. I’ll have to investigate this next.
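For reference, the ProfileInfo above might look something like this as a Pydantic v2 model (an untested sketch; from_attributes lets the model read fields off an ORM-style object):

from pydantic import BaseModel, ConfigDict

class ProfileInfo(BaseModel):
    model_config = ConfigDict(from_attributes=True)
    id: int
    name: str
    avatar: str

## raises a ValidationError at runtime if the types don't match
profile_info = ProfileInfo.model_validate(profile)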

Apparently there’s a performance overhead to Pydantic, which makes it more suitable for validating external data than for deploying it as a general-purpose dataclass at every turn.

I’m dealing with external data in one specific area of my codebase, and it will be interesting to deploy Pydantic there, see the benefits, and compare the performance overhead of dictionaries, dataclasses and Pydantic objects.