Typing with dataclasses
Recently, I’ve been adding type hints while refactoring an established untyped codebase. I wrote about adopting type hints in my own projects in a previous post.
Instead of returning dictionaries, my data access layer now returns instances of my schema models. In turn, all dictionary get calls on that data have been replaced with dot notation, which makes it so much easier to read.
Yet some functions still contained dictionary getters, wrangling dictionaries into API responses by combining different entities into data transfer objects.
Although these worked perfectly fine, I couldn’t easily typecheck them. Over time these wrangle functions had also become harder to reason with, so I figured they were ripe to refactor into a dataclass.
Turning them into a dataclass would cleanly expose all the fields and make the data shape very explicit, easier to typecheck and easier to reason with.
Library API Response
I have a service hosting a library of books and it contains an API endpoint that’s serving information about each book. The original response looks something like this:
```
{
    ## flat fields from the Book entity
    "id": 1266,
    "title": "Animal Farm",
    "author": "George Orwell",
    "publish_year": 1944,
    "pages": 144,

    ## nested fields from the ContentProfile entity
    "profile": {
        "avatar": "/avatars/005.png",
        "id": 3,
        "name": "Classics"
    },

    ## computed thumbnail_url path
    "thumbnail_url": "/thumbnails/88a5f8/thumbnail.webp"
}
```

The response combines specific parts of the Book entity with a nested ContentProfile entity.
It also contains a computed field thumbnail_url.
Defining Dataclasses
Let’s start by defining the fields we want to expose from our Book entity:
```python
@dataclass
class BookInfo:
    id: int
    title: str
    author: str
    publish_year: int
    pages: int
```

Next, to cleanly define the nested structure of the ContentProfile entity, we will split it into a separate dataclass of its own.
```python
@dataclass
class ProfileInfo:
    id: int
    name: str
    avatar: str
```

This makes it very clear which fields we are going to expose.
Loading and Filtering Data
In my current mental model for dataclasses, I think of them sort of like a sieve.
I dump a bunch of data into them, and they load only the fields that I have specified in the signature.
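As a rough sketch of that mental model (the raw dict and its extra internal_flag key are made up for illustration):

```python
## a made-up payload with one extra key the dataclass doesn't declare
raw = {'id': 3, 'name': 'Classics', 'avatar': '/avatars/005.png', 'internal_flag': True}

## the sieve: keep only the keys that ProfileInfo declares
sieved = {k: v for k, v in raw.items() if k in ProfileInfo.__annotations__}
profile_info = ProfileInfo(**sieved)  ## 'internal_flag' is filtered out
```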
To load data from a model instance into our dataclass, we can create a classmethod that ingests the model instance. The most direct way is to write out and map each field by hand.
```python
@dataclass
class ProfileInfo:
    id: int
    name: str
    avatar: str

    @classmethod
    def from_model(cls, profile: ContentProfile) -> 'ProfileInfo':
        return cls(
            id=profile.id,
            name=profile.profile_name,
            avatar=profile.avatar
        )
```

Something that bothers me about this setup is duplicating the fields.
Typechecker Limitations
Worse yet, while testing the classmethod, I tried supplying the name field with a bool instead of a str.
```python
    ...
    @classmethod
    def from_model(cls, profile: ContentProfile) -> 'ProfileInfo':
        return cls(
            id=profile.id,
            name=True,
            avatar=profile.avatar
        )
```

Then I ran the type checker:
```
All checks passed!
```

I was expecting it to throw an error, but apparently this is a known limitation of type checkers:
type checkers do not type-check cls(…) inside class methods
So the suggested approach is to move the logic into a top-level function and call that from inside the classmethod instead:
```python
def book_info_from_model(model: Book) -> 'BookInfo':
    return BookInfo(
        id=model.id,
        title=model.title,
        author=model.author,
        publish_year=model.publish_year,
        pages=model.pages
    )

@dataclass(kw_only=True)
class BookInfo:
    id: int
    title: str
    author: str
    publish_year: int
    pages: int

    @classmethod
    def from_model(cls, model: Book) -> 'BookInfo':
        return book_info_from_model(model)
```

Great, this solves it and does catch all the type inconsistencies, but I still don’t like doubling up the fields.
We could automate this by extracting all the keyword arguments with a dictionary comprehension and passing them in like this:
```python
def book_info_from_model(model: Book) -> 'BookInfo':
    fields = list(BookInfo.__annotations__)
    init_data = {f: getattr(model, f) for f in fields if hasattr(model, f)}
    return BookInfo(**init_data)
```

This removes a lot of boilerplate and code duplication, but now we run into another of the typechecker’s limitations:
type checkers cannot fully validate kwargs-based construction
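A minimal sketch of that limitation, using the BookInfo from above (the values are made up): once the arguments travel through a dict typed as Dict[str, Any], the checker can no longer match keys and values to field types.

```python
from typing import Any, Dict

## the checker flags a bad value when it sees the call directly:
BookInfo(id=1, title=True, author='x', publish_year=1944, pages=144)  ## error: bool is not str

## but hidden behind a kwargs dict, the same bad value passes silently:
init_data: Dict[str, Any] = {
    'id': 1, 'title': True, 'author': 'x',
    'publish_year': 1944, 'pages': 144,
}
BookInfo(**init_data)  ## no error reported
```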
Runtime Validation
To avoid these limitations, we can convert part of the type checking into runtime type validation. One way is to add validation logic to the __post_init__ method on the dataclass:
```python
@dataclass
class BookInfo:
    id: int
    title: str
    author: str
    publish_year: int
    pages: int

    def __post_init__(self):
        bypass_fields = ['id', 'created_at', 'updated_at']
        for f in fields(self):
            if f.name in bypass_fields:
                continue

            expected = f.type
            value = getattr(self, f.name)

            if not isinstance(value, expected):  # type: ignore
                raise TypeError(
                    f"Field '{f.name}' expected {expected}, got {type(value)}"
                )
```

Now let’s say we instantiate BookInfo and incorrectly pass a bool to the title field:
```python
book_info = BookInfo(
    id=1,
    title=True,  ## <--- incorrect type, should be a string
    author='F. Scott Fitzgerald',
    publish_year=1925,
    pages=180
)
```

Our typechecker stays oblivious:
```
All checks passed!
```

But at least now we get a TypeError when running the code!
```
TypeError: Field 'title' expected <class 'str'>, got <class 'bool'>
```

To clean up, we wrap all this logic into a base class, BaseDTO.
```python
from dataclasses import dataclass, fields
from typing import Any, Type, TypeVar

T = TypeVar('T', bound='BaseDTO')

@dataclass
class BaseDTO:

    @classmethod
    def from_model(cls: Type[T], model: Any) -> T:
        ## grab the fields from this class
        ## (named field_names to avoid shadowing dataclasses.fields)
        field_names = list(cls.__annotations__)
        ## extract into kwargs dict and pass to constructor
        init_data = {f: getattr(model, f) for f in field_names if hasattr(model, f)}
        return cls(**init_data)

    def __post_init__(self):
        bypass_fields = ['id', 'created_at', 'updated_at']
        for f in fields(self):
            if f.name in bypass_fields:
                continue

            expected = f.type
            value = getattr(self, f.name)

            if not isinstance(value, expected):  # type: ignore
                raise TypeError(
                    f"Field '{f.name}' expected {expected}, got {type(value)}"
                )
```

Then we just inherit all this functionality onto our BookInfo:
```python
@dataclass
class BookInfo(BaseDTO):
    id: int
    title: str
    author: str
    publish_year: int
    pages: int
```
```python
## and instantiate like:
book = db.find_one(Book)
book_info = BookInfo.from_model(book)
```

Remapping Fields
Next, I wanted an easy way to remap or rename certain fields. The thinking being: as the data enters the sieve, column names that make sense in the context of the database might not be the best names for whoever is consuming the API.
In my case, the ContentProfile model contains a profile_name column that should be exposed to the API as a name field instead.
To make this work, we can add a field_map onto our base class.
```python
from typing import ClassVar, Dict

@dataclass
class BaseDTO:
    field_map: ClassVar[Dict[str, str]] = {}  ## <-- store a field map
```

Then on our ProfileInfo we can add the fields to be renamed like so:
```python
@dataclass
class ProfileInfo(BaseDTO):
    field_map = {'profile_name': 'name'}
```

Then instead of the basic dictionary comprehension in the from_model method, we can iterate over all the fields and rename any that match the field_map items.
```python
@dataclass
class BaseDTO:
    field_map: ClassVar[Dict[str, str]] = {}

    @classmethod
    def from_model(cls: Type[T], model: Any) -> T:
        field_names = list(cls.__annotations__)
        init_data = {}

        for field_name in field_names:
            if field_name == 'field_map':
                continue
            ## check if any incoming fields map to this dataclass field
            in_field_name = None
            for in_field, out_field in cls.field_map.items():
                if out_field == field_name:
                    in_field_name = in_field
                    break
            ## add and map incoming field to output field
            if in_field_name and hasattr(model, in_field_name):
                init_data[field_name] = getattr(model, in_field_name)
            ## add direct field name match
            elif hasattr(model, field_name):
                init_data[field_name] = getattr(model, field_name)
        return cls(**init_data)

    ...
```

This way the profile_name field from the ContentProfile will always be mapped onto the name field of the ProfileInfo dataclass.
```python
query_filters = [('profile_name', 'is', 'Classics')]
profile = db.find_one(ContentProfile, query_filters)
profile_info = ProfileInfo.from_model(profile)
```

Now it has correctly mapped profile_name to the name field.
```python
print('Profile Name:', profile_info.name)
```

```
Profile Name: Classics
```

Response Object
Finally, let’s see how we can build the response object.
Remember, the response object was a combination of the BookInfo and the ProfileInfo stored on the profile field, and we also need to add the computed thumbnail_url.
In my case, since most of the response fields match the BookInfo, I’ll inherit it and create a new dataclass called BookInfoResponse.
```python
@dataclass
class BookInfoResponse(BookInfo):
    profile: ProfileInfo
    thumbnail_url: str
```

To instantiate it, we can create a small function:
```python
def build_book_info_response(
    book: Book,
    profile: ContentProfile
) -> BookInfoResponse:
    """builds book info response object"""
    ## first dump and sieve all the book data
    book_info = BookInfo.from_model(book)
    book_info_dict = asdict(book_info)

    ## next dump, sieve and transform all the ContentProfile data
    profile_info = ProfileInfo.from_model(profile)

    ## generate the thumbnail url from the Book data
    thumbnail_url = get_thumbnail_url(book)

    ## return the BookInfoResponse
    return BookInfoResponse(
        **book_info_dict,
        profile=profile_info,
        thumbnail_url=thumbnail_url
    )
```

Although converting the book_info into a dictionary and then dumping it as keyword args into the BookInfoResponse might look a bit dodgy, at this point in the code the data types will have already been validated through our initial dataclasses.
Lastly, since our API layer doesn’t know how to parse a dataclass into JSON, I want to add a convenience method on the BookInfoResponse that wraps the asdict function.
```python
@dataclass
class BookInfoResponse(BookInfo):
    profile: ProfileInfo
    thumbnail_url: str

    def to_dict(self) -> Dict:
        """utility method to convert instance into dict"""
        return asdict(self)
```

Then we can use this method in our API layer:
```python
API_Response: TypeAlias = Tuple[Dict | None, str | None, int]

class BooksMetadataService:
    def get_book_info(self, book_id: int) -> API_Response:
        try:
            ...
            ## call our response builder
            response = logic.build_book_info_response(
                book=book,
                profile=profile,
            )

            ## return dictionary response
            return response.to_dict(), None, 200

        except exceptions.BookNotFoundError as e:
            return None, str(e), 404
        except Exception as e:
            self.logger.error(
                f"Error getting book info: {e}", exc_info=True)
            return None, str(e), 500
```

I’m pretty happy using dataclasses in this way: it minimizes boilerplate but retains type checks, whether static or at runtime.
Best of all, it’s done away with the obfuscated wrangle functions and has made the code easier to reason with.
Pydantic?
While racking my brain trying to type check these dataclasses, it seemed wherever I looked, all roads were leading to Pydantic, which offers runtime type validation out of the box. I’ll have to investigate this next.
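For a taste, here’s a rough sketch of what the equivalent of our ProfileInfo might look like (assuming Pydantic v2):

```python
from pydantic import BaseModel, ValidationError

class ProfileInfo(BaseModel):
    id: int
    name: str
    avatar: str

try:
    ProfileInfo(id=3, name=True, avatar='/avatars/005.png')
except ValidationError as e:
    print(e)  ## the bool is rejected at runtime, no __post_init__ required
```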
Apparently there’s a performance overhead to Pydantic, which makes it more suitable for validating external data rather than deploying it as a general-purpose dataclass at every turn.
I’m dealing with external data in one specific area of my codebase, and it will be interesting to deploy Pydantic there, see the benefits, and compare the performance overhead of dictionaries, dataclasses, and Pydantic objects.