Creating test data is essential for data scientists and data engineers, especially when working with large datasets that need to be transformed. It goes without saying that such transformations should be tested thoroughly: You do not want to wait for a few minutes for your transformation to finish only to realize you've misspelled the column name, applied the wrong formula or applied the right formula to the wrong columns! Consequently, you need to create test data and write test functions that apply your transformations to it.
Naturally, the process of creating such test data is tedious. Even more naturally, the data scientists and data engineers who first faced this process tried to establish best practices so that later generations would not have to waste time. However, what I found in my Google searches was not really satisfying. Most blog posts recommend using Faker to generate test data. While this certainly is a good starting point, the process of turning the generated test data into DataFrames in those blog posts felt clunky to me. Because I knew that Factory Boy is able to provide factories for data generated by Faker and is used frequently for testing Django apps, I developed the following short and easy-to-apply approach.
Note: The following method is appropriate for generating small- to medium-sized test data. If you want to generate large datasets and performance is critical, you will be better off using mimesis. Additionally, there is an integration of mimesis into Factory Boy, so the following method is also feasible for large datasets.
Step 1: Prerequisites
Of course, you need to install pandas. Other than that, you do not need to install Faker explicitly; instead, it suffices to install Factory Boy (which in turn has Faker as a dependency). If you use pip or conda, one of the following two commands should suffice.
Step 2: Define a namedtuple containing (a selection of) the features of your dataset
As its name suggests, a namedtuple is an extended version of a plain Python tuple. It is a class with specified attributes and utility methods that help you construct instances. Assume that our dataset consists of a name, an account balance (USD) and a birth date in the format YYYY-MM-DD. Based on this, our namedtuple has to look like this.
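A definition matching this description could be written as follows. It is a minimal sketch; the field names name, account_balance and birth_date are the ones used in the later steps of this post.

```python
from collections import namedtuple

# The class name comes first, followed by the names of its attributes.
Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])
```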
With only one line of code (well, at least without the import statement), we defined a new class Dataset with three attributes and got a lot of goodies for free. Most importantly, namedtuples are compatible with pandas.DataFrames and with Factory Boy.
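To give an idea of what those goodies look like, here is a short sketch of a few utility methods that every namedtuple provides. The sample record is made up purely for illustration.

```python
from collections import namedtuple
from datetime import date

Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])

# Hypothetical sample record, for illustration only.
row = Dataset("Jane Doe", 1234.56, date(1990, 5, 17))

# Tuple of field names, in the order they were defined.
fields = Dataset._fields

# Mapping from field name to value.
as_dict = row._asdict()

# New record with one field changed; the original is left untouched.
updated = row._replace(account_balance=0.0)

print(fields)
print(as_dict["birth_date"].isoformat())
print(updated.account_balance, row.account_balance)
```

The _fields attribute is what allows pandas to derive column names from a list of such records.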
Step 3: Define a Factory that creates test datasets according to your specifications
In this step, Factory Boy and Faker come into play. Using the Factory class and the Faker wrapper from the factory module, our specification for the dataset is as follows.
First, we tell our Factory in the inner Meta class what object it shall create by assigning our Dataset class to the model attribute. Second and last, we specify what kind of data belongs to which feature of our dataset using Faker providers. In this case, we tell our Factory that the attribute name shall be a name (in Faker's default locale, unless you configure a different one), that account_balance shall be a float with 6 digits to the left and 2 digits to the right of the decimal point (as is usual for most currencies) and, finally, that birth_date shall be a date of birth where the minimum age is 18.
Using the Factory
There are three basic uses of our DatasetFactory.
First, to use the Factory with the specifications as-is, simply call the standard constructor with no arguments.
Second, for certain test cases it might be necessary to assign a fixed value to an attribute. In such cases, you may supply appropriate keyword arguments to the constructor.
Third and last, if you wish to generate a batch of test data, the class method create_batch will be your tool of choice. You may also supply fixed values as keyword arguments.
Step 4: Create a test DataFrame and supply the DatasetFactory's output
For the last step, we exploit the fact that DataFrames are compatible with namedtuples. Namely, if you call the DataFrame's constructor with a list of namedtuples, pandas will create a DataFrame with columns named after the namedtuple's attributes. As a result, the transformation of a batch of Dataset objects into a DataFrame reduces to one line of code.
Here's a sample output.
Additional work after creating the test data
If you really want to make sure that your transformations convert dates correctly, you will have to apply an extra step. As it stands now, the column birth_date consists of Python date objects. To convert them to strings of the desired format, you can use the strftime method.
df["birth_date"] = df["birth_date"].apply(lambda d: d.strftime("%Y-%m-%d"))