Creating test data is essential for data scientists and data engineers, especially when working with large datasets that need to be transformed. It goes without saying that such transformations should be tested thoroughly: You do not want to wait for a few minutes for your transformation to finish only to realize you’ve misspelled the column name, applied the wrong formula or applied the right formula to the wrong columns! Consequently, you need to create test data and write test functions that apply your transformations to it.
Naturally, the process of creating such test data is tedious. Even more naturally, the data scientists and data engineers who first faced this process tried to establish best practices so that later generations would not have to waste time. However, what I found in my Google searches was not really satisfying. Most blog posts recommend using Faker to generate test data. While this certainly is a good starting point, the process of turning the generated test data into DataFrames in those blog posts felt clunky to me. Because I knew that Factory Boy is able to provide factories for data generated by Faker and is used frequently for testing Django apps, I developed the following short and easy-to-apply approach.
Note: The following method is appropriate for generating small- to medium-sized test data. If you want to generate large datasets and performance is critical, you will be better off using mimesis. There is also an integration of mimesis into Factory Boy, so the following method remains feasible for large datasets.
Step 1: Prerequisites
Of course, you need to install pandas. Other than that, you do not need to install Faker explicitly; instead, it suffices to install Factory Boy (which in turn has Faker as a dependency). If you use pip or conda, one of the following two commands should suffice.
pip install factory_boy
conda install factory_boy
Step 2: Define a namedtuple containing (a selection of) the features of your dataset
As its name suggests, a namedtuple is an extended version of a plain Python tuple. It is a class with named attributes and utility methods that help you construct instances. Assume that our dataset consists of a name, an account balance (USD) and a birth date in the format YYYY-MM-DD. Based on this, our namedtuple has to look like this.
from collections import namedtuple

Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])
With only one line of code (well, at least without the import statement), we defined a new class Dataset with three attributes and got a lot of goodies for free. Most importantly, namedtuples are compatible with pandas.DataFrames and with Factory Boy.
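To make the "goodies" concrete, here is a small sketch using only the standard library. The sample values (Jane Doe and her balance) are made up for illustration; `_fields`, `_asdict` and `_replace` are the standard namedtuple helpers.

```python
from collections import namedtuple
from datetime import date

Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])

# Instances can be created with keyword arguments, just like a regular class.
row = Dataset(name="Jane Doe", account_balance=1234.56, birth_date=date(1990, 1, 31))

# The field names are available on the class; pandas uses them as column names.
print(Dataset._fields)  # ('name', 'account_balance', 'birth_date')

# Attribute access, conversion to a dict and tuple unpacking all work.
print(row.account_balance)  # 1234.56
as_dict = row._asdict()
name, balance, born = row

# namedtuples are immutable; _replace returns a modified copy instead.
updated = row._replace(account_balance=0.0)
print(updated.account_balance)  # 0.0
```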
Step 3: Define a Factory that creates test datasets according to your specifications
In this step, Factory Boy and Faker come into play. Using the Factory class and the Faker wrapper from the factory module, our specification for the dataset is as follows.
from factory import Faker, Factory


class DatasetFactory(Factory):
    """Factory creating test datasets"""

    class Meta:
        model = Dataset

    name = Faker("name")
    account_balance = Faker("pyfloat", left_digits=6, right_digits=2)
    birth_date = Faker("date_of_birth", minimum_age=18)
First, in the inner Meta class, we tell our Factory which object it shall create by assigning our Dataset class to the model attribute. Second and last, we specify what kind of data belongs to which feature of our dataset using Faker providers. In this case, we tell our Factory that the attribute name shall be a name (drawn from Faker's default locale, en_US, unless you configure another one), that account_balance shall be a float with 6 digits before and 2 digits after the decimal point (as is usual for most currencies) and, finally, that birth_date shall be a date of birth where the minimum age is 18.
Using the Factory
There are three basic uses of our DatasetFactory. First, to use the Factory with the specifications as-is, simply call the standard constructor with no arguments.
Example output of the DatasetFactory.
In [4]: DatasetFactory()
Out[4]: Dataset(name='Karen Dunn', account_balance=621653.75, birth_date=date(1980, 4, 14))

In [5]: DatasetFactory()
Out[5]: Dataset(name='Karen Murray', account_balance=-97709.61, birth_date=date(1921, 6, 29))
Second, for certain test cases it might be necessary to assign a fixed value to an attribute. In such cases, you may supply appropriate keyword arguments to the constructor.
Fixing values with the DatasetFactory.
In [6]: DatasetFactory(account_balance=-10000)
Out[6]: Dataset(name='Danny Casey', account_balance=-10000, birth_date=date(1998, 6, 14))
Third and last, if you wish to generate a batch of test data, the class method create_batch will be your tool of choice. Here, too, you may supply fixed values as keyword arguments.
Creating batches.
In [7]: DatasetFactory.create_batch(size=5)
Out[7]:
[Dataset(name='Amanda Dickerson', account_balance=514402.64, birth_date=date(1908, 5, 26)),
 Dataset(name='Katherine Johnson', account_balance=-365522.94, birth_date=date(1907, 12, 12)),
 Dataset(name='Christian Stevenson', account_balance=824680.23, birth_date=date(1983, 8, 12)),
 Dataset(name='Robert Stewart', account_balance=279501.88, birth_date=date(1954, 4, 19)),
 Dataset(name='Melissa Snyder', account_balance=-40896.64, birth_date=date(1941, 1, 6))]

In [8]: DatasetFactory.create_batch(size=3, account_balance=500)
Out[8]:
[Dataset(name='Tanya Hernandez', account_balance=500, birth_date=date(1996, 11, 29)),
 Dataset(name='Samuel Boyd', account_balance=500, birth_date=date(1919, 7, 24)),
 Dataset(name='Jennifer Edwards', account_balance=500, birth_date=date(1978, 1, 5))]
Step 4: Create a test dataframe and supply the DatasetFactory’s output
For the last step, we exploit the fact that DataFrames are compatible with namedtuples. Namely, if you call the DataFrame’s constructor with a list of namedtuples, pandas will create a DataFrame with columns named after the namedtuple’s attributes. As a result, the transformation of a batch of Dataset objects into a DataFrame reduces to one line of code.
import pandas as pd

df = pd.DataFrame(data=DatasetFactory.create_batch(size=10))
Here’s a sample output.
The final result: Our test dataset as a DataFrame.
In [5]: df
Out[5]:
               name  account_balance  birth_date
0    Abigail Joseph       -186809.54  1941-02-12
1      Hannah Brown       -332618.35  1930-08-11
2       Angela Hunt        -60649.82  1905-08-06
3     Shelby Hudson        445009.65  1986-02-24
4       Lori Gordon       -921797.72  1912-10-05
5  Daniel Rodriguez        622570.37  1966-02-14
6      Carol Morris       -964213.50  1914-01-18
7  Jessica Anderson        804757.24  1965-01-06
8  Veronica Edwards       -471469.46  1926-04-22
9      Larry Medina        987186.81  1926-12-12
Additional work after creating the test data
If you really want to make sure that your transformations convert dates correctly, you will have to apply an extra step. As it stands now, the column birth_date consists of Python date objects. To convert them to strings of the desired format, you can use the strftime method.
df["birth_date"] = df["birth_date"].apply(lambda d: d.strftime("%Y-%m-%d"))
Hello! This is really great, thanks for the post 🙂
I have a quick question. What would you suggest doing if some of the columns in my dataframe begin with a number?
Good catch! Indeed, a column name starting with a number is disastrous to this approach because field names of namedtuples have to be valid identifiers. In such a case, I’d choose between the following two options.
The first one is to strip the leading numbers from the column names and proceed as described in this blog post. Afterwards, rename the columns of the resulting dataframe as desired, for instance via
df.columns = ["0name", "1account_balance", "2birth_date"]
The second solution is a little more advanced. We’ll skip the second step of creating a namedtuple and immediately create a DictFactory instead:
Here, we create a Factory with a single field, a wrapper dictionary, that contains the actual data of our dataset. By overriding the _create class method, we can unwrap it again so that a dictionary with test data is returned. The final step of creating the pandas.DataFrame may remain the same. The only caveat of this approach is that you have to use the prefix “wrapper_dict__” when assigning a constant value to a generated Dataset:
DatasetFactory(wrapper_dict__0name="James")
would return a dictionary with the “0name” key having the value “James”.