Creating test data is essential for data scientists and data engineers, especially when working with large datasets that need to be transformed. It goes without saying that such transformations should be tested thoroughly: You do not want to wait for a few minutes for your transformation to finish only to realize you’ve misspelled the column name, applied the wrong formula or applied the right formula to the wrong columns! Consequently, you need to create test data and write test functions that apply your transformations to it.
Naturally, the process of creating such test data is tedious. Even more naturally, data scientists and data engineers first confronted with this process tried to establish best practices so that following generations do not have to waste time. However, what I found in my Google searches was not really satisfying. Most blog posts recommend using Faker to generate test data. While this certainly is a good starting point, the process of turning the generated test data into DataFrames in those blog posts felt clunky to me. Because I knew that Factory Boy is able to provide factories for data generated by Faker and is used frequently for testing Django apps, I developed the following short and easy-to-apply approach.
Note: The following method is appropriate for generating small- to medium-sized test data. If you want to generate large datasets and performance is critical, you will be better off using mimesis. There is also an integration of mimesis into Factory Boy, so the following method remains feasible for large datasets.
Step 1: Prerequisites
Of course, you need to install pandas. Other than that, you do not need to install Faker explicitly; instead, it suffices to install Factory Boy (which in turn has Faker as a dependency). Depending on whether you use pip or conda, one of the following two commands should suffice.
pip install factory_boy
conda install factory_boy
Step 2: Define a namedtuple containing (a selection of) the features of your dataset
As its name suggests, a namedtuple is an extended version of a plain Python tuple. It is a class with specified attributes and utility methods that help you construct instances. Assume that our dataset consists of a name, an account balance (USD) and a birth date in the format YYYY-MM-DD. Based on this, our namedtuple has to look like this.
from collections import namedtuple

Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])
With only one line of code (well, not counting the import statement), we defined a new class Dataset with three attributes and got a lot of goodies for free. Most importantly, namedtuples are compatible with pandas.DataFrames and with Factory Boy.
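To make the "goodies" a little more concrete, here is a minimal sketch that uses only the standard library. The values are illustrative and not part of any real dataset.

```python
from collections import namedtuple
from datetime import date

Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])

# Instances can be built positionally or by keyword.
row = Dataset(name="Karen Dunn", account_balance=621653.75, birth_date=date(1980, 4, 14))

# Fields are reachable by attribute or by index, like in an ordinary tuple.
print(row.name, row[1])

# _fields lists the attribute names in declaration order.
print(Dataset._fields)

# _asdict() returns a plain mapping; _replace() returns a modified copy,
# because namedtuples are immutable and the original stays unchanged.
as_dict = row._asdict()
overdrawn = row._replace(account_balance=-10000)
print(as_dict["birth_date"], overdrawn.account_balance, row.account_balance)
```

The ordered field names in _fields are what allows other tools to derive column names from a namedtuple.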
Step 3: Define a Factory that creates test datasets according to your specifications
In this step, Factory Boy and Faker come into play. Using the Factory class and the Faker wrapper from the factory module, our specification for the dataset is as follows.
from factory import Faker, Factory


class DatasetFactory(Factory):
    """Factory creating test datasets"""

    class Meta:
        model = Dataset

    name = Faker("name")
    account_balance = Faker("pyfloat", left_digits=6, right_digits=2)
    birth_date = Faker("date_of_birth", minimum_age=18)
First, we tell our Factory in the inner Meta class what object it shall create by assigning our Dataset class to the model attribute. Second and last, we specify what kind of data belongs to which feature of our dataset using Faker providers. In this case, we tell our Factory that the attribute name shall be a name (in Faker's default locale, en_US, unless you configure another one), that account_balance shall be a float with up to 6 digits before and 2 digits after the decimal point (as is usual for most currencies) and, finally, that birth_date shall be a date of birth where the minimum age is 18.
There are three basic uses of our DatasetFactory. First, to use the Factory with the specifications as-is, simply call the standard constructor with no arguments.
Example output of the DatasetFactory
In : DatasetFactory()
Out: Dataset(name='Karen Dunn', account_balance=621653.75, birth_date=date(1980, 4, 14))

In : DatasetFactory()
Out: Dataset(name='Karen Murray', account_balance=-97709.61, birth_date=date(1921, 6, 29))
Second, for certain test cases it might be necessary to assign a fixed value to an attribute. In such cases, you may supply appropriate keyword arguments to the constructor.
Fixing values with the DatasetFactory
In : DatasetFactory(account_balance=-10000)
Out: Dataset(name='Danny Casey', account_balance=-10000, birth_date=date(1998, 6, 14))
Third and last, if you wish to generate a batch of test data, the class method create_batch will be your tool of choice. You may also supply fixed values as keyword arguments.
In : DatasetFactory.create_batch(size=5)
Out: [Dataset(name='Amanda Dickerson', account_balance=514402.64, birth_date=date(1908, 5, 26)),
      Dataset(name='Katherine Johnson', account_balance=-365522.94, birth_date=date(1907, 12, 12)),
      Dataset(name='Christian Stevenson', account_balance=824680.23, birth_date=date(1983, 8, 12)),
      Dataset(name='Robert Stewart', account_balance=279501.88, birth_date=date(1954, 4, 19)),
      Dataset(name='Melissa Snyder', account_balance=-40896.64, birth_date=date(1941, 1, 6))]

In : DatasetFactory.create_batch(size=3, account_balance=500)
Out: [Dataset(name='Tanya Hernandez', account_balance=500, birth_date=date(1996, 11, 29)),
      Dataset(name='Samuel Boyd', account_balance=500, birth_date=date(1919, 7, 24)),
      Dataset(name='Jennifer Edwards', account_balance=500, birth_date=date(1978, 1, 5))]
Step 4: Create a test DataFrame and supply the DatasetFactory's output as its data
For the last step, we exploit the fact that DataFrames are compatible with namedtuples. Namely, if you call the DataFrame's constructor with a list of namedtuples, pandas will create a DataFrame with columns named after the namedtuple's attributes. As a result, the transformation of a batch of Dataset objects into a DataFrame reduces to one line of code.
import pandas as pd

df = pd.DataFrame(data=DatasetFactory.create_batch(size=10))
Here's a sample output.
The final result: Our test dataset as a DataFrame
In : df
Out:
               name  account_balance  birth_date
0    Abigail Joseph       -186809.54  1941-02-12
1      Hannah Brown       -332618.35  1930-08-11
2       Angela Hunt        -60649.82  1905-08-06
3     Shelby Hudson        445009.65  1986-02-24
4       Lori Gordon       -921797.72  1912-10-05
5  Daniel Rodriguez        622570.37  1966-02-14
6      Carol Morris       -964213.50  1914-01-18
7  Jessica Anderson        804757.24  1965-01-06
8  Veronica Edwards       -471469.46  1926-04-22
9      Larry Medina        987186.81  1926-12-12
Additional work after creating the test data
If you really want to make sure that your transformations convert dates correctly, you will have to apply an extra step. As it stands now, the column birth_date consists of Python date objects. To convert them to strings of the desired format, you can apply strftime to each entry:
df["birth_date"] = df["birth_date"].apply(lambda d: d.strftime("%Y-%m-%d"))
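The effect of this conversion can be checked on a tiny hand-written DataFrame. The sketch below uses illustrative rows instead of the factory, and the final parsing step with pd.to_datetime stands in for a hypothetical transformation under test.

```python
from collections import namedtuple
from datetime import date

import pandas as pd

Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])

rows = [
    Dataset("Karen Dunn", 621653.75, date(1980, 4, 14)),
    Dataset("Danny Casey", -10000.0, date(1998, 6, 14)),
]
df = pd.DataFrame(data=rows)

# Before the conversion, the column holds datetime.date objects.
print(type(df.loc[0, "birth_date"]))

# Convert each date to a string in the format YYYY-MM-DD.
df["birth_date"] = df["birth_date"].apply(lambda d: d.strftime("%Y-%m-%d"))
print(df["birth_date"].tolist())  # ['1980-04-14', '1998-06-14']

# Stand-in for a transformation that has to parse the strings again.
parsed = pd.to_datetime(df["birth_date"], format="%Y-%m-%d")
print(parsed.dt.year.tolist())  # [1980, 1998]
```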
Hello! This is really great, thanks for the post 🙂
I have a quick question. What would you suggest doing if some of the columns in my dataframe begin with a number?
Good catch! Indeed, a column name starting with a number is disastrous to this approach because field names of namedtuples have to be valid identifiers. In such a case, I’d choose between the following two options.
The first one is to strip the leading numbers from the column names and proceed as described in this blog post. Afterwards, rename the columns of the resulting DataFrame as desired, for instance via
df.columns = ["0name", "1account_balance", "2birth_date"]
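The restriction on field names and the first option can be illustrated with the standard library alone. In this sketch, the restore mapping is a hypothetical helper for renaming the columns later on.

```python
from collections import namedtuple

columns = ["0name", "1account_balance", "2birth_date"]

# Field names must be valid identifiers, so this attempt is rejected.
rejected = False
try:
    namedtuple("Dataset", columns)
except ValueError as error:
    rejected = True
    print(f"rejected: {error}")

# Option one: strip the leading digits to obtain valid field names ...
field_names = [column.lstrip("0123456789") for column in columns]
Dataset = namedtuple("Dataset", field_names)
print(Dataset._fields)

# ... and keep a mapping for restoring the real column names afterwards,
# for instance with df.rename(columns=restore) on the finished DataFrame.
restore = dict(zip(field_names, columns))
print(restore)
```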
The second solution is a little more advanced. We skip the second step of creating a namedtuple and immediately create a DictFactory instead. The idea is to create a Factory with a single wrapper dictionary as its only field, which contains the actual data of our dataset. By overriding the _create class method, we can unwrap it again so that a dictionary with test data is returned. The final step of creating the pandas.DataFrame may remain the same. The only caveat of this approach is that you have to use the prefix "wrapper_dict__" when assigning a constant value to a generated Dataset. For example, calling the factory with wrapper_dict__0name="James" would return a dictionary with the "0name" key having the value "James".