Creating test data with Faker and Factory Boy

Creating test data is essential for data scientists and data engineers, especially when working with large datasets that need to be transformed. It goes without saying that such transformations should be tested thoroughly: You do not want to wait for a few minutes for your transformation to finish only to realize you’ve misspelled the column name, applied the wrong formula or applied the right formula to the wrong columns! Consequently, you need to create test data and write test functions that apply your transformations to it.

Naturally, the process of creating such test data is tedious. Even more naturally, data scientists and data engineers first confronted with this process tried to establish best practices so that later generations do not have to waste time. However, what I found in my Google searches was not really satisfying. Most blog posts recommend using Faker to generate test data. While this certainly is a good starting point, the process of turning the generated test data into DataFrames in those blog posts felt clunky to me. Because I knew that Factory Boy can provide factories for data generated by Faker and is frequently used for testing Django apps, I developed the following short and easy-to-apply approach.

Note: The following method is appropriate for generating small- to medium-sized test data. If you want to generate large datasets and performance is critical, you will be better off using mimesis. Additionally, there is an integration of mimesis with Factory Boy, so the following method is also feasible for large datasets.

Step 1: Prerequisites

Of course, you need to install pandas. Other than that, you do not need to install Faker explicitly; instead, it suffices to install Factory Boy (which in turn has Faker as a dependency). If you use pip or conda, one of the following two commands should suffice.

pip install factory_boy

conda install factory_boy

Step 2: Define a namedtuple containing (a selection of) the features of your dataset

As its name suggests, a namedtuple is an extended version of a plain Python tuple. It is a class with specified attributes and utility methods that help you construct instances. Assume that our dataset consists of a name, an account balance (USD) and a birth date in the format YYYY-MM-DD. Based on this, our namedtuple has to look like this.

from collections import namedtuple

Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])

With only one line of code (well, at least without the import statement), we defined a new class Dataset with three attributes and got a lot of goodies for free. Most importantly, namedtuples are compatible with pandas.DataFrames and with Factory Boy.
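To make those "goodies" concrete, here is a minimal sketch using only the standard library. The sample values (Jane Doe and so on) are made up for illustration and are not part of the original dataset.

```python
from collections import namedtuple

Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])

# Instances are built like any other class, positionally or by keyword.
# The values below are illustrative placeholders.
row = Dataset(name="Jane Doe", account_balance=1250.5, birth_date="1990-01-31")

# Fields are available by attribute and by position.
print(row.name)          # Jane Doe
print(row[1])            # 1250.5

# _fields lists the attribute names in declaration order. This is the
# information a consumer can use to derive column names.
print(Dataset._fields)   # ('name', 'account_balance', 'birth_date')

# _asdict() returns a dict of the fields; _replace() returns a modified
# copy, because namedtuples are immutable like plain tuples.
as_dict = row._asdict()
updated = row._replace(account_balance=0.0)
```

Because instances are immutable, _replace leaves the original row untouched, which is convenient when several test cases share one base record.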

Step 3: Define a Factory that creates test datasets according to your specifications

In this step, Factory Boy and Faker come into play. Using the Factory class and the Faker wrapper from the factory module, our specification for the dataset is as follows.

from factory import Faker, Factory

class DatasetFactory(Factory):
    """Factory creating test datasets"""

    class Meta:
        model = Dataset

    name = Faker("name")
    account_balance = Faker("pyfloat", left_digits=6, right_digits=2)
    birth_date = Faker("date_of_birth", minimum_age=18)

First, we tell our Factory in the inner Meta class which object it should create by assigning our Dataset class to the model attribute. Second, we specify what kind of data belongs to each feature of our dataset using Faker providers. In this case, we tell our Factory that the attribute name shall be a name (generated for Faker’s default locale, en_US, unless you pass a different locale), that account_balance shall be a float with up to 6 digits before and 2 digits after the decimal point (as is usual for most currencies) and, finally, that birth_date shall be a date of birth where the minimum age is 18.

Using the Factory

There are three basic uses of our DatasetFactory. First, to use the Factory with the specifications as-is, simply call the standard constructor with no arguments.

Example output of the DatasetFactory.

In [4]: DatasetFactory()
Out[4]: Dataset(name='Karen Dunn', account_balance=621653.75, birth_date=date(1980, 4, 14))

In [5]: DatasetFactory()
Out[5]: Dataset(name='Karen Murray', account_balance=-97709.61, birth_date=date(1921, 6, 29))

Second, for certain test cases it might be necessary to assign a fixed value to an attribute. In such cases, you may supply appropriate keyword arguments to the constructor.

Fixing values with the DatasetFactory.

In [6]: DatasetFactory(account_balance=-10000)
Out[6]: Dataset(name='Danny Casey', account_balance=-10000, birth_date=date(1998, 6, 14))

Third and last, if you wish to generate a batch of test data, the class method create_batch will be your tool of choice. You may also supply fixed values as keyword arguments.

Creating batches.

In [7]: DatasetFactory.create_batch(size=5)
[Dataset(name='Amanda Dickerson', account_balance=514402.64, birth_date=date(1908, 5, 26)),
 Dataset(name='Katherine Johnson', account_balance=-365522.94, birth_date=date(1907, 12, 12)),
 Dataset(name='Christian Stevenson', account_balance=824680.23, birth_date=date(1983, 8, 12)),
 Dataset(name='Robert Stewart', account_balance=279501.88, birth_date=date(1954, 4, 19)),
 Dataset(name='Melissa Snyder', account_balance=-40896.64, birth_date=date(1941, 1, 6))]

In [8]: DatasetFactory.create_batch(size=3, account_balance=500)
[Dataset(name='Tanya Hernandez', account_balance=500, birth_date=date(1996, 11, 29)),
 Dataset(name='Samuel Boyd', account_balance=500, birth_date=date(1919, 7, 24)),
 Dataset(name='Jennifer Edwards', account_balance=500, birth_date=date(1978, 1, 5))]

Step 4: Create a test dataframe and supply the DatasetFactory’s output

For the last step, we exploit the fact that DataFrames are compatible with namedtuples. Namely, if you call the DataFrame’s constructor with a list of namedtuples, pandas will create a DataFrame with columns named after the namedtuple’s attributes. As a result, the transformation of a batch of Dataset objects into a DataFrame reduces to one line of code.

import pandas as pd

df = pd.DataFrame(data=DatasetFactory.create_batch(size=10))

Here’s a sample output.

The final result: Our test dataset as a DataFrame.

In [5]: df
               name  account_balance  birth_date
0    Abigail Joseph       -186809.54  1941-02-12
1      Hannah Brown       -332618.35  1930-08-11
2       Angela Hunt        -60649.82  1905-08-06
3     Shelby Hudson        445009.65  1986-02-24
4       Lori Gordon       -921797.72  1912-10-05
5  Daniel Rodriguez        622570.37  1966-02-14
6      Carol Morris       -964213.50  1914-01-18
7  Jessica Anderson        804757.24  1965-01-06
8  Veronica Edwards       -471469.46  1926-04-22
9      Larry Medina        987186.81  1926-12-12

Additional work after creating the test data

If you really want to make sure that your transformations convert dates correctly, you will have to apply an extra step. As it stands now, the column birth_date consists of Python date objects. To convert them to strings of the desired format, you can use the strftime method.

df["birth_date"] = df["birth_date"].apply(lambda d: d.strftime("%Y-%m-%d"))
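To see the effect of this conversion in isolation, here is a small sketch that uses a hand-built DataFrame instead of the factory. The two dates are taken from the sample output above; everything else is just for illustration.

```python
from datetime import date

import pandas as pd

# A tiny stand-in for the generated data: one column of datetime.date objects.
df = pd.DataFrame({"birth_date": [date(1941, 2, 12), date(1986, 2, 24)]})

# Before the conversion, each cell holds a date object.
first_before = df.loc[0, "birth_date"]

# Same conversion as above: format every date as YYYY-MM-DD.
df["birth_date"] = df["birth_date"].apply(lambda d: d.strftime("%Y-%m-%d"))

print(df["birth_date"].tolist())  # ['1941-02-12', '1986-02-24']
```

After this step, the column contains plain strings, so a transformation that parses dates from text is exercised the same way it would be on raw input.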