Creating test data is essential for data scientists and data engineers, especially when working with large datasets that need to be transformed. Such transformations should be tested thoroughly: you do not want to wait several minutes for a transformation to finish only to realize that you misspelled a column name, applied the wrong formula, or applied the right formula to the wrong columns! Consequently, you need to create test data and write test functions that apply your transformations to it.

Naturally, the process of creating such test data is tedious. Even more naturally, data scientists and data engineers who first faced this process tried to establish best practices so that later generations would not have to waste time. However, what I found in my Google searches was not really satisfying. Most blog posts recommend using Faker to generate test data. While this is certainly a good starting point, the way those posts turn the generated data into DataFrames felt clunky to me. Because I knew that Factory Boy can provide factories for data generated by Faker and is frequently used for testing Django apps, I developed the following short and easy-to-apply approach.

Note: The following method is appropriate for generating small- to medium-sized test data. If you want to generate large datasets and performance is critical, you will be better off using mimesis. There is also an integration of mimesis into Factory Boy, so the following method remains feasible for large datasets.

Step 1: Prerequisites

Of course, you need to install pandas. Beyond that, you do not need to install Faker explicitly; it suffices to install Factory Boy, which has Faker as a dependency. If you use pip or conda, one of the following two commands should suffice.

pip install factory_boy
Installing Factory Boy using pip
conda install factory_boy
Installing Factory Boy using conda

Step 2: Define a namedtuple containing (a selection of) the features of your dataset

As its name suggests, a namedtuple is an extended version of a plain Python tuple. Calling namedtuple creates a tuple subclass with named fields and utility methods that help you construct and inspect instances. Assume that each record in our dataset consists of a name, an account balance (USD) and a birth date in the format YYYY-MM-DD. Based on this, our namedtuple looks like this.

from collections import namedtuple


Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])

With only one line of code (well, at least without the import statement), we defined a new class Dataset with three attributes and got a lot of goodies for free. Most importantly, namedtuples are compatible with pandas.DataFrames and with Factory Boy.
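To make those "goodies" concrete, here is a small sketch that uses only the standard library. The sample values are invented for illustration; the methods shown (_fields, _asdict, _replace) are part of every namedtuple class.

```python
from collections import namedtuple
from datetime import date

Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])

# Sample values are invented for illustration.
row = Dataset(name="Jane Doe", account_balance=1234.56, birth_date=date(1990, 1, 31))

# Fields are accessible by name and by position.
print(row.name, row[1])

# _fields lists the attribute names in definition order.
print(Dataset._fields)

# _asdict converts an instance into a dictionary keyed by field name.
print(row._asdict())

# Instances are immutable; _replace returns a modified copy instead.
updated = row._replace(account_balance=0.0)
print(updated.account_balance, row.account_balance)
```

The ordered field names in _fields are what later allows pandas to name the DataFrame columns.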

Step 3: Define a Factory that creates test datasets according to your specifications

In this step, Factory Boy and Faker come into play. Using the Factory class and the Faker wrapper from the factory module, our specification for the dataset is as follows.

from factory import Faker, Factory


class DatasetFactory(Factory):
    """Factory creating test datasets"""

    class Meta:
        model = Dataset

    name = Faker("name")
    account_balance = Faker("pyfloat", left_digits=6, right_digits=2)
    birth_date = Faker("date_of_birth", minimum_age=18)

First, we tell our Factory in the inner Meta class what object it shall create by assigning our Dataset class to the model attribute. Second and last, we specify what kind of data belongs to which feature of our dataset using Faker providers. In this case, we tell our Factory that the attribute name shall be a name (drawn from Faker's default locale, en_US, unless you specify another one), that account_balance shall be a float with 6 left digits and 2 right digits (as is usual for most currencies) and, finally, that birth_date shall be a date of birth where the minimum age is 18.

Using the Factory

There are three basic uses of our DatasetFactory. First, to use the Factory with the specifications as-is, simply call the standard constructor with no arguments.

In [4]: DatasetFactory()
Out[4]: Dataset(name='Karen Dunn', account_balance=621653.75, birth_date=date(1980, 4, 14))

In [5]: DatasetFactory()
Out[5]: Dataset(name='Karen Murray', account_balance=-97709.61, birth_date=date(1921, 6, 29))
Example output of the DatasetFactory

Second, for certain test cases it might be necessary to assign a fixed value to an attribute. In such cases, you may supply appropriate keyword arguments to the constructor.

In [6]: DatasetFactory(account_balance=-10000)
Out[6]: Dataset(name='Danny Casey', account_balance=-10000, birth_date=date(1998, 6, 14))
Fixing values with the DatasetFactory

Third and last, if you wish to generate a batch of test data, the class method create_batch will be your tool of choice. You may also supply fixed values as keyword arguments.

In [7]: DatasetFactory.create_batch(size=5)
Out[7]:
[Dataset(name='Amanda Dickerson', account_balance=514402.64, birth_date=date(1908, 5, 26)),
 Dataset(name='Katherine Johnson', account_balance=-365522.94, birth_date=date(1907, 12, 12)),
 Dataset(name='Christian Stevenson', account_balance=824680.23, birth_date=date(1983, 8, 12)),
 Dataset(name='Robert Stewart', account_balance=279501.88, birth_date=date(1954, 4, 19)),
 Dataset(name='Melissa Snyder', account_balance=-40896.64, birth_date=date(1941, 1, 6))]

In [8]: DatasetFactory.create_batch(size=3, account_balance=500)
Out[8]:
[Dataset(name='Tanya Hernandez', account_balance=500, birth_date=date(1996, 11, 29)),
 Dataset(name='Samuel Boyd', account_balance=500, birth_date=date(1919, 7, 24)),
 Dataset(name='Jennifer Edwards', account_balance=500, birth_date=date(1978, 1, 5))]
Creating batches

Step 4: Create a test dataframe and supply the DatasetFactory's output

For the last step, we exploit the fact that DataFrames are compatible with namedtuples. Namely, if you call the DataFrame's constructor with a list of namedtuples, pandas will create a DataFrame with columns named after the namedtuple's attributes. As a result, the transformation of a batch of Dataset objects into a DataFrame reduces to one line of code.

import pandas as pd


df = pd.DataFrame(data=DatasetFactory.create_batch(size=10))
It's all coming together: pandas and our DatasetFactory

Here's a sample output.

In [5]: df
Out[5]:
               name  account_balance  birth_date
0    Abigail Joseph       -186809.54  1941-02-12
1      Hannah Brown       -332618.35  1930-08-11
2       Angela Hunt        -60649.82  1905-08-06
3     Shelby Hudson        445009.65  1986-02-24
4       Lori Gordon       -921797.72  1912-10-05
5  Daniel Rodriguez        622570.37  1966-02-14
6      Carol Morris       -964213.50  1914-01-18
7  Jessica Anderson        804757.24  1965-01-06
8  Veronica Edwards       -471469.46  1926-04-22
9      Larry Medina        987186.81  1926-12-12
The final result: Our test dataset as a DataFrame
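The namedtuple-to-DataFrame behavior does not depend on Factory Boy. The following small sketch uses two hand-written rows (values invented for illustration) in place of a generated batch to show where the column names come from.

```python
from collections import namedtuple
from datetime import date

import pandas as pd

Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])

# Hand-written rows stand in for DatasetFactory.create_batch output.
rows = [
    Dataset("Jane Doe", 1234.56, date(1990, 1, 31)),
    Dataset("John Roe", -78.90, date(1975, 12, 24)),
]

# pandas takes the column names from the namedtuple's fields.
df = pd.DataFrame(data=rows)

print(list(df.columns))
print(df.dtypes)
```

Inspecting the dtypes is a quick way to check what each column holds before you feed the frame into a transformation.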

Additional work after creating the test data

If you really want to make sure that your transformations convert dates correctly, you will have to apply an extra step. As it stands, the column birth_date consists of Python date objects. To convert them to strings in the desired format, you can use the strftime method.

df["birth_date"] = df["birth_date"].apply(lambda d: d.strftime("%Y-%m-%d"))
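To round things off, here is a sketch of what a test function for a transformation might look like. The transformation add_birth_year is hypothetical and exists only for this example, and hand-written rows replace the generated batch so that the block depends on pandas alone.

```python
from datetime import date

import pandas as pd


def add_birth_year(df):
    """Hypothetical transformation: parse YYYY-MM-DD strings and add a birth_year column."""
    parsed = pd.to_datetime(df["birth_date"], format="%Y-%m-%d")
    return df.assign(birth_year=parsed.dt.year)


def test_add_birth_year():
    # Hand-written rows stand in for DatasetFactory.create_batch output.
    df = pd.DataFrame(
        {
            "name": ["Jane Doe", "John Roe"],
            "account_balance": [1234.56, -78.90],
            "birth_date": [date(1990, 1, 31), date(1975, 12, 24)],
        }
    )
    # Convert the date objects to strings in the format YYYY-MM-DD.
    df["birth_date"] = df["birth_date"].apply(lambda d: d.strftime("%Y-%m-%d"))

    result = add_birth_year(df)

    # The input column is untouched and the new column holds the parsed years.
    assert result["birth_date"].tolist() == ["1990-01-31", "1975-12-24"]
    assert result["birth_year"].tolist() == [1990, 1975]


test_add_birth_year()
```

With pytest, the final call is unnecessary because the test runner discovers functions whose names start with test_.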