Creating test data with Faker and Factory Boy

Creating test data is essential for data scientists and data engineers, especially when working with large datasets that need to be transformed. It goes without saying that such transformations should be tested thoroughly: You do not want to wait for a few minutes for your transformation to finish only to realize you’ve misspelled the column name, applied the wrong formula or applied the right formula to the wrong columns! Consequently, you need to create test data and write test functions that apply your transformations to it.

Naturally, the process of creating such test data is tedious. Even more naturally, data scientists and data engineers first confronted with this process tried to establish best practices so that following generations do not have to waste time. However, what I found in my Google searches was not really satisfying. Most blog posts recommend using Faker to generate test data. While this is certainly a good starting point, the process of turning the generated test data into DataFrames in those blog posts felt clunky to me. Because I knew that Factory Boy can provide factories for data generated by Faker and is used frequently for testing Django apps, I developed the following short and easy-to-apply approach.

Note: The following method is appropriate for generating small- to medium-sized test data. If you want to generate large datasets and performance is critical, you will be better off using mimesis. Additionally, mimesis integrates with Factory Boy, so the following method remains feasible for large datasets as well.

Step 1: Prerequisites

Of course, you need to install pandas. Other than that, you do not need to install Faker explicitly; it suffices to install Factory Boy (which in turn has Faker as a dependency). Depending on whether you use pip or conda, one of the following two commands will do.

pip install factory_boy

conda install factory_boy

Step 2: Define a namedtuple containing (a selection of) the features of your dataset

As its name suggests, a namedtuple is an extended version of a plain Python tuple. It is a class with specified attributes and utility methods that assist you in constructing instances. Assume that our dataset consists of a name, an account balance (in USD) and a birth date in the format YYYY-MM-DD. Based on this, our namedtuple looks like this.

from collections import namedtuple


Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])

With only one line of code (well, at least without the import statement), we defined a new class Dataset with three attributes and got a lot of goodies for free. Most importantly, namedtuples are compatible with pandas.DataFrames and with Factory Boy.
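Just to illustrate those goodies, here is how an instance could be constructed and inspected by hand; the values are made up and the snippet is only for illustration.

record = Dataset(name="Jane Doe", account_balance=1234.56, birth_date="1990-05-01")
record.name       # 'Jane Doe'
record._fields    # ('name', 'account_balance', 'birth_date')
record._asdict()  # {'name': 'Jane Doe', 'account_balance': 1234.56, 'birth_date': '1990-05-01'}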

Step 3: Define a Factory that creates test datasets according to your specifications

In this step, Factory Boy and Faker come into play. Using the Factory class and the Faker wrapper from the factory module, our specification for the dataset is as follows.

from factory import Faker, Factory


class DatasetFactory(Factory):
    """Factory creating test datasets"""

    class Meta:
        model = Dataset

    name = Faker("name")
    account_balance = Faker("pyfloat", left_digits=6, right_digits=2)
    birth_date = Faker("date_of_birth", minimum_age=18)

First, we tell our Factory in the inner Meta class what object it shall create by assigning our Dataset class to the model attribute. Second and last, we specify what kind of data belongs to which feature of our dataset using Faker providers. In this case, we tell our Factory that the attribute name shall be a name (adhering to the system locale), that account_balance shall be a float of 6 left digits and 2 right digits (as is usual for most currencies) and, finally, that birth_date shall be a date of birth where the minimum age is 18.

Using the Factory

There are three basic uses of our DatasetFactory. First, to use the Factory with the specifications as-is, simply call the standard constructor with no arguments.

Example output of the DatasetFactory.

In [4]: DatasetFactory()
Out[4]: Dataset(name='Karen Dunn', account_balance=621653.75, birth_date=date(1980, 4, 14))

In [5]: DatasetFactory()
Out[5]: Dataset(name='Karen Murray', account_balance=-97709.61, birth_date=date(1921, 6, 29))

Second, for certain test cases it might be necessary to assign a fixed value to an attribute. In such cases, you may supply appropriate keyword arguments to the constructor.

Fixing values with the DatasetFactory.

In [6]: DatasetFactory(account_balance=-10000)
Out[6]: Dataset(name='Danny Casey', account_balance=-10000, birth_date=date(1998, 6, 14))

Third and last, if you wish to generate a batch of test data, the class method create_batch will be your tool of choice. You may also supply fixed values as keyword arguments.

Creating batches.

In [7]: DatasetFactory.create_batch(size=5)
Out[7]:
[Dataset(name='Amanda Dickerson', account_balance=514402.64, birth_date=date(1908, 5, 26)),
 Dataset(name='Katherine Johnson', account_balance=-365522.94, birth_date=date(1907, 12, 12)),
 Dataset(name='Christian Stevenson', account_balance=824680.23, birth_date=date(1983, 8, 12)),
 Dataset(name='Robert Stewart', account_balance=279501.88, birth_date=date(1954, 4, 19)),
 Dataset(name='Melissa Snyder', account_balance=-40896.64, birth_date=date(1941, 1, 6))]

In [8]: DatasetFactory.create_batch(size=3, account_balance=500)
Out[8]:
[Dataset(name='Tanya Hernandez', account_balance=500, birth_date=date(1996, 11, 29)),
 Dataset(name='Samuel Boyd', account_balance=500, birth_date=date(1919, 7, 24)),
 Dataset(name='Jennifer Edwards', account_balance=500, birth_date=date(1978, 1, 5))]

Step 4: Create a test DataFrame and supply the DatasetFactory’s output

For the last step, we exploit the fact that DataFrames are compatible with namedtuples. Namely, if you call the DataFrame’s constructor with a list of namedtuples, pandas will create a DataFrame with columns named after the namedtuple’s attributes. As a result, the transformation of a batch of Dataset objects into a DataFrame reduces to one line of code.

import pandas as pd


df = pd.DataFrame(data=DatasetFactory.create_batch(size=10))

Here’s a sample output.

The final result: Our test dataset as a DataFrame.

In [5]: df
Out[5]:
               name  account_balance  birth_date
0    Abigail Joseph       -186809.54  1941-02-12
1      Hannah Brown       -332618.35  1930-08-11
2       Angela Hunt        -60649.82  1905-08-06
3     Shelby Hudson        445009.65  1986-02-24
4       Lori Gordon       -921797.72  1912-10-05
5  Daniel Rodriguez        622570.37  1966-02-14
6      Carol Morris       -964213.50  1914-01-18
7  Jessica Anderson        804757.24  1965-01-06
8  Veronica Edwards       -471469.46  1926-04-22
9      Larry Medina        987186.81  1926-12-12

Additional work after creating the test data

If you really want to make sure that your transformations convert dates correctly, you will have to apply an extra step. As it stands now, the column birth_date consists of Python date objects. To convert them to strings of the desired format, you can use the strftime method.

df["birth_date"] = df["birth_date"].apply(lambda d: d.strftime("%Y-%m-%d"))

Classifying SCPs, Part 2: Data transformation (TF-IDF) and preprocessing

After we have obtained data through the means described in the first part of this blog post series, it is time to deal with data transformations and data preprocessing. While humans can comprehend textual information in the form of articles, such information is hard for a Machine Learning algorithm to digest. In this blog post, we will transform the textual information into vectors that assign a number to each word in the vocabulary of the set of articles: this is what TF-IDF (Term Frequency – Inverse Document Frequency) is all about.

In comparison to the web crawler post, this one is more mathematical in nature. Instead of evaluating technical approaches and executing them in a test-driven manner, we will have to understand the mathematical background behind the algorithm to put it to good use.

To make the transition as gentle as possible, let us do a warm-up that is closer to the technical spirit of the last blog post: We use the text files produced by the web crawler from the last blog post to extract the lengths of certain paragraphs and investigate in what way these help us determine the Object Class of an article.

After we have understood how TF-IDF works, we can use it to transform our articles into TF-IDF vectors. Consequently, we will already be able to extract keywords from each article.

Warm-up: Extracting the paragraph lengths

To structure the text from the articles, let us use a custom class.

Basic version of the Article class

class Article(object):
    def __init__(self, label, name, procedures, desc):
        self.label = label.strip()
        self.name = name.strip()
        self.procedures = procedures.strip()
        self.desc = desc.strip()

The logic that splits up the text from the text files into attributes of the class will be a classmethod that accepts a list of lines of text and returns a readily constructed Article instance.

class Article(object):
    # __init__ omitted...

    @classmethod
    def from_text(cls, lines):
        label, *text = lines
        text = "".join(text)
        name, rest = text.split("Special Containment Procedures:")
        procedures, desc = rest.split("Description:")
        return cls(label, name, procedures, desc)

Here’s a basic test that shows how to use the classmethod to obtain an Article instance.

from src.data.article_data import Article


def test_from_text():
    procedures = [
        "Special Containment Procedures: Something...",
        "Something part two...",
    ]
    description = "Description: Something else..."
    article = Article.from_text(["SAFE", "Some name   ", *procedures, description])
    assert article.label == "SAFE"
    assert article.name == "Some name"
    assert article.procedures == "Something...Something part two..."
    assert article.desc == "Something else..."

Validation of the label through a @property

As mentioned in the first part of this series, we are only concentrating on the labels SAFE, EUCLID and KETER. To account for this, we need to validate that the incoming label is one of those. We are a little more lenient and also accept labels that merely start with one of those three labels.

Let us write tests first to define the desired behavior.

import pytest
from src.data.article_data import Article


@pytest.fixture
def article():
    return Article("SAFE", "", "", "")


@pytest.mark.parametrize("label", ["SAFE", "EUCLID", "KETER"])
def test_set_regular_label(article, label):
    article.label = label
    assert article.label == label
    article.label = label + "SOMETHING"
    assert article.label == label


def test_set_unknown_label(article):
    with pytest.raises(ValueError) as excinfo:
        article.label = "unknown"
    assert "unknown" in str(excinfo)

In the tests above, we are using a fixture that gives us an initialized Article instance. Then, we are defining the regular behavior of the setter (we are expecting the label to accept the three main object classes as well as labels that start with those) and what happens when the setter encounters an unknown label (we are expecting a ValueError, enforced via the raises helper).

Because we have not written any validation for the label attribute yet, the tests fail. To implement this kind of validation, Python offers the @property decorator, which allows for custom getter and setter methods.

class Article(object):

    ALLOWED_LABELS = ("SAFE", "EUCLID", "KETER")

    # __init__ omitted...

    @property
    def label(self):
        return self._label

    @label.setter
    def label(self, orig_label):
        labels = [
            label for label in self.ALLOWED_LABELS if orig_label.startswith(label)
        ]
        if not labels:
            raise ValueError(f"Unknown label '{orig_label}'!")
        self._label = labels.pop()

The Python interpreter calls the method decorated with @label.setter as soon as it encounters the assignment to self.label in the __init__ method. As a result, code that uses this class has to deal with ValueErrors when constructing instances.
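As a quick illustration of that last point, client code could guard the construction like this; the argument values are made up.

try:
    article = Article("UNKNOWN", "Some name", "...", "...")
except ValueError as error:
    # The setter rejects labels that do not start with SAFE, EUCLID or KETER.
    print(f"Skipping article: {error}")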

Adding a to_dict method

While the Article class is responsible for extracting information from the articles, it is much easier to use a plain dictionary when persisting the extracted information. That is because the json library can serialize Python dictionaries directly; additionally, the pandas data science library can use dictionaries to construct its main object: a DataFrame. As a result, we need to write a to_dict method that turns an Article instance into a plain dictionary. Aside from the four attributes of the Article class, we also require the dictionary to contain the (character) lengths of the procedures and the description as well as the ratio of these two lengths.

def test_to_dict_trivial_article(article):
    d = article.to_dict()
    assert "Label" in d
    assert d["Label"] == "SAFE"
    assert "Name" in d
    assert "Procedures" in d
    assert "Description" in d
    assert "Procedures_Length" in d
    assert d["Procedures_Length"] == 0
    assert "Description_Length" in d
    assert d["Description_Length"] == 0
    assert "Procedures_Description_Ratio" in d
    assert d["Procedures_Description_Ratio"] == 0


def test_to_dict(article):
    article.name = "Test"
    article.procedures = "TestTest"
    article.desc = "Test"
    d = article.to_dict()
    assert "Label" in d
    assert d["Label"] == "SAFE"
    assert "Name" in d
    assert d["Name"] == "Test"
    assert "Procedures" in d
    assert d["Procedures"] == "TestTest"
    assert "Description" in d
    assert d["Description"] == "Test"
    assert "Procedures_Length" in d
    assert d["Procedures_Length"] == 8
    assert "Description_Length" in d
    assert d["Description_Length"] == 4
    assert "Procedures_Description_Ratio" in d
    assert d["Procedures_Description_Ratio"] == 2

The implementation is straightforward and uses a dictionary comprehension.

    def to_dict(self):
        return {
            "Label": self.label,
            "Name": self.name,
            "Procedures": self.procedures,
            "Description": self.desc,
            "Procedures_Length": len(self.procedures),
            "Description_Length": len(self.desc),
            "Procedures_Description_Ratio": len(self.procedures) / len(self.desc)
            if len(self.desc) > 0
            else 0,
        }

Using the Article class to process the txt files

Finally, we want to use the Article class to process the text files. More precisely, we would like to aggregate the articles into a pandas DataFrame. This object has a to_json method that allows us to persist it for later introspection.

First, let us write a test to pin down our expectations.

import pandas as pd
from click.testing import CliRunner
from src.data.make_dataset import main

TEST_DATA = {
    "scp-002.txt": """EUCLID\n
Item #: 002\n
Special Containment Procedures: Something something...\n
Description: Something else...\n
""",
    "scp-003.txt": """UNKNOWN\n
Item #: 003\n
Special Containment Procedures: Something something...\n
Description: Something else...\n
""",
}


def test_main():
    runner = CliRunner()
    with runner.isolated_filesystem():
        for filename, text in TEST_DATA.items():
            with open(filename, "w") as f:
                f.write(text)
        result = runner.invoke(main, [".", "."])
        assert result.exit_code == 0
        df = pd.read_json("data.json")
        assert len(df.index) == 1
        data = df.loc[0]
        assert "Label" in data
        assert data["Label"] == "EUCLID"
        assert "Name" in data
        assert data["Name"] == "Item #: 002"
        assert "Procedures" in data
        assert data["Procedures"] == "Something something..."
        assert "Description" in data
        assert data["Description"] == "Something else..."

Here, we are using the dictionary TEST_DATA to write two files with two mock articles. The first is a regular article with a valid object class, while the second is an article we do not wish to process. As a result, we expect only one article to be present in the processed data. Note that we are using pandas’ read_json method to obtain a DataFrame and, in turn, we are using DataFrame methods to ensure that only one article is present and that the article data has been split up correctly.

To make this test pass we have to implement the following strategy:

  1. Start with an empty DataFrame.
  2. Parse each text file in the data/raw folder and turn it into an Article.
  3. Use Article’s to_dict method to append the data to the DataFrame.
  4. Persist the DataFrame to a json file in the data/processed folder.

Here’s a sketch implementation that uses the glob method of pathlib’s Path class to iterate over the text files.

import pandas as pd
from pathlib import Path
from src.data.article_data import Article


df = pd.DataFrame({})
for file in Path("data/raw").glob("scp-*.txt"):
    with file.open() as f:
        article = Article.from_text(f.readlines())
    df = df.append(article.to_dict(), ignore_index=True)
df.to_json("data/processed/data.json")

From a software design perspective, this code leaves a lot to be desired. First, there are no log messages that would come in handy when things go wrong. Second, the paths are hard-coded and should be replaced by function parameters. Third and last, the Article’s classmethod from_text throws a ValueError each time it encounters an article with an unknown object class. We have to deal with this kind of situation without letting the entire script fail.

Here’s a revision of the sketch.

import click
import logging.config
import pandas as pd
from pathlib import Path
from src.data.article_data import Article


PROJECT_DIR = Path(__file__).resolve().parents[2]
logging.config.fileConfig(PROJECT_DIR / "logging_config.ini")
logger = logging.getLogger(__name__)


@click.command()
@click.argument("input_filepath", type=click.Path(exists=True))
@click.argument("output_filepath", type=click.Path())
def main(input_filepath, output_filepath):
    """ Runs data processing scripts to turn raw data from (../raw) into
        cleaned data ready to be analyzed (saved in ../processed).
    """
    logger.info("making final data set from raw data")
    df = pd.DataFrame({})
    for file in Path(input_filepath).glob("scp-*.txt"):
        logger.info("File: %s", file)
        with file.open() as f:
            try:
                article = Article.from_text(f.readlines())
            except ValueError as e:
                logger.warning("ValueError in file %s: %s", file, e)
                continue
        df = df.append(article.to_dict(), ignore_index=True)
    logger.info("DataFrame extracted. Writing to data.json in %s", output_filepath)
    df.to_json(Path(output_filepath) / "data.json")
    logger.info("Done.")


if __name__ == "__main__":
    main()

Note that we are emitting log warning messages whenever we encounter an unknown label but still continue with the processing.

Exercises

Just like in the last blog post, you can rapidly tackle an exercise by using git tags. For instance, if you want to tackle the first exercise, issue the command git checkout ex-15 and start coding. If you want to compare your solution to mine, issue git diff sol-ex-15 when you have finished.

Git tag: ex-15   beginner

Add a __repr__ method to the Article class.

Git tag: ex-16   beginner

Add another test for the from_text method. The input shall be the same as in the test_from_text method except that you will leave out the name (“Some name ”). Consequently, assert that the name of the resulting Article instance will be an empty string.

Git tag: ex-17   intermediate

Unfortunately, there are some SCP articles that slightly diverge from the usual naming convention for their parts. For instance, SCP-524 has a Special Containment Procedure (singular!), SCP-2944 has Secure Containment Procedures, and SCP-931 consists of haiku paragraphs. While we could certainly be a little more thorough when parsing them, I will ignore them for the rest of this blog post series (I encountered 130 warnings when parsing the first 3000 SCP articles, which amounts to roughly 4% of incorrectly parsed articles). However, if you want to, feel free to optimize the parsing procedure. For starters, allow the “Description” part to start with either “Description:” or “Summary:”. Do not forget to write tests!

Git tag: ex-18   intermediate

Raise RuntimeErrors whenever the Containment Procedures or the Description cannot be extracted. Catch these RuntimeErrors in make_dataset.py, log the error and continue with the for loop without adding the article to the DataFrame. Finally, add another test article in test_make_dataset.py with an unexpected beginning of the description and tests to test_article_data.py to make sure these RuntimeErrors are actually raised.

Quick analysis of the lengths

After we have extracted the (character) lengths of the two parts of the SCP articles, let us analyze them. We will use pandas to load the json file and compute some basic statistical measures.

Open Jupyter Lab (either by opening a terminal and issuing the command jupyter lab, or by opening Anaconda, switching to the environment for the SCP project and opening Jupyter Lab there), navigate to the notebooks folder of the SCP project and click the “+” icon above the folder breadcrumbs to fire up a new launcher.

[Screenshot: the JupyterLab launcher]

In the opening Launcher tab, choose a Python 3 Notebook. Now you are all set up to experiment with data interactively. The following is a transcript of my Jupyter notebook session.

Computing statistics of the lengths

We want to check that all the transformations we have done so far are sane so that we can work with a cleaned up dataset.

  import pandas as pd

  df = pd.read_json("../data/processed/data.json")
  df.head()
Table 1: Out[1]
  Description Description_Length Label Name Procedures Procedures_Description_Ratio Procedures_Length
0 SCP-1256 is a 24-page pamphlet entitled ’Bees … 1837 SAFE Item #: SCP-1256 Mobile Task Force Zeta-4 (’Beekeepers’) is cur… 0.224279 412
1 SCP-2987 is a modified MSI brand external hard… 2187 SAFE Item #: SCP-2987 SCP-2987 is to be kept on floor 17 of Site-88…. 0.203475 445
2 SCP-2039 collectively refers to two distinct f… 5399 EUCLID Item #: SCP-2039 Presently, Foundation efforts at Research Faci… 0.368772 1991
3 SCP-1530 is a two-story abandoned house locate… 3893 EUCLID Item #: SCP-1530 SCP-1530 is currently contained 120 meters fro… 0.201387 784
4 SCP-1524 is the sole remaining specimen of a s… 3211 EUCLID Item #: SCP-1524 Both of SCP-1524’s individual components are t… 0.530364 1703

Let’s look at some statistics of the extracted text lengths and the ratio.

  df.describe()
Table 2: Out[2]
  Description_Length Procedures_Description_Ratio Procedures_Length
count 2700.000000 2700.000000 2700.000000
mean 3208.542222 0.286840 777.595556
std 1658.345674 0.293568 519.808074
min 61.000000 0.000000 0.000000
25% 2104.750000 0.145726 414.750000
50% 2887.000000 0.229935 656.500000
75% 3957.000000 0.353646 994.250000
max 31618.000000 7.377049 7922.000000

Whereas count, mean, min and max are self-explanatory, std stands for standard deviation. The rows with percentages are the 25%-, 50%-, and 75%-quantiles, respectively. They were defined in my blog post on means and medians. Here’s a short refresher: the 25%-quantile is a value such that 25% of the data is smaller than or equal to it and the other 75% of the data is greater than or equal to it. The 50%-quantile is also known as the median.
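As a quick cross-check, the same numbers can also be computed one at a time; a minimal sketch using the column names from above:

  df["Procedures_Length"].median()        # matches the 50% row of describe()
  df["Procedures_Length"].quantile(0.25)  # the 25%-quantile
  df["Procedures_Length"].std()           # the standard deviation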

The minimum of 61 characters in Description_Length looks reasonable but a Containment Procedure with 0 characters? This has to be investigated. Before we do so, let us look at the same statistics but grouped by each label.

  df.groupby("Label").describe().stack()
Table 3: Out[3]
    Description_Length Procedures_Description_Ratio Procedures_Length
Label        
EUCLID count 1274.000000 1274.000000 1274.000000
  mean 3244.361852 0.308139 855.422292
  std 1701.660229 0.273383 529.896660
  min 428.000000 0.011165 148.000000
  25% 2179.250000 0.169438 497.250000
  50% 2935.500000 0.259065 727.000000
  75% 3977.750000 0.371186 1075.750000
  max 31618.000000 6.051948 7922.000000
KETER count 314.000000 314.000000 314.000000
  mean 3380.487261 0.401208 1128.343949
  std 1694.007237 0.328462 605.260134
  min 233.000000 0.000000 0.000000
  25% 2243.000000 0.218239 683.250000
  50% 3197.500000 0.332694 1028.000000
  75% 4192.250000 0.486212 1449.750000
  max 10141.000000 3.781726 3449.000000
SAFE count 1112.000000 1112.000000 1112.000000
  mean 3118.951439 0.230143 589.388489
  std 1592.721215 0.293088 392.807626
  min 61.000000 0.010626 64.000000
  25% 2003.000000 0.118879 326.000000
  50% 2791.500000 0.178353 488.500000
  75% 3860.750000 0.277565 730.500000
  max 12331.000000 7.377049 3680.000000

This is where it starts to get interesting! As safe SCPs are much easier to contain than euclid ones, which in turn are easier to contain than keter SCPs, we expect the Containment Procedures to be easier to describe for safe SCPs and to require more elaborate descriptions for keter ones. This is reflected in the mean length of the Containment Procedures (roughly 589 for safe, 855 for euclid and 1128 for keter).

Let us turn to the problematic cases of zero lengths.

  df.loc[(df["Procedures_Length"] == 0) | (df["Description_Length"] == 0)]
Table 4: Out[4]
  Description Description_Length Label Name Procedures Procedures_Description_Ratio Procedures_Length
1340 SCP-1994 is the general designation for a set … 1376 KETER Item #: SCP-1994   0.0 0

Thankfully, this is a single outlier. Investigating the article on the SCP Foundation web page and inspecting the HTML reveals that the label “Special Containment Procedures” sits in its own p element, which is why we were not able to crawl this article correctly.

Let us ignore the outlier.

  df = df.loc[df["Procedures_Length"] > 0]

Finally, let us compute correlations between our features and the target. The correlation coefficient may be computed for number-valued random variables only. Fortunately, the nominal labels safe, euclid, and keter carry ordinal information. That is to say, we can order them by their containment complexity. To make this even more explicit, let us assign numbers to the three labels. A safe label will be converted to -1, a euclid label to 0 and a keter label to 1 so that the order of the containment complexity is reflected by \(\mathrm{safe} < \mathrm{euclid} < \mathrm{keter}\). However, the magnitude of this conversion is still open for discussion. We could also have chosen \(10^{100}\) for keter and this would have influenced the correlation coefficients. But let’s stick to our simple way of converting for now.

  COMPLEXITY = {
      "SAFE": -1,
      "EUCLID": 0,
      "KETER": 1
  }

  def compute_complexity(label):
      return COMPLEXITY[label]

  df["Complexity"] = df["Label"].apply(compute_complexity)
  df.corr()
Table 5: Out[6]
  Description_Length Procedures_Description_Ratio Procedures_Length Complexity
Description_Length 1.000000 -0.293831 0.220675 0.052532
Procedures_Description_Ratio -0.293831 1.000000 0.577548 0.188953
Procedures_Length 0.220675 0.577548 1.000000 0.344329
Complexity 0.052532 0.188953 0.344329 1.000000

As it turns out, Complexity and Procedures_Length are positively correlated which is precisely what we have observed through the statistics that we have grouped by label. We also see that Description_Length is only very weakly correlated with Complexity: That is to say that there is no reason why, say, a safe SCP should not have a long description or why a keter SCP could not be described in a short manner.

Mathematical background behind TF-IDF

Before we get to enjoy the ease of use of sklearn’s Transformer API to apply the TF-IDF transformation, let us try to gain some understanding of it first. The easiest way to turn words into numbers is to count them. As simple as this sounds, this idea is the cornerstone of the TF-IDF transformation.

Word count vectors

Let me make this more precise. Assume that we have articles \(A_1, \dotsc, A_n\). The vocabulary \(\mathcal{V} = \mathcal{V}_{A_1, \dotsc, A_n}\) is the set of unique words occurring in those articles. The set of all of our articles will also be called the document. For any article \(A\), the word count function is \[ \mathrm{wc}_{A}\colon \mathcal{V} \to \mathbb{N}, \, w \mapsto \text{Number of times \(w\) occurs in \(A\)}. \]

Once we have fixed the vocabulary, it is possible to turn the word count functions into word count vectors. First, we merely need to decide on an order of the vocabulary—the alphabetic ordering is a canonical choice! As soon as we have ordered the vocabulary, we may write it as \(\mathcal{V} = \{w_1, \dotsc, w_m\}\), where \(m\) is the total number of words in the vocabulary. Finally, we declare that the word count vector of the article \(A\) is \[ v_A = \bigl(\mathrm{wc}_A(w_1), \dotsc, \mathrm{wc}_A(w_m)\bigr). \]
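To make the definition concrete, here is a minimal sketch in plain Python; the helper name word_count_vector and the toy article are mine and not part of the project code.

from collections import Counter


def word_count_vector(article, vocabulary):
    """Count how often each vocabulary word occurs in the article."""
    counts = Counter(article.lower().split())
    return [counts[word] for word in sorted(vocabulary)]


article = "the cat sat on the mat"
vocabulary = set(article.split())
word_count_vector(article, vocabulary)
# [1, 1, 1, 1, 2] for the alphabetically ordered vocabulary (cat, mat, on, sat, the)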

Normalizing word counts

Instead of only talking about single words, we can more generally deal with terms. These can also be sequences of words that appear in our documents. Depending on how long those sequences are, they may carry information about the context that cannot be inferred from single words alone. Consequently, the word count vectors from before may more generally be called term count vectors (even though this does not seem to be standard language).

In general, vectors have the advantage of being comparable by using distance measures such as the Euclidean distance. Depending on the variety of articles and the precise application, however, there might be a problem. To make this concrete, let me illustrate it with a simple artificial example. Take the sentence “I am sure I am right about this” and transform it into a word count vector. Using alphabetic ordering, you should end up with the following.

word    about  am  I  right  sure  this
count   1      2   2  1      1     1

Let us add a second “article” to our document. The text consists of that same sentence twice. Thus, we obtain

word    about  am  I  right  sure  this
count   2      4   4  2      2     2

as its word count vector.

If you want to consider these two articles as similar or even the same, you will need to normalize the vectors. This means that before comparing two word count vectors you will divide them by their length. In this case, this approach will lead to the two articles being seen as the same. However, there are also reasons for wanting to tell these two articles apart: Even if they deal with the same point, the second one puts stronger emphasis on it by repetition.

To sum it up, depending on the application you might want to think about normalization of your word count vectors. In any case, the resulting word count vectors will be called term frequencies (even if you did not normalize) from now on. This concludes the first half of the TF-IDF transformation.
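As a small illustration with numpy, using the two count vectors from the tables above:

import numpy as np

v1 = np.array([1, 2, 2, 1, 1, 1])  # "I am sure I am right about this"
v2 = np.array([2, 4, 4, 2, 2, 2])  # the same sentence, repeated twice

# After dividing by their lengths, both vectors are identical.
np.allclose(v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2))
# True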

Inverse document frequency

Term frequency vectors suffer from another problem: words such as “a”, “the”, and “is” that occur in almost every English text but do not carry much meaning will influence the similarity of articles. There are two ways to deal with this problem (in fact, they are often used in conjunction). The first is to simply ignore those words: create a list of so-called stop words that will be ignored when building the vocabulary. The second way is to penalize words occurring in almost every article and boost rare words. This is precisely what the inverse document frequency is doing.

Before we arrive at the precise definition, let us look at it from another angle. A word count is a measure that is local to a single article. This means that it does not depend on other articles in our document. If a single word count is high in that article, this might mean that this is an important word that potentially helps characterizing the article. However, if this word count is equally high in all the other articles then this word does not help us telling this article apart from the others (if everything is special then nothing is). Thus, there is the need for a trade-off between local measures (such as a single word count of a certain article) and global measures.

Inverse document frequency is such a global measure. To define it, let us concentrate on a single word \(w\) in our vocabulary. The document frequency \(\mathrm{df}(w)\) is the number of articles that \(w\) appears in. Consequently, the inverse document frequency is \(1/\mathrm{df}(w)\).

Now we are able to describe the TF-IDF transformation as a whole: For any article \(A\) in the document, multiply each word count in its word count vector with the inverse document frequency. In formulae: \[ \mathrm{TFIDF}(A) = \left(\frac{\mathrm{wc}_A(w_1)}{\mathrm{df}(w_1)}, \dotsc, \frac{\mathrm{wc}_A(w_m)}{\mathrm{df}(w_m)}\right). \]
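As a sanity check of the formula, here is a deliberately simplified sketch in plain Python; the function name tfidf and the toy documents are mine, and sklearn’s implementation differs in details such as smoothing, logarithms and normalization.

from collections import Counter


def tfidf(articles):
    """Simplified TF-IDF: word counts divided by document frequencies."""
    tokenized = [article.lower().split() for article in articles]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    # Document frequency: in how many articles does the word appear?
    df = {word: sum(word in tokens for tokens in tokenized) for word in vocabulary}
    return [
        [Counter(tokens)[word] / df[word] for word in vocabulary]
        for tokens in tokenized
    ]


tfidf(["the cat sat on the mat", "the dog sat on the log"])
# Shared words ("the", "sat", "on") are divided by a document frequency of 2.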

Applying the TF-IDF transformation (Transcript of jupyter notebook session)

Before we apply the TF-IDF transformation, it is obligatory to put aside some test data for evaluating our model later. Otherwise, a future Machine Learning model would have access to statistics of the entire dataset and may deduce statistics of the test dataset afterwards. However, the entire purpose of the train-test-split is to evaluate the model on data it has not seen before.

  import pandas as pd

  df = pd.read_json("../data/processed/data.json")
  df = df.loc[df["Procedures_Length"] > 0, [
      "Label",
      "Procedures",
      "Description",
      "Procedures_Length",
      "Description_Length",
      "Procedures_Description_Ratio"
  ]]

Making a train-test-split

With sklearn, splitting a DataFrame reduces to calling the train_test_split function from the model_selection module. The test_size argument determines the relative size of the test set.

  from sklearn.model_selection import train_test_split

  X, y = df.drop(columns=["Label"]), df["Label"]
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Note that we split up our target column Label from the rest so that it will not be included in the following transformations.

Fitting TfidfVectorizers

Since we have two text columns (Procedures and Description), it is best to fit two TfidfVectorizers so that the information contained in each of them is preserved separately. The rest of the features should be scaled, as certain models encounter numerical problems when two features are on very different scales (that is to say, one feature is usually very large, e.g. \(\gg 10^6\), while another only attains values between 0 and 1). To do all of this in one go, sklearn provides us with a ColumnTransformer that takes a list of tuples, each consisting of a name, a transformer and the column that the transformer should be applied to. Additionally, the ColumnTransformer’s remainder keyword argument may be another transformer that will be applied to the remaining columns. Here’s how to use it:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.compose import ColumnTransformer
  from sklearn.preprocessing import StandardScaler


  columnwise_tfidf = ColumnTransformer(
      [
          (
              "procedures",
              TfidfVectorizer(),
              "Procedures"
          ),
          (
              "desc",
              TfidfVectorizer(),
              "Description"
          )
      ],
      remainder=StandardScaler(),
      n_jobs=-1,
  )

First, the initial item in each tuple is a name for the transformation for later reference. Second, the TfidfVectorizer with default arguments constructs the TF-IDF vectors in much the same way as explained in the mathematical background above; the differences are technical details, for instance the document frequency of each word is increased by one to prevent zero divisions, and the resulting vectors are normalized. Third and last, the StandardScaler scales the remaining features such that they have zero mean and unit standard deviation.
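For intuition, the scaling that the StandardScaler performs per column boils down to the following sketch (my own illustration, not sklearn’s actual implementation):

  import numpy as np

  def standardize(column):
      """Shift to zero mean and scale to unit standard deviation."""
      return (column - np.mean(column)) / np.std(column)

  standardize(np.array([100.0, 200.0, 300.0]))
  # array([-1.22474487,  0.        ,  1.22474487])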

Applying this ColumnTransformer to our train set follows the usual sklearn API. Each Transformer has fit and transform methods. Here, the first is used solely on the train set to fit the Transformer. Afterwards, the second may be used to transform both the train and test set.

  columnwise_tfidf.fit(X_train)
  X_train_transformed = columnwise_tfidf.transform(X_train)

Conveniently, most transformers have a fit_transform method that combines these two steps into one:

  X_train_transformed = columnwise_tfidf.fit_transform(X_train)

Extracting keywords

Let us use the fitted transformers to extract keywords from articles. First, we will extract the vocabulary as determined by the TfidfVectorizers. To distinguish between the words from the Procedures and the Description, we will prepend each of them with a prefix.

  def vocabulary():
      return (
          [f"proc__{name}" for name in columnwise_tfidf.named_transformers_["procedures"].get_feature_names()]
          + [f"desc__{name}" for name in columnwise_tfidf.named_transformers_["desc"].get_feature_names()]
      )

Note that the names we have provided for the TfidfVectorizers earlier now come into play.

Second, let’s write a function accepting an article and returning a DataFrame containing the words with the highest frequencies.

  def extract_keywords(article, topn=10):
      article_transformed = columnwise_tfidf.transform(article).toarray()[0]
      frequencies = list(zip(vocabulary(), article_transformed))
      frequencies.sort(key=lambda x: -x[1])
      return pd.DataFrame(frequencies[:topn])

Finally, let’s extract keywords from one of the most iconic SCP articles: The one for SCP-682. This is one of the best examples of Keter class SCPs.

  scp_682 = df.loc[df["Description"].str.startswith("SCP-682")].drop(columns=["Label"])
  extract_keywords(scp_682)
Table 6: Out[8]
  0 1
0 proc__682 0.767357
1 desc__kia 0.738121
2 desc__682 0.523255
3 desc__agent 0.171312
4 desc__personnel 0.156161
5 proc__speak 0.153737
6 proc__acid 0.144138
7 proc__to 0.133515
8 desc__pvt 0.110179
9 proc__scp 0.107281

This does not look too promising. First, numbers should perhaps be ignored. Then, there are words such as “to” and “of” that appear in almost every English article. “speak” might not be telling us much either. This will only get worse if we look at the top 30 keywords.

  extract_keywords(scp_682, topn=30)
Table 7: Out[9]
  0 1
0 proc__682 0.767357
1 desc__kia 0.738121
2 desc__682 0.523255
3 desc__agent 0.171312
4 desc__personnel 0.156161
5 proc__speak 0.153737
6 proc__acid 0.144138
7 proc__to 0.133515
8 desc__pvt 0.110179
9 proc__scp 0.107281
10 desc__handled 0.106319
11 proc__attempts 0.098297
12 proc__reacted 0.095920
13 desc__occurrence 0.095232
14 proc__incapacitation 0.091120
15 proc__of 0.090828
16 proc__fear 0.087715
17 proc__rage 0.087715
18 proc__hydrochloric 0.085073
19 proc__massive 0.085073
20 proc__frequent 0.082915
21 proc__provoking 0.082915
22 proc__breach 0.082463
23 desc__scp 0.081648
24 proc__should 0.080923
25 proc__lining 0.079510
26 proc__called 0.078116
27 proc__incapacitated 0.078116
28 proc__force 0.078011
29 proc__destroying 0.076869

Fine-tuning the TfidfVectorizer

Fortunately, TfidfVectorizer has a lot of options to fine-tune its behavior. First and maybe most importantly, we can enforce that certain words should be ignored via the stop_words keyword argument. It either expects the string “english” and then uses a list constructed by the sklearn developers (with its own set of disadvantages), or it expects a list of strings containing the words that shall be ignored. Second, we can specify a regex pattern via the token_pattern keyword argument. This pattern will be used when parsing the articles to build up the vocabulary. The default pattern matches single words containing letters and numbers; we will modify it to match only words consisting of letters.

  columnwise_tfidf = ColumnTransformer(
      [
          (
              "procedures",
              TfidfVectorizer(
                  stop_words="english",
                  strip_accents='unicode',
                  token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b',
              ),
              "Procedures"
          ),
          (
              "desc",
              TfidfVectorizer(
                  stop_words="english",
                  strip_accents='unicode',
                  token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b'
              ),
              "Description"
          )
      ],
      remainder=StandardScaler()
  )

  columnwise_tfidf.fit(X_train)

Listing 1. Out[10].

  ColumnTransformer(n_jobs=None,
                    remainder=StandardScaler(copy=True, with_mean=True,
                                             with_std=True),
                    sparse_threshold=0.3, transformer_weights=None,
                    transformers=[('procedures',
                                   TfidfVectorizer(analyzer='word', binary=False,
                                                   decode_error='strict',
                                                   dtype=<class 'numpy.float64'>,
                                                   encoding='utf-8',
                                                   input='content',
                                                   lowercase=True, max_df=1.0,
                                                   max_features=None, min_df=1...
                                                   dtype=<class 'numpy.float64'>,
                                                   encoding='utf-8',
                                                   input='content',
                                                   lowercase=True, max_df=1.0,
                                                   max_features=None, min_df=1,
                                                   ngram_range=(1, 1), norm='l2',
                                                   preprocessor=None,
                                                   smooth_idf=True,
                                                   stop_words='english',
                                                   strip_accents='unicode',
                                                   sublinear_tf=False,
                                                   token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b',
                                                   tokenizer=None, use_idf=True,
                                                   vocabulary=None),
                                   'Description')],
                    verbose=False)

  extract_keywords(scp_682, topn=30)
Table 8: Out[11]
  0 1
0 desc__kia 0.890278
1 proc__speak 0.272335
2 proc__acid 0.255331
3 desc__agent 0.206627
4 proc__scp 0.190041
5 desc__personnel 0.188352
6 proc__attempts 0.174127
7 proc__reacted 0.169915
8 proc__incapacitation 0.161413
9 proc__fear 0.155381
10 proc__rage 0.155381
11 proc__hydrochloric 0.150702
12 proc__massive 0.150702
13 proc__frequent 0.146879
14 proc__provoking 0.146879
15 proc__breach 0.146078
16 proc__lining 0.140847
17 proc__called 0.138377
18 proc__incapacitated 0.138377
19 proc__force 0.138192
20 proc__destroying 0.136168
21 proc__containment 0.135959
22 desc__pvt 0.132891
23 proc__difficulty 0.132345
24 proc__submerged 0.132345
25 proc__best 0.130666
26 desc__handled 0.128236
27 proc__chamber 0.126861
28 proc__plate 0.125041
29 proc__development 0.123843

This looks much better. A few remarks:

  • I had to google for the two abbreviations “kia” and “pvt”. The first is the abbreviation for “killed in action” while the second stands for the military rank “Private”.
  • On second thought, “speak” may contain the information that the SCP object is able to speak and, thus, might hint at it being sapient. As sapient SCPs are probably more likely to be of class euclid or keter, this could be valuable information for a Machine Learning model.
  • One could start building a custom list of stop words more suitable for parsing SCP articles. In the list above, the words “best” and “called” as well as “scp” could be ignored; see the sketch below for how such a list could be plugged in. I will postpone this to the next part of this series of posts. Because some models give some insight into their learning process, we can use them to see if their decisions are based on filler words.
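Here is a minimal sketch of how such a custom list could be passed to the vectorizers; the extra stop words are only the candidates mentioned in the remarks above.

  from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

  # Candidate domain-specific additions on top of sklearn's built-in English list.
  SCP_STOP_WORDS = list(ENGLISH_STOP_WORDS.union({"scp", "best", "called"}))

  TfidfVectorizer(
      stop_words=SCP_STOP_WORDS,
      strip_accents="unicode",
      token_pattern="(?u)\\b[a-zA-Z][a-zA-Z]+\\b",
  )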

Conclusion

In this blog post, we have learned how to use Jupyter Notebooks and the pandas library to extract basic statistics from SCP articles. Furthermore, we have used a basic TF-IDF transformation to extract keywords from SCP articles.

Effective Python 2nd Edition: What’s new?

Effective Python by Brett Slatkin is a book filled with best practices for the Python programming language. I devoured the first edition as an ebook and was eager to buy the second edition as a physical book. Having skimmed through it, I was already satisfied: the new edition features updated items and removes all Python 2 specific hints and workarounds. More specifically, it concentrates on language features of Python 3 up to and including Python 3.8.

However, I was a little confused that there was no tabular information about the additions and the changes. Certainly, it looks like each item was changed in one way or another (as most items in the first edition contained workarounds for Python 2). But if you are like me, owning the first edition of the book and wondering if the new content will be worth it, this blog post has you covered.

In case you want to do the comparison yourself, there is a table of contents of the 2nd edition on the main page of the official web site of the book.

Entirely new items in Effective Python 2nd Edition

There are 27 entirely new items.

  • 4. Prefer Interpolated F-Strings Over C-style Format Strings and str.format
  • 6. Prefer Multiple Assignment Unpacking Over Indexing
  • 10. Prevent Repetition with Assignment Expressions
  • 13. Prefer Catch-All Unpacking Over Slicing
  • 14. Sort by Complex Criteria Using the key Parameter
  • 15. Be Cautious When Relying on dict Insertion Ordering
  • 16. Prefer get Over in and KeyError to Handle Missing Dictionary Keys
  • 17. Prefer defaultdict Over setdefault to Handle Missing Items in Internal State
  • 18. Know How to Construct Key-Dependent Default Values with __missing__
  • 19. Never Unpack More Than Three Variables When Functions Return Multiple Values
  • 29. Avoid Repeated Work in Comprehensions by Using Assignment Expressions
  • 33. Compose Multiple Generators with yield from
  • 34. Avoid Injecting Data into Generators with send
  • 35. Avoid Causing State Transitions in Generators with throw
  • 36. Consider itertools for Working with Iterators and Generators
  • 51. Prefer Class Decorators Over Metaclasses for Composable Class Extensions
  • 56. Know How to Recognize When Concurrency Is Necessary
  • 57. Avoid Creating New Thread Instances for On-demand Fan-out
  • 58. Understand How Using Queue for Concurrency Requires Refactoring
  • 59. Consider ThreadPoolExecutor When Threads Are Necessary for Concurrency
  • 61. Know How to Port Threaded I/O to asyncio
  • 62. Mix Threads and Coroutines to Ease the Transition to asyncio
  • 63. Avoid Blocking the asyncio Event Loop to Maximize Responsiveness
  • 74. Consider memoryview and bytearray for Zero-Copy Interactions with bytes
  • 79. Encapsulate Dependencies to Facilitate Mocking and Testing
  • 89. Consider warnings to Refactor and Migrate Usage
  • 90. Consider Static Analysis via typing to Obviate Bugs

Items heavily updated in Effective Python 2nd Edition

Most updates to existing items merely ensure that the code samples cover Python 3.7, with some notable exceptions that show Python 3.8 exclusive samples (the most prominent being the introduction of the walrus operator in items 10 and 29). Most noteworthy, the first item, Know Which Version of Python You’re Using, already makes it clear at the end of the first paragraph:

This book does not cover Python 2.

Keeping this in mind, we expect to see some updates. For instance, Item 3: Know the Differences Between bytes and str does not mention the Python 2 exclusive unicode anymore. Even so, there are certain items that have been updated so much that they contain new advice because of newly added language features. In particular, I want to mention the following items in this regard.

  • 48. Validate Subclasses with __init_subclass__ (former Item 33: Validate Subclasses with Metaclasses)
  • 49. Register Class Existence with __init_subclass__ (former Item 34: Register Class Existence with Metaclasses)
  • 50. Annotate Class Attributes with __set_name__ (former Item 35: Annotate class attributes with Metaclasses)
  • 60. Achieve Highly Concurrent I/O with Coroutines (former Item 40: Consider Coroutines to Run Many Functions Concurrently)

While the first three items in this list introduce new language features restricting the use cases for Metaclasses, the last one was updated to show the new way of defining Coroutines using the asyncio built-in module. Since this last item and the preceding items 56-59 feature Conway’s Game of Life as an example, you might also argue that item 60 belongs in the next section.

Items that have been split up into multiple more elaborate items

There are two items from the first edition that have been split into multiple items in the 2nd edition:

  • Item 46: Use Built-In Algorithms and Data Structures has been split up into
    • 71. Prefer deque for Producer–Consumer Queues
    • 72. Consider Searching Sorted Sequences with bisect
    • 73. Know How to Use heapq for Priority Queues
  • Item 56: Test Everything with Unittest has been split up into
    • 76. Verify Related Behaviors in TestCase Subclasses
    • 77. Isolate Tests from Each Other with setUp, tearDown, setUpModule, and tearDownModule
    • 78. Use Mocks to Test Code with Complex Dependencies

Item 46 from the first edition had more of an overview character, introducing data types from built-in modules. The corresponding items from the 2nd edition are more elaborate and provide an in-depth view with more code samples and example use cases.

Considering the importance of testing generally and particularly in dynamically typed languages such as Python, I found item 56 from the first edition to be too brief. In contrast, the new items are much more elaborate on the best practices in testing. Above all, item 78 about Mocks is a precious addition, giving an example of how to use the unittest.mock built-in module to write unit tests for code depending on a database connection.

Conclusion

To sum it up, Effective Python 2nd Edition adds a lot of new content in comparison to its first edition. More precisely, there are 27 entirely new items. Additionally, two items have been split up into multiple, more elaborate items so that the new edition clocks in at 90 items in comparison to 59 items from the first edition.