The job system of Octopath Traveler and the symmetric group

Octopath Traveler is an RPG where you take control of eight characters, each having a unique job with a unique set of skills. In the course of the game you may give each character a secondary job (but you can’t endow them with the same job twice). Since a party may only consist of four of these characters and it is advantageous to have access to abilities from all eight jobs, I wondered: Given a distribution of secondary jobs, is it always possible to find a party of four characters having access to all eight jobs?

It was possible for the distribution of secondary jobs I had used, but in general the answer is no. Seeing why, and finding out what kinds of distributions give rise to a favorable party, required me to take a deep dive into basic abstract algebra again.

This blog post is a two-fold testimony: For one, it marks a new peak in my nerdiness; for another, it shows that even after years of studying maths it is still possible to stumble upon interesting problems in very basic areas.

A gentle motivation for the symmetric group

The eight jobs in Octopath Traveler are as follows.

  1. Hunter
  2. Warrior
  3. Scholar
  4. Cleric
  5. Dancer
  6. Apothecary
  7. Thief
  8. Merchant

Choosing secondary jobs needs to follow two rules:

  1. No secondary job may be chosen twice.
  2. The secondary job must not coincide with the primary job.

Let us also agree that we actually want to endow each character with a secondary job; the game does not enforce this in any way, but it is definitely advantageous to do so. In combination with the first rule, this means that each job has to be chosen as a secondary job exactly once. To sum it up, what we want is an assignment that maps each job to a different one. This is a mathematical function that happens to be a bijection because of the first rule. The second rule makes this a special kind of bijection because it must move each element of the source. Furthermore, because we want to search for favorable assignments that leave us with a choice of a party having access to all eight jobs, the bijection becomes even more special.

Mathematical precision

To make the above thoughts mathematically precise, let us first do the common thing mathematicians do and throw away unnecessary details. For starters, let’s replace the concrete job names by numbers \(1, \dotsc, 8\). Then, a distribution of secondary jobs corresponds to a certain bijection \(\{1, \dotsc, 8\} \to \{1, \dotsc, 8\}\). This means that we are looking for certain elements of the symmetric group \(S_{8}\). But why stop at 8? Let us replace 8 by an arbitrary even number \(2n\). Also, to make notation a little easier to read and write, let us define \([2n] := \{1, \dotsc, 2n\}\).

Definition. An element \(\sigma \in S_{2n}\) is bisectable if there is a subset \(B \subset [2n]\) such that \(\sigma(B) = [2n] \setminus B\).

Equivalently, we may also require that \([2n]\) is the disjoint union \(B \cup \sigma(B)\). Because \(\sigma\) induces a bijection \(B \to \sigma(B)\), we deduce that \[ 2n = |[2n]| = |B \cup \sigma(B)| = |B| + |\sigma(B)| = 2|B| \] so that \(|B| = n\) where \(|B|\) denotes the number of elements of the set \(B\).

Let’s validate this abstraction by quickly turning back to Octopath Traveler: If we use a bijection \(\sigma\colon [8] \to [8]\) to assign secondary jobs to primary jobs, we will be able to build up a party of four characters having access to all eight jobs if and only if the bijection is bisectable. Indeed, if we can find such a party, the bijection needs to map their four jobs to the four other jobs so that we can use \(B\) as the set of the four primary jobs to prove that the bijection is bisectable. On the other hand, if the bijection is bisectable, choose \(B\) as in the definition above. Then, build up a party of characters having the four jobs that are elements of \(B\). Because the secondary jobs of those characters are then the elements of \(\sigma(B)\), all eight jobs are present in this party.
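
If you prefer code over prose, the definition is also easy to check by brute force. Here is a minimal Python sketch (my own illustration, not part of the game or of any library) that represents a bijection as a dict and simply tries every subset \(B\) of size \(n\):

from itertools import combinations


def is_bisectable(sigma):
    """Check whether a permutation of {1, ..., 2n}, given as a dict, is bisectable."""
    elements = set(sigma)
    n = len(elements) // 2
    for candidate in combinations(elements, n):
        b = set(candidate)
        if {sigma[x] for x in b} == elements - b:
            return True
    return False


# The derangement (1 2)(3 4)(5 6)(7 8) is bisectable, e.g. via B = {1, 3, 5, 7}:
print(is_bisectable({1: 2, 2: 1, 3: 4, 4: 3, 5: 6, 6: 5, 7: 8, 8: 7}))  # True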

Some basic notations for elements of the symmetric group

Let us embark on our quest for classifying bisectable elements of the symmetric group. First, let me quickly talk about convenient notations for elements of the symmetric group. The first standard way of writing down an element is a table with two rows: The header consists of the numbers \(1, \dotsc, 2n\) and the second row consists of the images of those numbers. Here is an example for the case \(2n = 8\):

1 2 3 4 5 6 7 8
2 3 4 5 6 1 8 7

Using this representation, we see that each number \(1, \dotsc, 2n\) has to appear in the second row exactly once. If we want to validate the rule that the secondary job must not coincide with the primary job, we can simply check that no column contains the same number twice. If we agree to keep the header always in mind, we might even reduce this representation by specifying elements using only the second row containing the values.

The cycle representation of a bijection is more efficient and describes what happens when we apply the bijection repeatedly. We start with one element and repeatedly apply the bijection to it until we return to the initial element. If we have not encountered all elements yet, we choose one of the remaining elements as the new starting point and repeat the procedure until we have encountered them all. During this process, we keep track of all the encountered elements by noting them down and putting brackets around the encountered elements when we return to the initial element. For the bijection above, say, we would start with the element \(1\), apply the bijection, obtain \(2\), and so on until we obtain \(6\) and realize that applying the bijection to it returns us to \(1\). Since we haven’t encountered \(7\) yet, we do the same here and note that it gets mapped to \(8\) which then sends us back to where we started. The way we would then write this down is \[(1\ 2\ 3\ 4\ 5\ 6)(7\ 8).\]
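
This procedure translates directly into a short Python sketch (again my own illustration, using the dict representation from the snippet above):

def cycles(sigma):
    """Return the cycle decomposition of a permutation given as a dict."""
    remaining = set(sigma)
    decomposition = []
    while remaining:
        start = min(remaining)
        remaining.remove(start)
        cycle = [start]
        current = sigma[start]
        while current != start:
            remaining.remove(current)
            cycle.append(current)
            current = sigma[current]
        decomposition.append(tuple(cycle))
    return decomposition


example = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 1, 7: 8, 8: 7}
print(cycles(example))  # [(1, 2, 3, 4, 5, 6), (7, 8)]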

Naturally, this also gives rise to a (directed) graph representation of an element of the symmetric group. The vertices are the numbers \(1, \dotsc, 2n\) and there is an edge from \(x\) to \(y\) if the bijection maps \(x\) to \(y\). For the example above, this yields the following.

cycle_example.png

Observation. Each element of the symmetric group is a composition of disjoint cycles.

We end this section with a brief look into a basic notion of group theory that we will need when classifying bisectable elements.

Definition. The order of an element \(\sigma \in S_{2n}\) is the number of elements in the generated subgroup \(\langle\sigma\rangle := \{\sigma^i \mid i \geq 0\}\).

Lemma. Let \(\rho = (k_1\ \dotsb\ k_\ell)\) be a cycle. Then the order of \(\rho\) coincides with the length \(\ell\).

Proof. First, we show \(\rho^{\ell} = \mathrm{id}\). The cycle representation tells us precisely what happens when we apply \(\rho\) repeatedly; if we apply \(\rho\) \(\ell\) times, we move from the initial element \(\ell\) times to the right, jumping back to the beginning if necessary. Starting with the element \(k_i\), applying \(\rho\) \(\ell-i\) times sends us to \(k_{\ell}\). Applying \(\rho\) once more gives us \(k_1\), applying it a second time yields \(k_2\), so that applying \(\rho\) the remaining \(i\) times returns us to where we started, \(k_i\). All in all, this shows that the order is \(\leq \ell\). Second and last, we need to show that all the elements \(\rho^0, \dotsc, \rho^{\ell-1}\) are different. But this is evident since they map \(k_1\) to different elements: \(\rho^i(k_{1}) = k_{i+1}\) for \(0 \leq i \leq \ell - 1\).

Which elements are bisectable?

With cycle notation and the graph representation, we can quickly see what it means for an element to be bisectable. We need to mark half of the elements in such a way that they point at elements that are not marked. In the example of the last section, we could mark the numbers \(2, 4, 6, 8\) to see that the bijection is, in fact, bisectable. Here is an example of a bijection consisting of two cycles that is not bisectable:

non_bisectable_cycle.png

These two examples already tell the whole story.

Theorem. An element \(\sigma\) of the symmetric group \(S_{2n}\) is bisectable if and only if it is a derangement (i.e., \(\sigma(x) \neq x\) for all \(x \in [2n]\)) and is a composition of disjoint cycles of even lengths.

Proof. If we start with a derangement \(\sigma\) composed of disjoint cycles of even lengths, we may collect every other element of each cycle into a set \(B\). More precisely, let \((k_1\ \dotsb\ k_{2i})\) be a cycle of even length \(2i\). Then, select the elements \(k_2, k_4, \dotsc, k_{2i}\). Repeat this process for each of \(\sigma\)’s cycles. This way, we have selected \(n\) elements that get mapped to a disjoint set \(\sigma(B)\) also containing \(n\) elements, so that \([2n]\) is the disjoint union \(B \cup \sigma(B)\).

For the other implication, I’d like to use some group theory, namely the orbit-stabilizer theorem, because deriving a contradiction directly from an odd-length cycle would require a quite involved argument.

To dive into the theory of group actions, let \(\mathcal{P}([2n])\) be the set of subsets of \([2n]\) and consider the map \[ \cdot\colon S_{2n} \times \mathcal{P}([2n]) \to \mathcal{P}([2n]), \, (\rho, A) \mapsto \rho \cdot A := \rho(A). \] This is a group action, which means that \(\mathrm{id} \cdot A = A\) and \(\rho \cdot (\rho' \cdot A) = (\rho\rho') \cdot A\) for all subsets \(A\) and elements \(\rho, \rho'\).

We restrict this group action to the subgroup \(\langle \sigma \rangle\) generated by our given bisectable element \(\sigma\). Because \(\sigma\) is bisectable, we may choose \(B \in \mathcal{P}([2n])\) such that \(\sigma \cdot B = \sigma(B) = [2n] \setminus B\). Applying \(\sigma\) once more yields \[ \sigma^2 \cdot B = \sigma(\sigma(B)) = \sigma([2n] \setminus B) = [2n] \setminus \sigma(B) = [2n] \setminus ([2n] \setminus B) = B. \] The same equation \(\rho^2 \cdot B = B\) holds for any of \(\sigma\)’s cycles \(\rho\): If \(x \in B\) is not part of the cycle \(\rho\), we have \(\rho(x) = x\) so that \(\rho^2(x) = x \in B\). If \(x \in B\) is part of the cycle then so is \(\rho(x)\) so that \(\rho(x) = \sigma(x)\) and \(\rho^2(x) = \sigma^2(x) \in B\).

This shows that the orbit \(\langle\rho\rangle \cdot B = \{\rho^i \cdot B \mid i \geq 0\}\) consists of the two sets \(B\) and \(\rho(B)\). Note that these two sets are indeed different: Every cycle of \(\sigma\) contains an element \(x \in B\) (otherwise the whole cycle would lie in \(\sigma(B) = [2n] \setminus B\); but \(\sigma\) maps each cycle element to another cycle element, while the computation above shows that \(\sigma\) maps \(\sigma(B)\) back into \(B\)), and for such an \(x\) we have \(\rho(x) = \sigma(x) \in \sigma(B)\), hence \(\rho(x) \notin B\). Now we can apply the orbit-stabilizer theorem to see that \(\rho\) must be of even length: It tells us that \[ 2|\langle\rho\rangle_B| = |\langle\rho\rangle \cdot B| \cdot |\langle\rho\rangle_B| = |\langle\rho\rangle|, \] where \(\langle\rho\rangle_B\) denotes the stabilizer of \(B\). The right-hand side coincides with the order of \(\rho\), which in turn coincides with the length. Thus, the length has to be divisible by two.

To wrap it up we still need to show that \(\sigma\) has to be a derangement. Let \(x \in [2n]\). If \(x \in B\), then \(\sigma(x) \in [2n] \setminus B\) so that \(\sigma(x) \neq x\). If \(x \in [2n] \setminus B\), then \(\sigma(x) \in \sigma^2(B) = B\) so that \(\sigma(x) \neq x\) as well and we are done.
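
The constructive direction of the proof is itself a small algorithm: decompose the permutation into cycles and pick every other element of each cycle. Here is a sketch that reuses the cycles helper from the snippet above:

def bisecting_set(sigma):
    """Construct a bisecting set for a derangement made up of even-length cycles."""
    b = set()
    for cycle in cycles(sigma):
        b.update(cycle[1::2])  # pick k_2, k_4, ..., k_{2i} of each cycle
    return b


example = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 1, 7: 8, 8: 7}
b = bisecting_set(example)
print(sorted(b))                      # [2, 4, 6, 8]
print(sorted(example[x] for x in b))  # [1, 3, 5, 7]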

How many elements are bisectable?

Now that we have understood precisely what kinds of elements are bisectable, we are able to count them. I’ve restricted myself to the case of \(2n = 8\) here. In this case, we are trying to find out how many of the \(14833\) derangements are bisectable (I won’t give a derivation of that number here, but the Wikipedia article on derangements contains one).

Let’s try the following strategy. We order the bisectable elements by the lengths of their cycles. Since we know that we want to obtain a derangement consisting entirely of cycles of even lengths, this boils down to partitioning the number \(8\) into a sum of even numbers. Here are all possibilities:

\begin{align*}
8 &= 8 \\
&= 6 + 2 \\
&= 4 + 4 \\
&= 4 + 2 + 2 \\
&= 2 + 2 + 2 + 2
\end{align*}

Counting \(8\)-cycles

Fixing elements \(k_1, \dotsc, k_{\ell}\), how many possible cycles containing those elements are there? Since the cycle representation \((k_1\ \dotsb\ k_{\ell})\) is stable under shifts, i.e., \((k_{\ell}\ k_1\ \dotsb\ k_{\ell-1})\) represents the same cycle and there are \(\ell\) different shifts, we can argue as follows. There are \(\ell!\) possibilities to write down the elements. Dividing this number by \(\ell\) accounts for the shifts. Thus, we arrive at \((\ell-1)!\) cycles.

Applying this to \(\ell = 8\), we see that there are \(7! = 5040\) different cycles of length \(8\).

Counting compositions of \(6\)- and \(2\)-cycles

First, we need to choose the \(6\) elements that belong to the \(6\)-cycle; the remaining two elements then make up the \(2\)-cycle. There are \(\binom{8}{6} = 28\) ways to choose \(6\) elements from the \(8\) possible ones. For the resulting \(6\)-cycle there are \(5! = 120\) possibilities. Combining these leaves us with \(28 \cdot 120 = 3360\) possibilities.

Counting compositions of two \(4\)-cycles

Again, we need to choose the \(4\) elements that belong to one of the \(4\)-cycles. There are \(\binom{8}{4} = 70\) possibilities. Next, for each of the \(4\)-cycles there are \(3! = 6\) possibilities once the elements are fixed. Last, we need to account for counting each constellation twice: For instance, the representations \((1\ 2\ 3\ 4)(5\ 6\ 7\ 8)\) and \((5\ 6\ 7\ 8)(1\ 2\ 3\ 4)\) represent the same element but we have counted it twice. Thus, we need to divide by \(2\) as a final step. This leaves us with \(70 \cdot 6 \cdot 6 / 2 = 1260\) possibilities.

Counting compositions of a \(4\)-cycle with two \(2\)-cycles.

The pattern should be clear by now:

  • Choose \(4\) elements for the \(4\)-cycle. Possibilities: \(\binom{8}{4} = 70\).
  • Of the remaining \(4\) elements, choose \(2\) for one of the \(2\)-cycles. Possibilities: \(\binom{4}{2} = 6\).
  • For the \(4\)-cycle there are \(3! = 6\) possibilities once the elements are fixed. For the \(2\)-cycles, there is only \(1! = 1\) each.
  • Account for counting each resulting element twice by dividing by \(2\).

All in all, we obtain \(1260\) possibilities.

Counting compositions of four \(2\)-cycles.

Nothing new in this computation save for the fact that we need to divide by \(4! = 24\) at the end to account for the order of the four \(2\)-cycles: \(\binom{8}{2}\binom{6}{2}\binom{4}{2}\binom{2}{2}/4! = 28 \cdot 15 \cdot 6 \cdot 1 / 24 = 105\).

All in all, we obtain \(105\) possibilities.

Putting it all together

Summing all the possibilities, we obtain \(5040 + 3360 + 1260 + 1260 + 105 = 11025\) bisectable elements in \(S_8\). This means that the majority of the \(14833\) derangements are bisectable! More precisely, if we choose a derangement uniformly at random, the probability that it is bisectable is \(11025/14833 \approx 0.743\), i.e., about \(74.3\%\).
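
If you do not trust the combinatorics, the claim can be verified by brute force in a few seconds, reusing the is_bisectable sketch from the first section (the expected numbers are the ones derived above):

from itertools import permutations

derangements = bisectable = 0
for images in permutations(range(1, 9)):
    sigma = dict(zip(range(1, 9), images))
    if all(x != y for x, y in sigma.items()):
        derangements += 1
        if is_bisectable(sigma):
            bisectable += 1
print(derangements, bisectable)  # expected: 14833 11025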

Some closing thoughts about choosing the bisectable element for Octopath Traveler

If we want to apply this to Octopath Traveler there will be two things to note:

  1. For each cycle, there are two ways to choose which elements should belong to the bisecting set. Thus, to maximize the number of parties having access to all eight jobs, it can be favorable to only use \(2\)-cycles. Considering that the character you started with has to stay in the party for the majority of the game, this leaves you with \(8\) possible parties.
  2. When choosing the \(2\)-cycles, i.e., the pairs of characters that swap jobs, it might be favorable to consider their path actions. For instance, putting the huntress and the warrior into a \(2\)-cycle allows you to use either of them in your party to have access to both huntress and warrior skills as well as use their challenge or provoke path actions. Following this train of thought leads me to the following bisectable element.

    final_element.png

Creating test data with Faker and Factory Boy

Creating test data is essential for data scientists and data engineers, especially when working with large datasets that need to be transformed. It goes without saying that such transformations should be tested thoroughly: You do not want to wait for a few minutes for your transformation to finish only to realize you’ve misspelled the column name, applied the wrong formula or applied the right formula to the wrong columns! Consequently, you need to create test data and write test functions that apply your transformations to it.

Naturally, the process of creating such test data is tedious. Even more naturally, data scientists and data engineers first confronted with this process tried to establish best practices so that following generations do not have to waste time. However, what I found in my Google searches was not really satisfying. Most blog posts recommend using Faker to generate test data. While this certainly is a good starting point, the process of turning the generated test data into DataFrames in those blog posts felt clunky to me. Because I knew that Factory Boy is able to provide factories for data generated by Faker and is used frequently for testing Django apps, I developed the following short and easy-to-apply approach.

Note: The following method is appropriate for generating small- to medium-sized test data. If you want to generate large datasets and performance is critical, you will be better off using mimesis. Additionally, there is a mimesis integration for Factory Boy, so the following method remains feasible for large datasets as well.

Step 1: Prerequisites

Of course, you need to install pandas. Other than that, you do not need to install Faker explicitly; instead, it suffices to install Factory Boy (which in turn has Faker as a dependency). If you use pip or conda, one of the following two commands should suffice.

pip install factory_boy

conda install factory_boy

Step 2: Define a namedtuple containing (a selection of) the features of your dataset

As its name suggests, a namedtuple is an extended version of a plain Python tuple. It is a class with named attributes and utility methods that assist you in constructing instances. Assume that our dataset consists of a name, an account balance (in USD) and a birth date in the format YYYY-MM-DD. Based on this, our namedtuple has to look like this.

from collections import namedtuple


Dataset = namedtuple("Dataset", ["name", "account_balance", "birth_date"])

With only one line of code (well, at least without the import statement), we defined a new class Dataset with three attributes and got a lot of goodies for free. Most importantly, namedtuples are compatible with pandas.DataFrames and with Factory Boy.
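
To make the “goodies” a little more tangible, here is a quick illustration (the concrete values are made up):

sample = Dataset(name="Jane Doe", account_balance=1234.56, birth_date="1990-01-01")
print(sample.name)       # attribute access instead of index access: Jane Doe
print(sample._fields)    # ('name', 'account_balance', 'birth_date')
print(sample._asdict())  # the same data as a dictionary keyed by the field names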

Step 3: Define a Factory that creates test datasets according to your specifications

In this step, Factory Boy and Faker come into play. Using the Factory class and the Faker wrapper from the factory module, our specification for the dataset is as follows.

from factory import Faker, Factory


class DatasetFactory(Factory):
    """Factory creating test datasets"""

    class Meta:
        model = Dataset

    name = Faker("name")
    account_balance = Faker("pyfloat", left_digits=6, right_digits=2)
    birth_date = Faker("date_of_birth", minimum_age=18)

First, we tell our Factory in the inner Meta class what object it shall create by assigning our Dataset class to the model attribute. Second and last, we specify what kind of data belongs to which feature of our dataset using Faker providers. In this case, we tell our Factory that the attribute name shall be a name (adhering to the system locale), that account_balance shall be a float of 6 left digits and 2 right digits (as is usual for most currencies) and, finally, that birth_date shall be a date of birth where the minimum age is 18.

Using the Factory

There are three basic uses of our DatasetFactory. First, to use the Factory with the specifications as-is, simply call the standard constructor with no arguments.

Example output of the DatasetFactory.

In [4]: DatasetFactory()
Out[4]: Dataset(name='Karen Dunn', account_balance=621653.75, birth_date=date(1980, 4, 14))

In [5]: DatasetFactory()
Out[5]: Dataset(name='Karen Murray', account_balance=-97709.61, birth_date=date(1921, 6, 29))

Second, for certain test cases it might be necessary to assign a fixed value to an attribute. In such cases, you may supply appropriate keyword arguments to the constructor.

Fixing values with the DatasetFactory.

In [6]: DatasetFactory(account_balance=-10000)
Out[6]: Dataset(name='Danny Casey', account_balance=-10000, birth_date=date(1998, 6, 14))

Third and last, if you wish to generate a batch of test data the class method create_batch will be your tool of choice. You may also supply fixed values as keyword arguments.

Creating batches.

In [7]: DatasetFactory.create_batch(size=5)
Out[7]:
[Dataset(name='Amanda Dickerson', account_balance=514402.64, birth_date=date(1908, 5, 26)),
 Dataset(name='Katherine Johnson', account_balance=-365522.94, birth_date=date(1907, 12, 12)),
 Dataset(name='Christian Stevenson', account_balance=824680.23, birth_date=date(1983, 8, 12)),
 Dataset(name='Robert Stewart', account_balance=279501.88, birth_date=date(1954, 4, 19)),
 Dataset(name='Melissa Snyder', account_balance=-40896.64, birth_date=date(1941, 1, 6))]

In [8]: DatasetFactory.create_batch(size=3, account_balance=500)
Out[8]:
[Dataset(name='Tanya Hernandez', account_balance=500, birth_date=date(1996, 11, 29)),
 Dataset(name='Samuel Boyd', account_balance=500, birth_date=date(1919, 7, 24)),
 Dataset(name='Jennifer Edwards', account_balance=500, birth_date=date(1978, 1, 5))]

Step 4: Create a test dataframe and supply the DatasetFactory’s output

For the last step, we exploit the fact that DataFrames are compatible with namedtuples. Namely, if you call the DataFrame’s constructor with a list of namedtuples, pandas will create a DataFrame with columns named after the namedtuple’s attributes. As a result, the transformation of a batch of Dataset objects into a DataFrame reduces to one line of code.

import pandas as pd


df = pd.DataFrame(data=DatasetFactory.create_batch(size=10))

Here’s a sample output.

The final result: Our test dataset as a DataFrame.

In [5]: df
Out[5]:
               name  account_balance  birth_date
0    Abigail Joseph       -186809.54  1941-02-12
1      Hannah Brown       -332618.35  1930-08-11
2       Angela Hunt        -60649.82  1905-08-06
3     Shelby Hudson        445009.65  1986-02-24
4       Lori Gordon       -921797.72  1912-10-05
5  Daniel Rodriguez        622570.37  1966-02-14
6      Carol Morris       -964213.50  1914-01-18
7  Jessica Anderson        804757.24  1965-01-06
8  Veronica Edwards       -471469.46  1926-04-22
9      Larry Medina        987186.81  1926-12-12

Additional work after creating the test data

If you want to really make sure that your transformations convert dates correctly you will have to apply an extra step. As it stands now, the column birth_date consists of Python date objects. To convert them to strings of the desired format, you can use the strftime method.

df["birth_date"] = df["birth_date"].apply(lambda d: d.strftime("%Y-%m-%d"))

Classifying SCPs, Part 2: Data transformation (TF-IDF) and preprocessing

After we have obtained data through the means described in the first part of this blog post series, it is time to deal with data transformations and data preprocessing. While humans can comprehend textual information in the form of articles, it is hard to digest for a Machine Learning algorithm. In this blog post, we will transform the textual information into certain vectors assigning a number to each word in the vocabulary of the set of articles: This is what the TF-IDF (Term Frequency – Inverse Document Frequency) is all about.

In comparison to the web crawler post, this one is more mathematical in nature. Instead of evaluating technical approaches and executing them in a test-driven manner, we will have to understand the mathematical background behind the algorithm to put it to good use.

To make the transition as gentle as possible, let us do a warm-up that is closer to the technical spirit of the last post: We use the text files produced by the web crawler to extract the lengths of certain paragraphs and investigate in what way these help us determine the Object Class of an article.

After we have understood how TF-IDF works, we can use it to transform our articles into TF-IDF vectors. Consequently, we will already be able to extract keywords from each article.

Warm-up: Extracting the paragraph lengths

To structure the text from the articles, let us use a custom class.

Basic version of the Article class

class Article(object):
    def __init__(self, label, name, procedures, desc):
        self.label = label.strip()
        self.name = name.strip()
        self.procedures = procedures.strip()
        self.desc = desc.strip()

The logic that splits up the text from the text files into attributes of the class will be a classmethod that accepts a list of lines of text and returns a readily constructed Article instance.

class Article(object):
    # __init__ omitted...

    @classmethod
    def from_text(cls, lines):
        label, *text = lines
        text = "".join(text)
        name, rest = text.split("Special Containment Procedures:")
        procedures, desc = rest.split("Description:")
        return cls(label, name, procedures, desc)

Here’s a basic test that shows how to use the classmethod to obtain an Article instance.

from src.data.article_data import Article


def test_from_text():
    procedures = [
        "Special Containment Procedures: Something...",
        "Something part two...",
    ]
    description = "Description: Something else..."
    article = Article.from_text(["SAFE", "Some name   ", *procedures, description])
    assert article.label == "SAFE"
    assert article.name == "Some name"
    assert article.procedures == "Something...Something part two..."
    assert article.desc == "Something else..."

Validation of the label through a @property

As mentioned in the first part of this series, we are only concentrating on the labels SAFE, EUCLID and KETER. To account for this, we need to validate that the incoming label is one of those. We are a little more lenient and also accept labels that merely start with one of those three labels.

Let us write tests first to define the desired behavior.

import pytest
from src.data.article_data import Article


@pytest.fixture
def article():
    return Article("SAFE", "", "", "")


@pytest.mark.parametrize("label", ["SAFE", "EUCLID", "KETER"])
def test_set_regular_label(article, label):
    article.label = label
    assert article.label == label
    article.label = label + "SOMETHING"
    assert article.label == label


def test_set_unknown_label(article):
    with pytest.raises(ValueError) as excinfo:
        article.label = "unknown"
    assert "unknown" in str(excinfo)

In the tests above, we are using a fixture that gives us an initialized Article instance. Then, we are defining the regular behavior of the setter (we are expecting the label to accept the three main object classes as well as labels that start with those) and what happens when the setter encounters an unknown label (we are expecting a ValueError, enforced via the raises helper).

Because we have not written any validation for the label attribute yet, the tests fail. To account for these kinds of validations, Python has @property decorators that allow for custom getter and setter methods.

class Article(object):

    ALLOWED_LABELS = ("SAFE", "EUCLID", "KETER")

    # __init__ omitted...

    @property
    def label(self):
        return self._label

    @label.setter
    def label(self, orig_label):
        labels = [
            label for label in self.ALLOWED_LABELS if orig_label.startswith(label)
        ]
        if not labels:
            raise ValueError(f"Unknown label '{orig_label}'!")
        self._label = labels.pop()

The Python interpreter calls the method decorated with @label.setter as soon as it encounters the line self.label = label in the __init__ method. As a result, code that uses this class has to deal with ValueErrors when constructing instances.
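
For illustration, here is a tiny sketch of what that means for calling code (the label UNKNOWN matches the test data we will use further below):

from src.data.article_data import Article

try:
    article = Article("UNKNOWN", "Some name", "", "")
except ValueError as error:
    print(error)  # Unknown label 'UNKNOWN'!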

Adding a to_dict method

While the Article class is responsible for extracting information from the articles, it is much easier to use a simple dictionary when persisting extracted information. That is because the json library can serialize Python dictionaries directly; additionally, the pandas Data Science library is able to use dictionaries to construct its main object, the DataFrame. As a result, we need to write a to_dict method that turns an Article instance into a plain dictionary. Aside from the four attributes of the Article class, we also require the dictionary to contain the (character) lengths of the Procedures and the Description as well as the ratio of these two lengths.

def test_to_dict_trivial_article(article):
    d = article.to_dict()
    assert "Label" in d
    assert d["Label"] == "SAFE"
    assert "Name" in d
    assert "Procedures" in d
    assert "Description" in d
    assert "Procedures_Length" in d
    assert d["Procedures_Length"] == 0
    assert "Description_Length" in d
    assert d["Description_Length"] == 0
    assert "Procedures_Description_Ratio" in d
    assert d["Procedures_Description_Ratio"] == 0


def test_to_dict(article):
    article.name = "Test"
    article.procedures = "TestTest"
    article.desc = "Test"
    d = article.to_dict()
    assert "Label" in d
    assert d["Label"] == "SAFE"
    assert "Name" in d
    assert d["Name"] == "Test"
    assert "Procedures" in d
    assert d["Procedures"] == "TestTest"
    assert "Description" in d
    assert d["Description"] == "Test"
    assert "Procedures_Length" in d
    assert d["Procedures_Length"] == 8
    assert "Description_Length" in d
    assert d["Description_Length"] == 4
    assert "Procedures_Description_Ratio" in d
    assert d["Procedures_Description_Ratio"] == 2

The implementation is straightforward and uses a dictionary comprehension.

    def to_dict(self):
        return {
            "Label": self.label,
            "Name": self.name,
            "Procedures": self.procedures,
            "Description": self.desc,
            "Procedures_Length": len(self.procedures),
            "Description_Length": len(self.desc),
            "Procedures_Description_Ratio": len(self.procedures) / len(self.desc)
            if len(self.desc) > 0
            else 0,
        }

Using the Article class to process the txt files

Finally, we want to use the Article class to process the text files. More precisely, we would like to aggregate the articles into a pandas DataFrame. This object has a to_json method that allows us to persist it for later introspection.

First, let us write a test to pin down our expectations.

import pandas as pd
from click.testing import CliRunner
from src.data.make_dataset import main

TEST_DATA = {
    "scp-002.txt": """EUCLID\n
Item #: 002\n
Special Containment Procedures: Something something...\n
Description: Something else...\n
""",
    "scp-003.txt": """UNKNOWN\n
Item #: 003\n
Special Containment Procedures: Something something...\n
Description: Something else...\n
""",
}


def test_main():
    runner = CliRunner()
    with runner.isolated_filesystem():
        for filename, text in TEST_DATA.items():
            with open(filename, "w") as f:
                f.write(text)
        result = runner.invoke(main, [".", "."])
        assert result.exit_code == 0
        df = pd.read_json("data.json")
        assert len(df.index) == 1
        data = df.loc[0]
        assert "Label" in data
        assert data["Label"] == "EUCLID"
        assert "Name" in data
        assert data["Name"] == "Item #: 002"
        assert "Procedures" in data
        assert data["Procedures"] == "Something something..."
        assert "Description" in data
        assert data["Description"] == "Something else..."

Here, we are using the dictionary TEST_DATA to write two files with two mock articles. The first is a regular article with a valid object class, the second one is an article we do not wish to process. As a result, we expect that only one article is present in the processed data. Note that we are using pandas’ read_json method to obtain a DataFrame and, in turn, we are using DataFrame methods to assure that only one article is present and that the article data has been split up correctly.

To make this test pass we have to implement the following strategy:

  1. Start with an empty DataFrame.
  2. Parse each text file in the data/raw folder and turn it into an Article.
  3. Use Article’s to_dict method to append the data to the DataFrame.
  4. Persist the DataFrame to a json file in the data/processed folder.

Here’s a sketch implementation that uses the glob method of pathlib’s Path class to iterate over the text files.

import pandas as pd
from pathlib import Path
from src.data.article_data import Article


df = pd.DataFrame({})
for file in Path("data/raw").glob("scp-*.txt"):
    with file.open() as f:
        article = Article.from_text(f.readlines())
    df = df.append(article.to_dict(), ignore_index=True)
df.to_json("data/processed/data.json")

From a software design perspective, this code leaves a lot to be desired. First, there are no log messages that will come in handy when things go wrong. Second, the paths are hard-coded and should be replaced by function parameters. Third and last, Article’s classmethod from_text raises a ValueError each time it encounters an article with an unknown object class. We have to deal with this kind of situation without letting the entire script fail.

Here’s a revision of the sketch.

import click
import logging.config
import pandas as pd
from pathlib import Path
from src.data.article_data import Article


PROJECT_DIR = Path(__file__).resolve().parents[2]
logging.config.fileConfig(PROJECT_DIR / "logging_config.ini")
logger = logging.getLogger(__name__)


@click.command()
@click.argument("input_filepath", type=click.Path(exists=True))
@click.argument("output_filepath", type=click.Path())
def main(input_filepath, output_filepath):
    """ Runs data processing scripts to turn raw data from (../raw) into
        cleaned data ready to be analyzed (saved in ../processed).
    """
    logger.info("making final data set from raw data")
    df = pd.DataFrame({})
    for file in Path(input_filepath).glob("scp-*.txt"):
        logger.info("File: %s", file)
        with file.open() as f:
            try:
                article = Article.from_text(f.readlines())
            except ValueError as e:
                logger.warning("ValueError in file %s: %s", file, e)
                continue
        df = df.append(article.to_dict(), ignore_index=True)
    logger.info("DataFrame extracted. Writing to data.json in %s", output_filepath)
    df.to_json(Path(output_filepath) / "data.json")
    logger.info("Done.")


if __name__ == "__main__":
    main()

Note that we are emitting log warning messages whenever we encounter an unknown label but still continue with the processing.

Exercises

Just like in the last blog post, you can rapidly tackle an exercise by using git tags. For instance, if you want to tackle the first exercise, issue the command git tag ex-15 and start coding. If you want to compare your solution to mine, issue git diff sol-ex-15 when you have finished.

Git tag: ex-15   beginner

Add a __repr__ method to the Article class.

Git tag: ex-16   beginner

Add another test for the from_text method. The input shall be the same as in the test_from_text method except that you will leave out the name (“Some name ”). Consequently, assert that the name of the resulting Article instance will be an empty string.

Git tag: ex-17   intermediate

Unfortunately, there are some SCP articles that slightly diverge from the usual naming convention for their parts. For instance, SCP-524 has a Special Containment Procedure (singular!), SCP-2944 has Secure Containment Procedures, and SCP-931 consists of haiku paragraphs. While we could certainly be a little more thorough when parsing them, I will ignore them for the rest of this blog post series (I encountered 130 warnings when parsing the first 3000 SCP articles, i.e., roughly 4% incorrectly parsed articles). However, if you want to, feel free to optimize the parsing procedure. For starters, allow the “Description” part to start with either “Description:” or “Summary:”. Do not forget to write tests!

Git tag: ex-18   intermediate

Raise RuntimeErrors whenever the Containment Procedures or the Description cannot be extracted. Catch these RuntimeErrors in make_dataset.py, log the error and continue with the for loop without adding the article to the DataFrame. Finally, add another test article in test_make_dataset.py with an unexpected beginning of the description and tests to test_article_data.py to make sure these RuntimeErrors are actually raised.

Quick analysis of the lengths

After we have extracted the (character) lengths of the two parts of the SCP articles, let us analyze them. We will use pandas to load the json file and compute some basic statistical measures.

Open Jupyter Lab (either by opening a terminal and issuing the command jupyter lab, or by opening Anaconda, switching to the environment for the SCP project and launching Jupyter Lab there), navigate to the notebooks folder of the SCP project and click the “+” icon above the folder breadcrumbs to fire up a new launcher.

scp2_jupyterlab_launcher.png

In the opening Launcher tab, choose a Python 3 Notebook. Now you are all set up to experiment with data interactively. The following is a transcript of my Jupyter notebook session.

Computing statistics of the lengths

We want to check that all the transformations we have done so far are sane so that we can work with a cleaned up dataset.

  import pandas as pd

  df = pd.read_json("../data/processed/data.json")
  df.head()
Table 1: Out[1]
  Description Description_Length Label Name Procedures Procedures_Description_Ratio Procedures_Length
0 SCP-1256 is a 24-page pamphlet entitled ’Bees … 1837 SAFE Item #: SCP-1256 Mobile Task Force Zeta-4 (’Beekeepers’) is cur… 0.224279 412
1 SCP-2987 is a modified MSI brand external hard… 2187 SAFE Item #: SCP-2987 SCP-2987 is to be kept on floor 17 of Site-88…. 0.203475 445
2 SCP-2039 collectively refers to two distinct f… 5399 EUCLID Item #: SCP-2039 Presently, Foundation efforts at Research Faci… 0.368772 1991
3 SCP-1530 is a two-story abandoned house locate… 3893 EUCLID Item #: SCP-1530 SCP-1530 is currently contained 120 meters fro… 0.201387 784
4 SCP-1524 is the sole remaining specimen of a s… 3211 EUCLID Item #: SCP-1524 Both of SCP-1524’s individual components are t… 0.530364 1703

Let’s look at some statistics of the extracted text lengths and the ratio.

  df.describe()
Table 2: Out[2]
  Description_Length Procedures_Description_Ratio Procedures_Length
count 2700.000000 2700.000000 2700.000000
mean 3208.542222 0.286840 777.595556
std 1658.345674 0.293568 519.808074
min 61.000000 0.000000 0.000000
25% 2104.750000 0.145726 414.750000
50% 2887.000000 0.229935 656.500000
75% 3957.000000 0.353646 994.250000
max 31618.000000 7.377049 7922.000000

Whereas count, mean, min and max are self-explanatory, std stands for standard deviation. The rows with percentages are the 25%-, 50%-, and 75%-quantiles, respectively. They were defined in my Blog post on means and medians. Here’s a short refresher: The 25%-quantile is a value such that 25% of the data is smaller than or equal to it and the other 75% of the data is greater than or equal to it. The 50%-quantile is also known as the median.
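
As a small aside, you do not have to read these values off the describe output; pandas can compute the quantiles of a single column directly:

  # 25%-, 50%- and 75%-quantiles of the Procedures length
  df["Procedures_Length"].quantile([0.25, 0.5, 0.75])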

The minimum of 61 characters in Description_Length looks reasonable but a Containment Procedure with 0 characters? This has to be investigated. Before we do so, let us look at the same statistics but grouped by each label.

  df.groupby("Label").describe().stack()
Table 3: Out[3]
    Description_Length Procedures_Description_Ratio Procedures_Length
Label        
EUCLID count 1274.000000 1274.000000 1274.000000
  mean 3244.361852 0.308139 855.422292
  std 1701.660229 0.273383 529.896660
  min 428.000000 0.011165 148.000000
  25% 2179.250000 0.169438 497.250000
  50% 2935.500000 0.259065 727.000000
  75% 3977.750000 0.371186 1075.750000
  max 31618.000000 6.051948 7922.000000
KETER count 314.000000 314.000000 314.000000
  mean 3380.487261 0.401208 1128.343949
  std 1694.007237 0.328462 605.260134
  min 233.000000 0.000000 0.000000
  25% 2243.000000 0.218239 683.250000
  50% 3197.500000 0.332694 1028.000000
  75% 4192.250000 0.486212 1449.750000
  max 10141.000000 3.781726 3449.000000
SAFE count 1112.000000 1112.000000 1112.000000
  mean 3118.951439 0.230143 589.388489
  std 1592.721215 0.293088 392.807626
  min 61.000000 0.010626 64.000000
  25% 2003.000000 0.118879 326.000000
  50% 2791.500000 0.178353 488.500000
  75% 3860.750000 0.277565 730.500000
  max 12331.000000 7.377049 3680.000000

This is where it starts to get interesting! As safe SCPs are much easier to contain than euclid ones, which in turn are easier to contain than keter SCPs, we expect that the Containment Procedures are easier to describe for safe ones and need more elaborate descriptions for keter ones. This is reflected in the mean length of the Containment Procedures (roughly 589 for safe, 855 for euclid and 1128 for keter).

Let us turn to the problematic cases of zero lengths.

  df.loc[(df["Procedures_Length"] == 0) | (df["Description_Length"] == 0)]
Table 4: Out[4]
  Description Description_Length Label Name Procedures Procedures_Description_Ratio Procedures_Length
1340 SCP-1994 is the general designation for a set … 1376 KETER Item #: SCP-1994   0.0 0

Thankfully, this is a single outlier. Investigating the article on the SCP Foundation web page and inspecting the html yields that the label “Special Containment Procedures” sits in its own p element so that we were not able to crawl this article correctly.

Let us ignore the outlier.

  df = df.loc[df["Procedures_Length"] > 0]

Finally, let us compute correlations between our features and the target. The correlation coefficient may be computed for number-valued random variables only. Fortunately, the nominal labels safe, euclid, and keter carry ordinal information. That is to say, we can order them by their containment complexity. To make this even more explicit, let us assign numbers to the three labels. A safe label will be converted to -1, a euclid label to 0 and a keter label to 1 so that the order of the containment complexity is reflected by \(\mathrm{safe} < \mathrm{euclid} < \mathrm{keter}\). However, the magnitude of this conversion is still open for discussion. We could also have chosen \(10^{100}\) for keter and this would have influenced the correlation coefficients. But let’s stick to our simple way of converting for now.

  COMPLEXITY = {
      "SAFE": -1,
      "EUCLID": 0,
      "KETER": 1
  }

  def compute_complexity(label):
      return COMPLEXITY[label]

  df["Complexity"] = df["Label"].apply(compute_complexity)
  df.corr()
Table 5: Out[6]
  Description_Length Procedures_Description_Ratio Procedures_Length Complexity
Description_Length 1.000000 -0.293831 0.220675 0.052532
Procedures_Description_Ratio -0.293831 1.000000 0.577548 0.188953
Procedures_Length 0.220675 0.577548 1.000000 0.344329
Complexity 0.052532 0.188953 0.344329 1.000000

As it turns out, Complexity and Procedures_Length are positively correlated which is precisely what we have observed through the statistics that we have grouped by label. We also see that Description_Length is only very weakly correlated with Complexity: That is to say that there is no reason why, say, a safe SCP should not have a long description or why a keter SCP could not be described in a short manner.

Mathematical background behind TF-IDF

Before we get to enjoy the ease of use of sklearn’s Transformer API to apply the TF-IDF transformation, let’s try to get some understanding of it first. The easiest way to turn words into numbers is to count them. As simple as this sounds, this idea is the cornerstone of the TF-IDF transformation.

Word count vectors

Let me make this more precise. Assume that we have articles \(A_1, \dotsc, A_n\). The vocabulary \(\mathcal{V} = \mathcal{V}_{A_1, \dotsc, A_n}\) is the set of unique words occurring in those articles. The set of all of our articles will also be called the document. For any article \(A\), the word count function is \[ \mathrm{wc}_{A}\colon \mathcal{V} \to \mathbb{N}, \quad w \mapsto \text{number of times } w \text{ occurs in } A. \]

Once we have fixed the vocabulary, it is possible to turn the word count functions into word count vectors. First, we merely need to decide on an order of the vocabulary—the alphabetic ordering is a canonical choice! As soon as we have ordered the vocabulary, we may write it as \(\mathcal{V} = \{w_1, \dotsc, w_m\}\), where \(m\) is the total number of words in the vocabulary. Finally, we declare that the word count vector of the article \(A\) is \[ v_A = \bigl(\mathrm{wc}_A(w_1), \dotsc, \mathrm{wc}_A(w_m)\bigr). \]

Normalizing word counts

Instead of only talking about single words, we can more generally deal with terms. These can also be sequences of words that appear in our documents. Depending on how long those sequences are, they may carry information about the context that cannot be inferred from single words alone. Consequently, the word count vectors from before may more generally be called term count vectors (even though this does not seem to be standard language).

In general, vectors have the advantage of being comparable by using distance measures such as the Euclidean distance. Depending on the variety of articles and the precise application, however, there might be a problem. To make this concrete, let me illustrate it with a simple artificial example. Take the sentence “I am sure I am right about this” and transform it into a word count vector. Using alphabetic ordering, you should end up with the following.

word about am I right sure this
count 1 2 2 1 1 1

Let us add a second “article” to our document. The text consists of that same sentence twice. Thus, we obtain

word about am I right sure this
count 2 4 4 2 2 2

as its word count vector.

If you want to consider these two articles as similar or even the same, you will need to normalize the vectors. This means that before comparing two word count vectors you will divide them by their length. In this case, this approach will lead to the two articles being seen as the same. However, there are also reasons for wanting to tell these two articles apart: Even if they deal with the same point, the second one puts stronger emphasis on it by repetition.

To sum it up, depending on the application you might want to think about normalization of your word count vectors. In any case, the resulting word count vectors will be called term frequencies (even if you did not normalize) from now on. This concludes the first half of the TF-IDF transformation.

Inverse document frequency

Term frequency vectors suffer from another problem: Words that certainly occur in almost every English article such as “a”, “the”, and “is” but do not carry meaning will influence the similarity of articles. There are two ways to deal with this problem (in fact, they are often used in conjunction). The first is to simply ignore those words: Create a list of so-called stop words that will be ignored when building the vocabulary. The second way is to penalize words occurring in almost every article and boost rare words. This is precisely what inverse document frequency is doing.

Before we arrive at the precise definition, let us look at it from another angle. A word count is a measure that is local to a single article, meaning it does not depend on the other articles in our document. If a single word count is high in that article, this might mean that it is an important word that potentially helps characterize the article. However, if this word count is equally high in all the other articles, then this word does not help us tell this article apart from the others (if everything is special, then nothing is). Thus, there is a need for a trade-off between local measures (such as a single word count of a certain article) and global measures.

Inverse document frequency is such a global measure. To define it, let us concentrate on a single word \(w\) in our vocabulary. The document frequency \(\mathrm{df}(w)\) is the number of articles that \(w\) appears in. Consequently, the inverse document frequency is \(1/\mathrm{df}(w)\).

Now we are able to describe the TF-IDF transformation as a whole: For any article \(A\) in the document, multiply each word count in its word count vector with the inverse document frequency. In formulae: \[ \mathrm{TFIDF}(A) = \left(\frac{\mathrm{wc}_A(w_1)}{\mathrm{df}(w_1)}, \dotsc, \frac{\mathrm{wc}_A(w_m)}{\mathrm{df}(w_m)}\right). \]
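
Before we hand this over to sklearn, here is a tiny from-scratch sketch of the transformation (deliberately ignoring normalization and the smoothing tweaks that libraries add), applied to the two artificial “articles” from above:

from collections import Counter

documents = [
    "i am sure i am right about this",
    "i am sure i am right about this i am sure i am right about this",
]
tokenized = [doc.split() for doc in documents]

# the vocabulary: unique words over all articles, in alphabetic order
vocabulary = sorted({word for doc in tokenized for word in doc})

# document frequency: in how many articles does each word occur?
document_frequency = {w: sum(w in doc for doc in tokenized) for w in vocabulary}


def tfidf(doc):
    word_count = Counter(doc)
    return [word_count[w] / document_frequency[w] for w in vocabulary]


print(vocabulary)           # ['about', 'am', 'i', 'right', 'sure', 'this']
print(tfidf(tokenized[0]))  # [0.5, 1.0, 1.0, 0.5, 0.5, 0.5]
print(tfidf(tokenized[1]))  # [1.0, 2.0, 2.0, 1.0, 1.0, 1.0]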

Applying the TF-IDF transformation (Transcript of jupyter notebook session)

Before we apply the TF-IDF transformation, it is obligatory to put aside some test data for evaluating our model later. Otherwise, a future Machine Learning model would have access to statistics of the entire dataset and may deduce statistics of the test dataset afterwards. However, the entire purpose of the train-test-split is to evaluate the model on data it has not seen before.

  import pandas as pd

  df = pd.read_json("../data/processed/data.json")
  df = df.loc[df["Procedures_Length"] > 0, [
      "Label",
      "Procedures",
      "Description",
      "Procedures_Length",
      "Description_Length",
      "Procedures_Description_Ratio"
  ]]

Making a train-test-split

With sklearn, splitting a DataFrame reduces to calling the train_test_split function from the model_selection module. The test_size argument determines the relative size of the test set.

  from sklearn.model_selection import train_test_split

  X, y = df.drop(columns=["Label"]), df["Label"]
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Note that we split up our target column Label from the rest so that it will not be included in the following transformations.

Fitting TfidfVectorizers

Since we have two text columns (Procedures and Description), it is best to fit two TfidfVectorizers so that the information contained in each of them is preserved separately. The rest of the features should be scaled, as certain models encounter numerical problems when two features live on very different scales (that is to say, one feature usually is very large, e.g. \(\gg 10^6\), while another only attains values between 0 and 1). To do all of this in one go, sklearn provides us with a ColumnTransformer that takes a list of tuples consisting of a column name and a transformer that should transform the corresponding column. Additionally, the ColumnTransformer’s remainder keyword argument may be another transformer that will be applied to the remaining columns. Here’s how to use it:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.compose import ColumnTransformer
  from sklearn.preprocessing import StandardScaler


  columnwise_tfidf = ColumnTransformer(
      [
          (
              "procedures",
              TfidfVectorizer(),
              "Procedures"
          ),
          (
              "desc",
              TfidfVectorizer(),
              "Description"
          )
      ],
      remainder=StandardScaler(),
      n_jobs=-1,
  )

First, each tuple starts with a name for the transformation, for later reference. Second, the TfidfVectorizer with default arguments constructs the TF-IDF vectors in almost the same way as explained in the mathematical background section above; the only difference is that the document frequency of each word is increased by one to prevent zero divisions. Third and last, the StandardScaler scales the remaining features such that they have zero mean and unit standard deviation.

Applying this ColumnTransformer to our train set follows the usual sklearn API. Each Transformer has fit and transform methods. Here, the first is used solely on the train set to fit the Transformer. Afterwards, the second may be used to transform both the train and test set.

  columnwise_tfidf.fit(X_train)
  X_train_transformed = columnwise_tfidf.transform(X_train)

Conveniently, most transformers have a fit_transform method that combines these two steps into one:

  X_train_transformed = columnwise_tfidf.fit_transform(X_train)

Extracting keywords

Let us use the fitted transformers to extract keywords from articles. First, we will extract the vocabulary as determined by the TfidfVectorizers. To distinguish between the words from the Procedures and the Description, we will prepend each of them with a prefix.

  def vocabulary():
      return (
          [f"proc__{name}" for name in columnwise_tfidf.named_transformers_["procedures"].get_feature_names()]
          + [f"desc__{name}" for name in columnwise_tfidf.named_transformers_["desc"].get_feature_names()]
      )

Note that the names we have provided for the TfidfVectorizers earlier now come into play.

Second, let’s write a function accepting an article and returning a DataFrame containing the words with the highest frequencies.

  def extract_keywords(article, topn=10):
      article_transformed = columnwise_tfidf.transform(article).toarray()[0]
      frequencies = list(zip(vocabulary(), article_transformed))
      frequencies.sort(key=lambda x: -x[1])
      return pd.DataFrame(frequencies[:topn])

Finally, let’s extract keywords from one of the most iconic SCP articles: The one for SCP-682. This is one of the best examples of Keter class SCPs.

  scp_682 = df.loc[df["Description"].str.startswith("SCP-682")].drop(columns=["Label"])
  extract_keywords(scp_682)
Table 6: Out[8]
  0 1
0 proc__682 0.767357
1 desc__kia 0.738121
2 desc__682 0.523255
3 desc__agent 0.171312
4 desc__personnel 0.156161
5 proc__speak 0.153737
6 proc__acid 0.144138
7 proc__to 0.133515
8 desc__pvt 0.110179
9 proc__scp 0.107281

This does not look too promising. First, maybe numbers should be ignored. Then, there are words like “to” and “of” that appear in almost every English article. “speak” might also not be telling us much. This will only get worse if we look at the top 30 keywords.

  extract_keywords(scp_682, topn=30)
Table 7: Out[9]
  0 1
0 proc__682 0.767357
1 desc__kia 0.738121
2 desc__682 0.523255
3 desc__agent 0.171312
4 desc__personnel 0.156161
5 proc__speak 0.153737
6 proc__acid 0.144138
7 proc__to 0.133515
8 desc__pvt 0.110179
9 proc__scp 0.107281
10 desc__handled 0.106319
11 proc__attempts 0.098297
12 proc__reacted 0.095920
13 desc__occurrence 0.095232
14 proc__incapacitation 0.091120
15 proc__of 0.090828
16 proc__fear 0.087715
17 proc__rage 0.087715
18 proc__hydrochloric 0.085073
19 proc__massive 0.085073
20 proc__frequent 0.082915
21 proc__provoking 0.082915
22 proc__breach 0.082463
23 desc__scp 0.081648
24 proc__should 0.080923
25 proc__lining 0.079510
26 proc__called 0.078116
27 proc__incapacitated 0.078116
28 proc__force 0.078011
29 proc__destroying 0.076869

Fine-tuning the TfidfVectorizer

Fortunately, TfidfVectorizer has a lot of options to fine-tune its behavior. First and maybe most importantly, we can enforce that certain words should be ignored via the stop_words keyword argument. It either expects the string “english” and then uses a list constructed by the sklearn developers (with its own set of disadvantages) or it expects a list of strings containing the words that shall be ignored. Second, we can specify a regex pattern via the token_pattern keyword argument. This pattern will be used when parsing the articles to build up the vocabulary. The standard pattern includes single words containing letters and numbers; we will modify it to only parse for words containing letters.

  columnwise_tfidf = ColumnTransformer(
      [
          (
              "procedures",
              TfidfVectorizer(
                  stop_words="english",
                  strip_accents='unicode',
                  token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b',
              ),
              "Procedures"
          ),
          (
              "desc",
              TfidfVectorizer(
                  stop_words="english",
                  strip_accents='unicode',
                  token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b'
              ),
              "Description"
          )
      ],
      remainder=StandardScaler()
  )

  columnwise_tfidf.fit(X_train)

Listing 1. Out[10].

  ColumnTransformer(n_jobs=None,
                    remainder=StandardScaler(copy=True, with_mean=True,
                                             with_std=True),
                    sparse_threshold=0.3, transformer_weights=None,
                    transformers=[('procedures',
                                   TfidfVectorizer(analyzer='word', binary=False,
                                                   decode_error='strict',
                                                   dtype=<class 'numpy.float64'>,
                                                   encoding='utf-8',
                                                   input='content',
                                                   lowercase=True, max_df=1.0,
                                                   max_features=None, min_df=1...
                                                   dtype=<class 'numpy.float64'>,
                                                   encoding='utf-8',
                                                   input='content',
                                                   lowercase=True, max_df=1.0,
                                                   max_features=None, min_df=1,
                                                   ngram_range=(1, 1), norm='l2',
                                                   preprocessor=None,
                                                   smooth_idf=True,
                                                   stop_words='english',
                                                   strip_accents='unicode',
                                                   sublinear_tf=False,
                                                   token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b',
                                                   tokenizer=None, use_idf=True,
                                                   vocabulary=None),
                                   'Description')],
                    verbose=False)

  extract_keywords(scp_682, topn=30)
Table 8: Out[11]
  0 1
0 desc__kia 0.890278
1 proc__speak 0.272335
2 proc__acid 0.255331
3 desc__agent 0.206627
4 proc__scp 0.190041
5 desc__personnel 0.188352
6 proc__attempts 0.174127
7 proc__reacted 0.169915
8 proc__incapacitation 0.161413
9 proc__fear 0.155381
10 proc__rage 0.155381
11 proc__hydrochloric 0.150702
12 proc__massive 0.150702
13 proc__frequent 0.146879
14 proc__provoking 0.146879
15 proc__breach 0.146078
16 proc__lining 0.140847
17 proc__called 0.138377
18 proc__incapacitated 0.138377
19 proc__force 0.138192
20 proc__destroying 0.136168
21 proc__containment 0.135959
22 desc__pvt 0.132891
23 proc__difficulty 0.132345
24 proc__submerged 0.132345
25 proc__best 0.130666
26 desc__handled 0.128236
27 proc__chamber 0.126861
28 proc__plate 0.125041
29 proc__development 0.123843

This looks much better. A few remarks:

  • I had to google the two abbreviations “kia” and “pvt”. The first is short for “killed in action” while the second stands for the military rank “Private”.
  • On second thought, “speak” may carry the information that the SCP object is able to speak and thus hint at it being sapient. As sapient SCPs are probably more likely to be of class Euclid or Keter, this could be valuable information for a Machine Learning model.
  • One could start building a custom list of stop words more suitable for parsing SCP articles; a minimal sketch of how such a list could be passed to the vectorizers follows below. In the list above, the words “best” and “called” as well as “scp” could be ignored. I will postpone this to the next part of this series of posts. Because some models give some insight into their learning process, we can use them to see if their decisions are based on filler words.
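
As a teaser, here is a minimal sketch (not part of the notebook above) of how a custom stop-word list could be combined with sklearn’s built-in English list. The extra words are just the candidates mentioned above, not a finished list.

  from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

  # Extend sklearn's English stop words with a few SCP-specific candidates.
  custom_stop_words = list(ENGLISH_STOP_WORDS) + ["scp", "best", "called"]

  custom_tfidf = TfidfVectorizer(
      stop_words=custom_stop_words,
      strip_accents="unicode",
      token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z]+\b",
  )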

Conclusion

In this blog post, we have learned how to use Jupyter Notebooks and the pandas library to extract basic statistics from SCP articles. Furthermore, we have used a basic TF-IDF transformation to extract keywords from SCP articles.

Effective Python 2nd Edition: What’s new?

Effective Python by Brett Slatkin is a book filled with best practices for the Python programming language. I devoured the first edition as an ebook and was eager to buy the second edition as a physical book. Having skimmed through it, I was already satisfied. The new edition features updated items and removes all Python 2 specific hints and workarounds. More specifically, it concentrates on language features of Python 3 up to and including Python 3.8.

However, I was a little confused that there was no tabular overview of the additions and changes. Certainly, it looks like each item was changed in one way or another (as most items in the first edition contained workarounds for Python 2). But if you are like me, owning the first edition of the book and wondering whether the new content is worth it, this blog post has you covered.

In case you want to do the comparison yourself, there is a table of contents of the 2nd edition on the main page of the book’s official website.

Entirely new items in Effective Python 2nd Edition

There are 27 entirely new items.

  • 4. Prefer Interpolated F-Strings Over C-style Format Strings and str.format
  • 6. Prefer Multiple Assignment Unpacking Over Indexing
  • 10. Prevent Repetition with Assignment Expressions
  • 13. Prefer Catch-All Unpacking Over Slicing
  • 14. Sort by Complex Criteria Using the key Parameter
  • 15. Be Cautious When Relying on dict Insertion Ordering
  • 16. Prefer get Over in and KeyError to Handle Missing Dictionary Keys
  • 17. Prefer defaultdict Over setdefault to Handle Missing Items in Internal State
  • 18. Know How to Construct Key-Dependent Default Values with __missing__
  • 19. Never Unpack More Than Three Variables When Functions Return Multiple Values
  • 29. Avoid Repeated Work in Comprehensions by Using Assignment Expressions
  • 33. Compose Multiple Generators with yield from
  • 34. Avoid Injecting Data into Generators with send
  • 35. Avoid Causing State Transitions in Generators with throw
  • 36. Consider itertools for Working with Iterators and Generators
  • 51. Prefer Class Decorators Over Metaclasses for Composable Class Extensions
  • 56. Know How to Recognize When Concurrency Is Necessary
  • 57. Avoid Creating New Thread Instances for On-demand Fan-out
  • 58. Understand How Using Queue for Concurrency Requires Refactoring
  • 59. Consider ThreadPoolExecutor When Threads Are Necessary for Concurrency
  • 61. Know How to Port Threaded I/O to asyncio
  • 62. Mix Threads and Coroutines to Ease the Transition to asyncio
  • 63. Avoid Blocking the asyncio Event Loop to Maximize Responsiveness
  • 74. Consider memoryview and bytearray for Zero-Copy Interactions with bytes
  • 79. Encapsulate Dependencies to Facilitate Mocking and Testing
  • 89. Consider warnings to Refactor and Migrate Usage
  • 90. Consider Static Analysis via typing to Obviate Bugs

Items heavily updated in Effective Python 2nd Edition

Most updates to existing items simply make the code samples target Python 3.7, with some notable exceptions that show 3.8-exclusive samples (the most prominent being the introduction of the walrus operator in Items 10 and 29). Most notably, the first item, Know Which Version of Python You’re Using, already makes it clear at the end of its first paragraph:

This book does not cover Python 2.

Keeping this in mind, we expect to see some updates. For instance, Item 3: Know the Differences Between bytes and str does not mention the Python 2 exclusive unicode anymore. Even so, there are certain items that have been updated so much that they contain new advice because of newly added language features. In particular, I want to mention the following items in this regard.

  • 48. Validate Subclasses with __init_subclass__ (former Item 33: Validate Subclasses with Metaclasses)
  • 49. Register Class Existence with __init_subclass__ (former Item 34: Register Class Existence with Metaclasses)
  • 50. Annotate Class Attributes with __set_name__ (former Item 35: Annotate class attributes with Metaclasses)
  • 60. Achieve Highly Concurrent I/O with Coroutines (former Item 40: Consider Coroutines to Run Many Functions Concurrently)

While the first three items in this list introduce new language features restricting the use cases for Metaclasses, the last one was updated to show the new way of defining Coroutines using the asyncio built-in module. Since this last item and the preceding items 56-59 feature Conway’s Game of Life as an example, you might also argue that item 60 belongs in the next section.

Items that have been split up into multiple more elaborate items

There are two items from the first edition that have been split into multiple items in the 2nd edition:

  • Item 46: Use Built-In Algorithms and Data Structures has been split up into
    • 71. Prefer deque for Producer–Consumer Queues
    • 72. Consider Searching Sorted Sequences with bisect
    • 73. Know How to Use heapq for Priority Queues
  • Item 56: Test Everything with Unittest has been split up into
    • 76. Verify Related Behaviors in TestCase Subclasses
    • 77. Isolate Tests from Each Other with setUp, tearDown, setUpModule, and tearDownModule
    • 78. Use Mocks to Test Code with Complex Dependencies

Item 46 from the first edition had more of an overview character, introducing data types from built-in modules. The corresponding items from the 2nd edition are more elaborate and provide an in-depth view with more code samples and example use cases.

Considering the importance of testing in general, and particularly in dynamically typed languages such as Python, I found Item 56 from the first edition too brief. In contrast, the new items elaborate much more on best practices in testing. Above all, Item 78 about mocks is a valuable addition, giving an example of how to use the unittest.mock built-in module to write unit tests for code depending on a database connection.

Conclusion

To sum it up, Effective Python 2nd Edition adds a lot of new content in comparison to its first edition. More precisely, there are 27 entirely new items. Additionally, two items have been split up into multiple, more elaborate items so that the new edition clocks in at 90 items in comparison to 59 items from the first edition.

Why I created my own fork of the Data Science Cookiecutter template

The Data Science Cookiecutter template is a great way to quickly set up your Data Science project. For instance, I have used and recommended it for my Machine Learning project as well as for a Data Analysis project at work. In this blog post, I want to emphasize four reasons why I created my own fork and will stop using the Data Science Cookiecutter template for future projects.

The reasons

The project repository is moving slowly

As of this writing, there have been 5 accepted commits on the master branch in 2019. Certainly, one could argue that this is due to the project being stable and close to finished. In contrast, however, there are 30 open issues and 11 pull requests with a lot of discussion. In particular, there is an approved pull request that encompasses multiple feature requests. Even so, it has not been merged into master as of this writing and has been open since March 2019.

The Data Science Cookiecutter template does not provide you with a test setup

Second, there is no test setup at the moment. There is an open pull request that suggests adding a test folder parallel to the project folder.

The Data Science Cookiecutter template does not provide you with a choice of requirements management

Third, even though there is a requirements.txt in the Cookiecutter template with sensible defaults, it might not work on your system. For instance, I cannot install scikit-learn via pip. Instead, I have to rely on using conda. Unfortunately, the template does not provide me with an option to choose the package manager. Again, there is a lot of discussion in an open issue.

There are no pre-defined make targets for recurring tasks

Finally, there are tasks that you will deal with time and time again, like splitting your dataset into a train and a test set, training a collection of models on the train set and, finally, evaluating them on the test set. Apart from the choice of which models to train and which metrics to use to evaluate them, these tasks are the same every time. Consequently, they should be automated via make targets.

Alternatives to the Data Science Cookiecutter template

If you have read this far and agree with (some of) the reasons, you might wonder what alternatives to the Data Science Cookiecutter template there are. In fact, there are a lot: As of this writing, there are 943 forks of the project on github. I am particularly fond of the Cookiecutter EasyData template. It provides you with a rich set of additional make targets as well as support for conda’s environment.yml. Furthermore, there is lots of example code for data transformations. As for the cons, I find the test setup too minimal. More precisely, the code supplied in the project folder is not tested. Instead, there is one single test file illustrating testing with Python’s builtin unittest module. Plus, using the project template seems to be quite involved, and it is not documented well enough. Once the maintainers have finished the tutorial project, this might be a good choice. I’ll definitely keep an eye on this project!

After evaluating a few more templates, each with their own strengths and weaknesses, I have finally decided to fork the Data Science Cookiecutter template to add the functionality I need myself. I suggest that you do too: Think of all the Data Science projects you have done so far and answer the following question: What kind of functionality did you need in all of them? Then, build that functionality into the Data Science Cookiecutter template yourself. As already mentioned, there are lots of examples to gain inspiration from. Additionally, the process of building the template yourself and thinking about it may expose weaknesses and bottlenecks of your current workflow: You may realize that in all of your projects you have spent time on a task that can be automated via a make target!

To sum it up, building your own Data Science template over time with the Data Science Cookiecutter template as a starting point will get rid of its weaknesses and empower your own Data Science workflow. If you need some inspiration, check out the forks of the Data Science Cookiecutter template. For reference, here is my own fork: GriP on Data Science.

Quickfix: jupyter nbconvert with clear-output flag not working

Jupyter comes with a command line utility jupyter nbconvert that can transform a jupyter notebook into different formats. Furthermore, it has options to tweak the output. For instance, the execute flag executes the notebook before transforming it while the clear-output flag is supposed to remove outputs from the notebook. Thus, if you want to execute the notebook notebook.ipynb and transform it into a pdf file afterwards, you can issue the command

[sourcecode language=”text” title=”Listing 1. Example usage of nbconvert. ” ]
jupyter nbconvert notebook.ipynb --execute --to pdf
[/sourcecode]

I stumbled upon the following problem when I tried to create a make target for the SCP project. The make target should do the following: It should clear the output of a specified notebook, execute it and then export it as a pdf file.

Problem description

Contrary to its purpose, the clear-output flag does not remove the output from any notebook. Suppose your Jupyter notebook is notebook.ipynb. Then,

[sourcecode language=”text” title=”Listing 2. Does not work: Using the clear-output flag. ” ]
jupyter nbconvert --clear-output notebook.ipynb
[/sourcecode]

merely saves the file notebook.ipynb again. The output remains in the notebook.

Solution

Unfortunately, this still seems to be an open issue. However, there is a more specific version of the command available that does exactly what we want:

[sourcecode language=”text” title=”Listing 3. Works: Using more specific options. ” ]
jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace notebook.ipynb
[/sourcecode]

It is not clear to me what the current status of the issue is; in fact, there are recent commits in other projects referencing the issue and the issue itself is labeled as upstream. Apparently, a dependency (traitlets) is causing this bug.

Classifying SCPs, Part 1: Building a web crawler

This first blog post in the series of our Data Science project of classifying SCPs is concerned with the starting point of any data-related problem, really: How do we obtain data? Fortunately, the data we want to build a Machine Learning model upon is readily available in the form of articles with standardized URLs. Thus, the task of obtaining the data comes down to writing a web crawler that parses the HTML underlying the articles and extracts the label (i.e., the Object Class) as well as the text for us.

In the following, I would like to describe my approach to this problem. Also, I will leave behind some exercises for you if you have not dealt with this kind of problem before and want to put some thought into it.

If you want to dive into the code right away and play around with the data, I have created a github repository with the project code.

The post should be digestible for beginning programmers with basic knowledge of Python.

About exercises

As a mathematician, I think that you can only truly learn and understand a concept when you try it out yourself: Think of the simplest circumstance where the concept is applicable and work out the nitty-gritty details. In the process, you will definitely learn some of the pitfalls and special cases; others will only appear with time, when you apply the new concept to more complex environments. Likewise, I think that programming concepts can only be truly grasped when they are applied. The exercises I have created might help you with this. Some are easy refactorings solvable with standard Python constructs; others will require you to use another library and read (part of) its documentation. Both are invaluable skills as a programmer: solving problems quickly with the programming language of your choice and incorporating new libraries by reading their documentation.

How to solve the exercises

In case you want to do the exercises and quickly get to the point where you are able to tackle one, I suggest cloning the git repository and using the tags I have created precisely for this purpose. The exercise tags take the form ex-<number>, e.g. ex-1 for the first exercise. Thus, you can simply check out the commit tagged with the exercise you want to tackle and start coding. For instance, if you want to tackle the first exercise, git checkout ex-1 will get you there. After you’re done, you can compare your solution with mine by issuing git diff sol-ex-1.

Note that my solution is merely a suggestion. If you have found another one that you think might be more appropriate for the problem, feel free to leave a comment or open up an issue on github.

Also, I have provided difficulty tags (beginner, intermediate and expert) for the exercises. The beginner difficulty features exercises that will hopefully help you learn programming language features; this may consist of reading about a programming construct in the Python docs and changing a keyword argument in a function. Intermediate difficulty signifies that you need to read about a certain feature in a library before being able to solve the exercise. Finally, expert level exercises will require even more reading about library features and use some advanced concepts that cannot be fully explained in the text (this blog post contains one expert exercise that requires you to do some research about mocking in Python).

Do not hesitate to try out intermediate or expert level exercises even if you still feel like a beginner. Even if you are not able to solve them completely, there is much to learn from them.

Setting up our Data Science project

Before we start this project proper we first have to lay out our project structure. As announced in the overview blog post, we will use a Cookiecutter template for this purpose. First things first: If you have not installed cookiecutter yet, a simple

[sourcecode language=”bash” title=”Listing 1. Installing cookiecutter. ” ]
pip install cookiecutter
[/sourcecode]

will do. It is a good idea to install cookiecutter globally. After installing cookiecutter, we will use the Data Science cookiecutter template by simply issuing the command

[sourcecode language=”text” title=”Listing 2. Using cookiecutter. ” ]
cookiecutter https://github.com/drivendata/cookiecutter-data-science
[/sourcecode]

You will be asked a few questions about the project. If you’re not sure how to answer, hitting enter will provide a sensible default (for instance, we don’t care about setting up S3 for now).

[sourcecode language=”text” title=”Listing 3. Cookiecutter Q&A. ” ]
project_name [project_name]: SCP
repo_name [scp]:
author_name [Your name (or your organization/company/team)]: Paul Grillenberger
description [A short description of the project.]: Classifying SCP articles
Select open_source_license:
1 - MIT
2 - BSD-3-Clause
3 - No license file
Choose from 1, 2, 3 (1, 2, 3) [1]:
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]:
aws_profile [default]:
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 (1, 2) [1]:
[/sourcecode]

After this, a folder with the project files has been created. It is good practice to put it under version control and create an initial commit immediately, for example via

[sourcecode language=”text” title=”Listing 4. Initializing git repository. ” ]
git init && git add -A && git commit -m "Initial Commit"
[/sourcecode]

To deal with dependencies for our project, we need to create a new conda environment. Fortunately, the provided makefile works with conda out of the box! Simply issue make create_environment to, you guessed it, create an environment. Afterwards, you need to use conda activate <environment name> to switch to the freshly created environment. Now, to install the requirements, a simple make requirements will do (note that I’ve added some additional requirements in the repository so be sure to add those as well if you’re starting from scratch). Now we are all set up.

Our main points of interest in this folder structure will be the folders src/data/ (for data-related source code) and data/ (where the actual raw and processed data will be placed). Explore the repository and read the README.md to get a feeling for the structure. When you’re ready, we can start to work on our project.

Rough introspection

To get an idea of what we are going to crawl, let us take a look at the SCP Foundation website in our browser. We are interested in the first one thousand SCPs, so we take a look at the first, say, five to get a rough idea of what each page looks like.

List of all SCP articles

I know that number one is a special SCP that strays far from the format. The other four look similar in terms of their format, though.

The article of SCP-001 is actually a collection of articles that are competing suggestions
An example of a regular SCP article, in this case SCP-005

It’s best to skip SCP-001 because of its exceptional nature. For the others, we will take a deeper look at the HTML structure now (because that’s what the web crawler libraries see).

Detailed introspection

In Chrome on my Mac, I hit Command-Alt-I to fire up the web developer console. Switching to the tab “Elements” yields the HTML source code of the current page. Hovering over a line of code highlights what it corresponds to on the rendered browser page. Using this, we quickly find out that the content sits inside a div element with the id page-content. However, most of its children are wrapped in p elements with no distinguishing attributes. A typical layout seems to look like this:

[sourcecode language=”html” title=”Listing 5. Typical SCP layout. ” ]
<div id="page-content">
  <div>
    <!-- image block -->
    <img src="SCP-005.jpg">
    <p>A close up of SCP-005</p>
  </div>
  <p><strong>Item #:</strong> …</p>
  <p><strong>Object Class:</strong> …</p>
  <p><strong>Special Containment Procedures:</strong> …</p>
  <p><strong>Description:</strong> …</p>
  <p><strong>Addendum / Additional notes / …:</strong> …</p>
</div>
[/sourcecode]

Writing the web crawler

The detailed introspection suggests the following approach: For each page, find all direct child p elements of the page-content div. Then, strip the HTML tags. The line starting with “Object Class” contains the target label; the remaining text contains the data we want to predict upon.

Constructing the URL from the SCP number

Let’s say that we want to crawl the article text from each SCP whose numbers are between 2 and 2500. Then the first task is to write a small function accepting a number and spitting out the URL of the corresponding SCP article. Taking a look at the URLs for SCPs #1, #11, #111, and #1111, we see that the URL format is

[sourcecode language=”text” title=”Listing 6. URL format for SCP articles. ” ]
http://scp-wiki.net/scp-<number>
[/sourcecode]

where the number is padded with leading zeros so that it takes up at least three digits. Because I like to proceed test-driven, I create two files in the folder src/data/: a file webcrawl.py for the actual source code and a file test_webcrawl.py for tests. In webcrawl.py, let us create a prototype of our function:

[sourcecode language=”python” title=”Listing 7. webcrawl.py. ” ]
def construct_url(scp_number):
    pass
[/sourcecode]

In test_webcrawl.py, we create a prototype test to get us started:

[sourcecode language=”python” title=”Listing 8. test_webcrawl.py. ” ]
from .webcrawl import construct_url

def test_construct_url():
    assert False
[/sourcecode]

From the command line, issue the command pytest. As expected, pytest complains that one of our tests fails (in this case, of course, for trivial reasons):

[sourcecode language=”text” title=”Listing 9. pytest complaining. ” ]
    def test_construct_url():
>       assert False
E       assert False

src/data/test_webcrawl.py:5: AssertionError

src/data/test_webcrawl.py ⨯ 100% ██████████

Results (0.14s):
1 failed
- src/data/test_webcrawl.py:4 test_construct_url
[/sourcecode]

Okay, this means that our setup works. Now let us put some real assertions for our yet-to-be-written function in there:

[sourcecode language=”python” title=”Listing 10. Our first real test. ” ]
def test_construct_url():
    assert "http://scp-wiki.net/scp-001" == construct_url(1)
    assert "http://scp-wiki.net/scp-011" == construct_url(11)
    assert "http://scp-wiki.net/scp-111" == construct_url(111)
    assert "http://scp-wiki.net/scp-1111" == construct_url(1111)
[/sourcecode]

This time, pytest complains because our function does not do what we expect it to do yet:

[sourcecode language=”text” title=”Listing 11. Our primary target: Get pytest to shut up. ” ]
    def test_construct_url():
>       assert "http://scp-wiki.net/scp-001" == construct_url(1)
E       AssertionError: assert 'http://scp-wiki.net/scp-001' == None
E        +  where None = construct_url(1)

src/data/test_webcrawl.py:5: AssertionError

src/data/test_webcrawl.py ⨯ 100% ██████████

Results (0.09s):
1 failed
- src/data/test_webcrawl.py:4 test_construct_url
[/sourcecode]

In test-driven development, this means we are in phase “RED” now: We have written a test that tells us exactly when we have established our desired functionality. Our target is to get to phase “GREEN” as quickly as possible. That means we can finally write some code. To pad a given integer with zeros to at least three digits, we can use elementary Python string formatting:

[sourcecode language=”python” title=”Listing 12. This shuts pytest up. ” ]
BASE_URL = "http://scp-wiki.net/"
SCP_ROUTE_TEMPLATE = "scp-{number:03d}"


def construct_url(scp_number):
    return BASE_URL + SCP_ROUTE_TEMPLATE.format(number=scp_number)
[/sourcecode]

Running pytest afterwards tells us that our one test has passed. We are in phase “GREEN” now. We can now safely refactor our code until we are satisfied with it. Whenever we make changes and let the tests run, we can be confident that our code still works as expected. Sometimes, this is called the “REFACTOR” phase of TDD. I will leave this phase to you in the following exercises.

Exercises

  • Git Tag: ex-1   beginner

    Get rid of the global variables BASE_URL and SCP_ROUTE_TEMPLATE and use f-Strings to refactor construct_url. Be sure to let the tests run afterwards to see if you still get the desired outcome.

  • Git Tag: ex-2   beginner intermediate

    In my opinion, it is perfectly fine to violate the DRY (Don’t repeat yourself) principle when writing tests to keep them simple. However, pytest provides some decorators that help us generate test cases when we simply want to check certain function outputs for varying inputs. Use the pytest.mark.parametrize decorator to basically turn our test into a one-liner (a generic sketch of the decorator, unrelated to this exercise, follows below).
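
If you have never seen the decorator before, here is a minimal, generic sketch of how pytest.mark.parametrize works. It is deliberately not the solution to the exercise; the function double is made up purely for illustration.

[sourcecode language="python" title="A generic pytest.mark.parametrize sketch. " ]
import pytest


def double(x):
    return 2 * x


# Each tuple becomes its own test case, with its entries bound to the names in the first argument.
@pytest.mark.parametrize("value, expected", [(1, 2), (5, 10), (0, 0)])
def test_double(value, expected):
    assert expected == double(value)
[/sourcecode]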

Filtering for the page content

After having constructed the URL, the logical next step is to use it and request the data from the server. Fortunately, the requests library solves this for us; a simple call to requests.get will do. Even so, we do not need all the information that a call to requests.get returns (we do not need the header data from the response, we do not need the HTML header data…). Thus, our task will be to use the BeautifulSoup library to filter for everything within the div element with the id “page-content”. To test whether we obtain the correct data, let us first write a main function that will serve as the entry point to our script.

[sourcecode language=”python” title=”Listing 13. webcrawl.py. ” ]
import argparse

import requests

# construct_url as before…


def crawl_for(scp_number):
    url = construct_url(scp_number)
    response = requests.get(url)
    content = response.text
    return content


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--number",
        type=int,
        dest="scp_number",
        default=2,
        help="Number of the SCP article to obtain.",
    )
    args = parser.parse_args()
    print(crawl_for(args.scp_number))
[/sourcecode]

A call to the get function of the requests library returns a Response object whose content can be retrieved via the text attribute. If the script is called via the command line, we use the argparse module to parse command line arguments. In this case, we accept an optional argument --number that defaults to 2. If you call webcrawl.py from the command line now, the whole HTML of SCP #2 should be printed out. However, as mentioned in the introduction, we are only interested in the children of a certain div element.

To go on in a test-driven manner, we wrap a prototype function around the response.text and write a test for it.

[sourcecode language=”python” title=”Listing 14. webcrawl.py. ” ]
def filter_for_page_content(page):
    pass


def crawl_for(scp_number):
    url = construct_url(scp_number)
    response = requests.get(url)
    content = filter_for_page_content(response.text)
    return content
[/sourcecode]
[sourcecode language=”python” title=”Listing 15. test_webcrawl.py. ” ]
from .webcrawl import construct_url, filter_for_page_content

TEST_PAGE = """
<h1>Some scp</h1>
<div id="page-content">
<p>Some paragraph.</p>
</div>
"""

# test_construct_url omitted…


def test_filter_for_page_content():
    expected = """
<div id="page-content">
<p>Some paragraph.</p>
</div>
""".strip()
    actual = str(filter_for_page_content(TEST_PAGE))
    assert expected == actual
[/sourcecode]

Of course, this is only a basic test, but it is sufficient for our purposes. More concretely, we want to make sure that the function extracts precisely the content from the HTML that we care about, namely the div element with the id page-content and its children. Because we have not written any code yet, pytest should signal that we are in phase RED again. Now BeautifulSoup enters the picture. The main entry point for web crawling is the BeautifulSoup object of the bs4 module. Its constructor takes the raw content. The resulting instance has a find method that can be used to find the first element with a certain name and attributes, and that’s precisely the functionality we need! The function implementation comes down to this:

[sourcecode language=”python” title=”Listing 16. webcrawl.py. ” ]
# Other imports…
from bs4 import BeautifulSoup

# construct_url omitted…


def filter_for_page_content(page):
    return BeautifulSoup(page).find(name="div", attrs={"id": "page-content"})
[/sourcecode]

Our tests should pass again. However, pytest gives us a few warnings that we will deal with in the exercises.

Exercises

  • Git Tag: ex-3   beginner

    Add the features keyword argument to the constructor of the BeautifulSoup object and assign the value "html.parser" to it to silence the warnings. Read the doc-string of the BeautifulSoup object to find out why this may be useful. Note that you might still encounter another warning concerning the import of “ABCs” from collections instead of from collections.abc. At the time of this writing, this seems to be an issue within the BeautifulSoup library itself that we can do nothing about.

  • Git Tag: ex-4   intermediate

    Use the click library instead of argparse to parse the command line arguments. In the repository, the file src/data/make_dataset.py contains a good template if you get stuck. Note that you may have to move the print statement to the crawl_for function and use the echo() function instead.

Splitting the filtered content into the label and the text

After having completed two straightforward tasks, let us come to the heart of our problem. We have extracted the main part of an SCP article and want to split it into the object class of the underlying SCP and the article text. Before we think about a solution to this problem, let us implement a prototype function.

[sourcecode language=”python” title=”Listing 17. webcrawl.py. ” ]
def split_into_label_and_text(raw_text):
    pass
[/sourcecode]

Then, let us write a test. Because this might not be straightforward, let me spell out my thoughts here. The typical input of the split_into_label_and_text function is a BeautifulSoup object containing all children of the div element with the id page-content. In particular, this BeautifulSoup object might contain a div element containing an image and it contains a div element containing the footer with the links to the previous and the next SCP article. What I want the function to do is the following:

  1. It should return a tuple. The first element should be the label (i.e. the object class), the second should be the remaining p elements containing the object number, the Special Containment Procedures, the description, and, if present, any addenda.
  2. The label should be in upper case.
  3. The image and the footer should not be returned by the function.

Having worked out these requirements, a simple test case is not hard to pull off. We can use the typical SCP HTML structure that we have worked out in the detailed introspection as a template and boil it down a little. Here’s what I came up with.

[sourcecode language=”python” title=”Listing 18. test_webcrawl.py. ” ]
from bs4 import BeautifulSoup
# The other imports and functions remain untouched…


def test_split():
    test_content = BeautifulSoup(
        """
<div>
<img src="scp-xxx.jpg">
<p>Some caption</p>
</div>
<p><strong>Item #:</strong> SCP-xxx</p>
<p><strong>Object Class:</strong> Safe</p>
<p><strong>Special Containment Procedures:</strong></p>
<p><strong>Description:</strong></p>
<p>Other…</p>
<div>
<p>Footer data</p>
</div>
""",
        features="html.parser",
    )
    actual_label, actual_content = split_into_label_and_text(test_content)
    expected_label = "SAFE"
    expected_content = [
        "<p><strong>Item #:</strong> SCP-xxx</p>",
        "<p><strong>Special Containment Procedures:</strong></p>",
        "<p><strong>Description:</strong></p>",
        "<p>Other…</p>",
    ]
    assert expected_label == actual_label
    assert expected_content == [str(p) for p in actual_content]
[/sourcecode]

Note that the test_content contains both a div element containing an image and another div element mocking footer data. As you can see in the list expected_content, I do not want these to be returned by the function. As expected, this test will fail, simply because the returned None value cannot be unpacked into an actual_label and an actual_content.

Unfortunately, BeautifulSoup cannot help us directly with implementing this function because the object class sits inside a p element without any distinguishing properties. The only safe way to obtain it is to search the text for the first occurrence of the string “Object Class”. Here’s my implementation.

[sourcecode language=”python” title=”Listing 19. webcrawl.py. ” ]
def split_into_label_and_text(raw_text):
    paragraphs = raw_text.find_all("p")
    obj_class_p = next(p for p in paragraphs if "Object Class" in p.get_text())
    paragraphs.remove(obj_class_p)
    label = obj_class_p.contents[-1].strip().upper()
    return label, paragraphs
[/sourcecode]

A lot is happening in those five lines so let me guide you through them step by step.

  1. We use the find_all method to obtain a list of all p elements.
  2. The expression p for p in paragraphs if "Object Class" in p.get_text() is a generator expression that lazily yields the elements of the paragraphs list satisfying the condition. The built-in function next() evaluates the generator once and thus gives us the first p element containing the string “Object Class” (a tiny standalone example follows after this list).
  3. We remove the p element containing the object class from the list.
  4. Finally, to obtain the label transformed to uppercase, we use the contents attribute, which is a list of the children of the tag, and take its last element (index -1). Because the string “Object Class” itself is contained in a strong element, this gives us the label. The strip and upper methods are built-in methods of the string class.
  5. We return a tuple.
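
If the combination of a generator expression and next() is new to you, here is a tiny standalone example (the numbers are made up purely for illustration):

[sourcecode language="python" title="A small next() example. " ]
numbers = [1, 5, 8]

# The generator expression lazily yields matching elements; next() stops at the first one.
first_large = next(n for n in numbers if n > 3)
print(first_large)  # prints 5

# If nothing matches, next() raises StopIteration instead of returning a value:
# next(n for n in numbers if n > 10)  # would raise StopIteration
[/sourcecode]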

This implementation still lets the tests fail. The reason is that we return all p elements as the second tuple element, even the mocked image caption and the footer data. The solution is to only look at the direct children that are p elements. This will be implemented in the exercises.

Exercises

  • Git Tag: ex-5   beginner

    Use the recursive argument of the find_all method to let the tests pass.

  • Git Tag: ex-6   beginner

    Update the crawl_for method so that it uses the freshly-implemented split_into_label_and_text function and print out the label and the paragraphs.

Writing the results to a text file

After we have obtained the label and the text of the article, we merely have to persist this data so that it can be analyzed and transformed later. The easiest way is to write each article to a text file, with the label on the first line.

Additionally, we will add a new click.argument to our command line script that allows us to submit a file path where the articles should be saved. If you have not done the refactoring exercises yet the following code samples will contain spoilers.

Here’s how it could go.

[sourcecode language=”python” title=”Listing 20. webcrawl.py. ” ]
import os

import click

# other imports and functions as before…


@click.command()
@click.argument("filepath", type=click.Path(exists=True))
@click.option("--number", default=2, help="Number of the SCP article to obtain.")
def crawl_for(number, filepath):
    url = construct_url(number)
    response = requests.get(url)
    content = filter_for_page_content(response.text)
    label, paragraphs = split_into_label_and_text(content)
    # Write the label to the first line, followed by the plain text of each paragraph.
    with open(os.path.join(filepath, f"scp-{number:03d}.txt"), "w") as f:
        f.write(label + "\n")
        for paragraph in paragraphs:
            f.write(paragraph.get_text() + "\n")
[/sourcecode]

Now everything falls into place. If you are in the root directory of the project repository, a simple python src/data/webcrawl.py data/raw/ will write the contents of the article about SCP-002 into the text file data/raw/scp-002.txt. Because we do not want to type in this command a thousand times, it remains to refactor the crawl_for function to accept a range of numbers of SCP articles whose contents should be crawled.

Exercises

If you are a fan of test-driven development, you might have wondered why I simply added the last lines of code without providing a test. Also, you might wonder why there are no tests for the crawl_for function. The reason is that both depend on external resources or libraries (the last lines of code depend on a writable directory and the crawl_for function uses the requests library to fetch data from the internet). There are solutions for these kinds of problems, but they might distract a little from the main task, so I have decided to put them into exercises (ex-8 and ex-9).

  • Git Tag: ex-7   beginner

    Remove the --number option and supply two options --lower and --upper. These should be the lower and upper bounds of the numbers of SCP articles the command line script should fetch. Remember to provide sensible default values as well as a help text.

  • Git Tag: ex-8   intermediate

    Refactor the last three lines of the crawl_for function into another function that accepts a file object (in the crawl_for function, this is the f variable). Test this function by providing an appropriately prepared StringIO object.

  • Git Tag: ex-9   expert

    Add a test for the crawl_for function by substituting the calls of the get function of the requests library and the open built-in function with appropriate Mock objects. Here are a few pointers:

A short digress on logging

If you let the web crawler run with default arguments (I chose to crawl every SCP between 2 and 1000) the script will fail miserably after a few seconds.

[sourcecode language=”text” title=”Listing 21. An ungraceful exit of a program: An exception. ” ]
~/Data Science Projects/scp [tags/sol-ex-9] λ python src/data/webcrawl.py data/raw/
Traceback (most recent call last):
  File "src/data/webcrawl.py", line 47, in <module>
    crawl_for()
  File "/Users/paul/anaconda/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/paul/anaconda/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/paul/anaconda/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/paul/anaconda/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "src/data/webcrawl.py", line 41, in crawl_for
    label, paragraphs = split_into_label_and_text(content)
  File "src/data/webcrawl.py", line 20, in split_into_label_and_text
    obj_class_p = next(p for p in paragraphs if "Object Class" in p.get_text())
StopIteration
[/sourcecode]

Apparently, there is a problem with finding the Object Class, but it is not immediately clear what that problem could be. Looking into the folder data/raw, we see that only seven SCP articles have been written to text files. Thus, the problem occurred when crawling the eighth article. Taking a look at the article page of SCP-008, we see that it is behind a JavaScript wall; the article itself is only generated when a certain link is followed. Thus, there is no p element containing the Object Class, the generator expression yields nothing, and next() raises a StopIteration.

The article of SCP-008 sits behind this javascript link
Hitting the javascript link reveals the proper article

This is an example of a problem that occurs in programming all the time. You have a certain expectation and build your program with that expectation in mind; however, your expectation differs ever-so-slightly from reality and all of a sudden your program stops working. Tests will help you make sure that your program works according to your expectations but they won’t help you when your expectations are off. Additionally, analyzing why your program stopped working can become a huge hassle.

Fortunately, there is a general tool that will help you make your program much more communicative and your analysis much easier: Logging.

Test for the expected, log for the unexpected.

Logging not only helps you in cases where something is wrong. It may also improve the user experience a little. From the moment your script starts, the user does not get any feedback until an exception is thrown. However, your user might be interested in which SCP article the program is currently dealing with, because they might want to know how long it takes until it is finished.

A short introduction to the python logging module

The built-in logging module helps us with both needs. A Python logger has three vital parts:

  1. A name. This can be used to configure a logger from within a configuration file. Usually, a logger is named after the module it is constructed in. This implies a hierarchical structure: If you wish to activate loggers for a certain module, the loggers in submodules get activated as well, assuming they adhere to this naming convention.
  2. A logging level. This determines which messages the logger filters out and which it lets through. In Python, there are the levels critical, error, warning, info, debug, and notset, sorted from most restrictive to letting every message through. Usually, you will use the debug level to leave behind some breadcrumbs for, you guessed it, debugging by your future self or a colleague, the info level for general information about where your program is and what its state is, the warning level for program states that are unusual and might hint at a problem arising in the near future, and the error and critical levels for exceptions and critical conditions that will keep your program from finishing successfully.
  3. One or more handlers. A handler defines how to deal with logging messages: should they be printed to the console, or should they be written to a log file? A handler has its own logging level and is thus able to filter out specific logging messages. Also, it defines a logging format (through a formatter) that can add contextual information.

We will define a logger with the module name (the magic __name__ attribute) and three handlers: one handler (console_handler) that logs info messages to the console (so the user will at least see where the program is at), and two FileHandlers that log debug messages and warnings to two separate files. The warnings log will help us quickly identify that something went wrong while the debug log will give us detailed information why.

[sourcecode language=”python” title=”Listing 22. Defining a logger and logging handlers. ” ]
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
std_formatter = logging.Formatter(
    "[%(asctime)s][%(levelname)-5s][%(threadName)s] - %(message)s"
)
warn_file_handler = logging.FileHandler("warnings.log")
warn_file_handler.setFormatter(std_formatter)
warn_file_handler.setLevel(logging.WARN)
debug_file_handler = logging.FileHandler("debug.log")
debug_file_handler.setFormatter(std_formatter)
debug_file_handler.setLevel(logging.DEBUG)
console_handler = logging.StreamHandler()
console_handler.setFormatter(std_formatter)
console_handler.setLevel(logging.INFO)
logger.addHandler(console_handler)
logger.addHandler(warn_file_handler)
logger.addHandler(debug_file_handler)
[/sourcecode]

The logger definition itself is pretty useless until we emit logging messages. I have decided to put a few info messages at the beginning and the end of our script as well as some debug messages that log the content of intermediate results. Furthermore, I added a warning message if the label we have identified is not one of SAFE, EUCLID or KETER.

[sourcecode language=”python” title=”Listing 23. Emitting log messages. ” ]
# click decorators omitted…
def crawl_for(lower, upper, filepath):
    logger.debug(
        "Called with lower = %s, upper = %s, filepath = %s", lower, upper, filepath
    )
    for number in range(lower, upper):
        logger.info("Crawling number %d", number)
        url = construct_url(number)
        logger.debug("URL: %s", url)
        response = requests.get(url)
        logger.debug("Response: %s", response.text)
        content = filter_for_page_content(response.text)
        logger.debug("Content: %s", content)
        label, paragraphs = split_into_label_and_text(content)
        logger.info("Identified label %s", label)
        logger.debug("Paragraphs: %s", paragraphs)
        if label not in ("SAFE", "EUCLID", "KETER"):
            logger.warn("Unknown label %s for number %d", label, number)
        with open(os.path.join(filepath, f"scp-{number:03d}.txt"), "w") as f:
            write_to(f, label, paragraphs)


if __name__ == "__main__":
    logger.info("Start webcrawling…")
    crawl_for()
    logger.info("End webcrawling…")
[/sourcecode]

Letting the script run now, we get a nice status message for each number and immediately see that something fails when crawling the article for SCP-008:

[sourcecode language=”text” title=”Listing 24. Info messages help narrowing down bug causes immediately. ” ]
[2019-11-13 15:03:19,796][INFO ][MainThread] - Crawling number 7
[2019-11-13 15:03:20,398][INFO ][MainThread] - Identified label EUCLID
[2019-11-13 15:03:20,398][INFO ][MainThread] - Crawling number 8
Traceback (most recent call last):
  File "src/data/webcrawl.py", line 78, in <module>
    crawl_for()
(Rest of traceback omitted)
[/sourcecode]

For the moment, I will simply catch the exception, emit an error logging message and continue the for loop.

[sourcecode language=”python” title=”Listing 25. Catching the exception and sending it to the warnings log. ” ]
# Outer for loop omitted…
logger.debug("Content: %s", content)
try:
    label, paragraphs = split_into_label_and_text(content)
except Exception:
    logger.exception("Exception when splitting for number %d", number)
    continue
logger.info("Identified label %s", label)
[/sourcecode]

As a closing remark for this section, I would like to mention that logging ultimately comes down to personal preference, be it your own or that of your team. Arguably, a plethora of logging calls may pollute your code and obfuscate the meaning. It can be hard to find the right balance – that only comes with experience.

Exercises

  • Git Tag: ex-10   beginner

    Add a debug message to the split_into_label_and_text function that logs the content of the paragraphs variable.

  • Git Tag: ex-11   intermediate expert

    While the logging configuration is necessary, it also pollutes the program: A reader of the source code has to wade through 15 lines of code detailing the logging configuration when they might simply want to find out how the internal logic of the program works. Therefore, use logging.config.fileConfig to put the logging configuration into a file. Here are a few hints that help you avoid some pitfalls I stumbled upon when building my configuration file:

    • First, the section on the file format in the docs contains a few examples that should help you on your journey.
    • Second, the script will be called from the root directory of the project. As a result, it is feasible to also put the logging config into the root directory.
    • Third, since we will be calling the script directly via the command line, the __name__ attribute will equal "__main__". I suggest configuring the root logger with the three handlers and a debug log level and one additional logger as follows.

      [sourcecode language=”text” title=”Listing 26. Definition of a __main__ logger. ” ]
      [logger_main]
      level=DEBUG
      propagate=1
      handlers=
      qualname=__main__
      [/sourcecode]

      The propagate flag passes every log message on to the handlers of parent loggers. Because the root logger is, as its name suggests, a parent of every logger, the handlers of the root logger will deal with the log messages emitted by the main logger, even though the main logger does not define any handlers itself.

    • Finally, it is possible that we will modify the logging config in the future. Also, your logging preferences might differ from mine. Consequently, I only put a file logging_config_template.ini under version control and added the actual logging_config.ini to .gitignore.

    Note that it is also possible to configure logging with a YAML file, parse it with the library pyyaml and feed the resulting dictionary into dictConfig. The plain fileConfig is older and does not support every feature that dictConfig does, so the dictConfig route seems to be the new best practice for configuring logging via a file; a minimal sketch follows below.
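
For the curious, here is a minimal, self-contained sketch of that route. The configuration values are placeholders for illustration, not the configuration used in the project (you need pyyaml installed):

[sourcecode language="python" title="A minimal dictConfig sketch. " ]
import logging
import logging.config

import yaml

# A tiny placeholder config: everything at INFO level goes to the console.
CONFIG_YAML = """
version: 1
formatters:
  std:
    format: "[%(asctime)s][%(levelname)-5s] %(message)s"
handlers:
  console:
    class: logging.StreamHandler
    level: INFO
    formatter: std
root:
  level: INFO
  handlers: [console]
"""

logging.config.dictConfig(yaml.safe_load(CONFIG_YAML))
logging.getLogger(__name__).info("Logging is configured.")
[/sourcecode]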

Speeding it up with threading

We could let the script run now and it would fetch the first one thousand SCP articles for us; however, it would take some time. On my machine, each crawl of a single SCP article takes about 600 ms. That is to say, a thousand crawls will take about 600 seconds, which is 10 minutes. Analysing the timestamps in the debug.log, it seems that most of that time is spent waiting for the GET request to deliver the data.

Multi-threading and multi-processing in Python

Coming from Java, a language that naturally supports concurrency, I was surprised to learn that Python distinguishes between multi-threading and multi-processing. This is due to the global interpreter lock, which assures “that only one thread executes Python bytecode at a time.” To clarify, multi-threading in Python refers to concurrent programming within a single process while multi-processing distributes the work over multiple processes that can run on different processor cores. As a rule of thumb, multi-threading is useful when you want to make IO-heavy tasks (such as waiting for request responses and reading from or writing to files) concurrent. For computation-heavy tasks (such as solving equations, training Machine Learning models…), stick to multi-processing.
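
To make the rule of thumb concrete, here is a small, self-contained sketch; the toy functions io_task and cpu_task are made up for illustration and are not part of the project.

[sourcecode language="python" title="Threads for IO-bound work, processes for CPU-bound work. " ]
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def io_task(n):
    time.sleep(0.1)  # stands in for waiting on a GET request
    return n


def cpu_task(n):
    return sum(i * i for i in range(n))  # stands in for heavy computation


if __name__ == "__main__":
    # IO-bound: threads spend their time waiting, so the GIL barely matters.
    with ThreadPoolExecutor(max_workers=8) as executor:
        print(list(executor.map(io_task, range(8))))

    # CPU-bound: processes sidestep the GIL and can use multiple cores.
    with ProcessPoolExecutor(max_workers=4) as executor:
        print(list(executor.map(cpu_task, [10**6] * 4)))
[/sourcecode]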

Implementing multi-threading via ThreadPoolExecutor

Using the tools in the concurrent.futures module, turning a for loop concurrent can be done using an easy-to-follow pattern. I would like to call it the Concurrency Refactoring Pattern or CRP for short.

  1. Refactor everything in the for loop into a function accepting the variable that is iterated over as its argument.
  2. Replace the for loop with a with statement initialising a ThreadPoolExecutor.
  3. Replace the call of the new function with a call to the map method of the initialised executor with the new function and the iterable that was iterated over in the for loop as its arguments.

To make this pattern clearer, here are code samples illustrating the CRP.

[sourcecode language=”python” title=”Listing 27. A for loop after the first step of the CRP. ” ]
for x in it:
    do_something(x)
[/sourcecode]
[sourcecode language=”python” title=”Listing 28. Implementing the second step of the CRP. ” ]
with ThreadPoolExecutor(max_workers=64) as executor:
    do_something(x)
[/sourcecode]
[sourcecode language=”python” title=”Listing 29. Final step of the CRP. ” ]
with ThreadPoolExecutor(max_workers=64) as executor:
    executor.map(do_something, it)
[/sourcecode]

Even though the amount of code does not differ that much, a lot is happening in these two lines. First, the ThreadPoolExecutor is initialised with a maximum of 64 workers (threads). Think of this ThreadPoolExecutor as a manager that gives the workers something to do. In addition, it handles the case where there is not enough work for the number of workers we requested (imagine the case where we only want to obtain 60 SCP articles but initialised the ThreadPoolExecutor with 64 max_workers; in this case, only 60 threads would be started). Second, the map method initiates the distribution of work among the workers. It accepts a function and an iterable as its arguments; the iterable is consumed to obtain the arguments that the workers feed into the function.

In our case, the situation is slightly more complicated as our function depends on two arguments: the filepath and the number. Even though the filepath does not change in the for loop, we still have to supply an iterable with the same length as the range we are iterating over. Here’s how it turns out.

[sourcecode language=”python” title=”Listing 30. Applying the CRP to our script. ” ]
# other imports unchanged…
from concurrent.futures import ThreadPoolExecutor


# other functions unchanged…
def crawl(filepath, number):
    logger.info("Crawling number %d", number)
    url = construct_url(number)
    logger.debug("URL: %s", url)
    response = requests.get(url)
    logger.debug("Response: %s", response.text)
    content = filter_for_page_content(response.text)
    logger.debug("Content: %s", content)
    try:
        label, paragraphs = split_into_label_and_text(content)
    except Exception:
        logger.exception("Exception when splitting for number %d", number)
        return
    logger.info("Identified label %s", label)
    logger.debug("Paragraphs: %s", paragraphs)
    if label not in ("SAFE", "EUCLID", "KETER"):
        logger.warn("Unknown label %s for number %d", label, number)
    with open(os.path.join(filepath, f"scp-{number:03d}.txt"), "w") as f:
        write_to(f, label, paragraphs)


# click decorators omitted…
def crawl_for(lower, upper, filepath):
    logger.debug(
        "Called with lower = %s, upper = %s, filepath = %s", lower, upper, filepath
    )
    with ThreadPoolExecutor(max_workers=64) as executor:
        executor.map(
            crawl, (filepath for _ in range(lower, upper)), range(lower, upper)
        )
[/sourcecode]

As you can see, you should supply the map method with as many iterables as your function has arguments.
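
If building a throwaway generator expression for an argument that never changes feels clumsy, itertools.repeat from the standard library is an alternative. Here is a sketch of how the call from Listing 30 could look with it; executor.map stops as soon as the shortest iterable (the range) is exhausted, so the endless repeat is harmless.

[sourcecode language="python" ]
from itertools import repeat

# repeat(filepath) yields the same filepath over and over;
# the range determines how many calls are actually made
with ThreadPoolExecutor(max_workers=64) as executor:
    executor.map(crawl, repeat(filepath), range(lower, upper))
[/sourcecode]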

Exercises

  • Git Tag: ex-12   beginner

    Turn the max_workers number 64 into a click option.

Clean up

After having run the web crawler, watching the 64 threads delightfully crunching through the pages and punching them into text files, it is time to take a look at the results. There are quite a few warnings and errors logged into our warnings.log. Let’s take a look at them and see if we have to modify the web crawler and re-run it once more.

Errors: SCP pages we were not able to crawl correctly

Using the warnings.log, we can estimate how many errors occurred.

[sourcecode language="bash" title="Listing 31. Counting errors. " ]
grep "\[ERROR\]" warnings.log | wc -l
[/sourcecode]

Here, it pays off that we incorporated the log level into our log format. Note that we have to escape the square brackets with backslashes because they have a special meaning in regular expressions. In my run, I got 12 error log messages. Taking a closer look at them, we see that there are quite a few SCPs that have a javascript wall in front of them. For instance, we already know about SCP-8. Others have a slightly different HTML structure: SCP-285 has another div element wrapped around the p elements with the content we are interested in. I plan on ignoring all of them for the moment.

Warnings: Unknown labels

Using the warnings.log, we can estimate how many of the crawled SCPs have been assigned an unexpected label. A quick grep combined with the word count utility comes to the rescue:

[sourcecode language="bash" title="Listing 32. Counting warnings. " ]
grep "Unknown label" warnings.log | wc -l
[/sourcecode]

For my run with default bounds, this yields 56 unknown labels. Closer inspection shows that most of them are not truly unknown labels but known labels with further information. For instance, SCP-417 is classified as Euclid but the author wanted to note that it could potentially be Keter. Furthermore, there are a few SCPs that apparently have been assigned a finer classification. For example, SCP-66 is classified as “euclid-impetus” and SCP-625 as “euclid-flecto”. Because the majority of the SCPs are not classified this way, I plan on only using the coarse label. The truly unexpected labels are the following:

  • None (48)
  • Thaumiel (179, 378)
  • Neutralized (356, 407, 541, 696, 821, 818)
  • Scarf (586)

For the neutralized ones, a few have a previously assigned label, such as SCP-818. I could take the former label into account but since we are only talking about a handful of data points here, I plan on ignoring them altogether. The “Scarf” one is interesting. Apparently, the underlying SCP causes writers to make typos when writing about it. I suppose that the real label should be “Safe”. The SCP belonging to the “None” label seems to be a placeholder. There are also a few (expected) labels with a leading colon, for instance for SCP-75. Apparently, this is caused by the colon not being inside the strong element. This can be fixed without too much hassle, so let’s do it right now.

  • Fixing the “leading colon label” bug

    First, let’s write a test reproducing the bug by copying our test_split method and moving the colon behind the “Object class” out of the strong element:

    [sourcecode language="python" title="Listing 33. A testcase for the leading colon label bug. " ]
    def test_split_with_leading_colon(self):
        test_content = BeautifulSoup(
            """
            <p>Some caption</p>
            <p><strong>Item #:</strong> SCP-xxx</p>
            <p><strong>Object Class</strong>: Keter</p>
            <p><strong>Special Containment Procedures:</strong></p>
            <p><strong>Description:</strong></p>
            <p>Other…</p>
            """,
            features="html.parser",
        )
        actual_label, actual_content = split_into_text_and_label(test_content)
        expected_label = "KETER"
        expected_content = [
            "<p><strong>Item #:</strong> SCP-xxx</p>",
            "<p><strong>Special Containment Procedures:</strong></p>",
            "<p><strong>Description:</strong></p>",
            "<p>Other…</p>",
        ]
        self.assertEqual(expected_label, actual_label)
        self.assertEqual(expected_content, [str(p) for p in actual_content])
    [/sourcecode]

    To make the tests a little more diverse, I also changed the label from “Safe” to “Keter”. Running the tests should get you precisely one fail:

    [sourcecode language="text" title="Listing 34. RED: Test output. " ]
    AssertionError: 'KETER' != ': KETER'
    - KETER
    + : KETER
    [/sourcecode]

    The easy way to fix it would be to simply do a string replace on the label inside the split_into_text_and_label function:

    [sourcecode language="python" title="Listing 35. Fix for the leading colon label bug. " ]
    label = obj_class_p.contents[-1].strip().upper().replace(": ", "")
    [/sourcecode]

    Our tests should be green again. This reduced the unexpected label warnings to 41. We could also make the web crawler deal with labels such as “euclid-impetus” and only write the coarser label to the text file. However, I plan on leaving that to the data transformation blog post.
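
    Just to sketch what that could look like (this helper is not part of the crawler yet, and its name is made up), reducing a finer label to its coarse prefix is a one-liner:

    [sourcecode language="python" ]
    def coarse_label(label):
        # "euclid-impetus" -> "EUCLID"; plain labels pass through unchanged
        return label.split("-")[0].upper()
    [/sourcecode]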

Preparing for the next step: Editing make targets

The Data Science cookiecutter template defines several make targets that will be useful in the next blog post. Using the make command line utility allows us to execute quite complex command line scripts such as our web crawler with a simple API. Also, it lets us define dependencies such as “only run this task if this source code file changed.”

The make utility is configured via a Makefile. One is already present in the project and for instance defines a clean target that deletes all compiled Python code (that is, it deletes all __pycache__ directories and files ending in .pyc or .pyo). This clean target is executed via make clean. In the Makefile, let’s also add that log files should be cleaned up.

[sourcecode language="text" title="Listing 36. Cleaning up with make. " ]
## Delete all compiled Python files and log files
clean:
	find . -type f -name "*.py[co]" -delete
	find . -type f -name "*.log" -delete
	find . -type d -name "__pycache__" -delete
[/sourcecode]

Now, whenever you execute make clean, all log files will be deleted. Furthermore, we will add a new target (under “PROJECT RULES”) that will execute the web crawler.

[sourcecode language="text" title="Listing 37. A new target for the web crawler. " ]
data/raw: src/data/webcrawl.py
	$(PYTHON_INTERPRETER) src/data/webcrawl.py data/raw
[/sourcecode]

Note that this target has a dependency: it depends on the file src/data/webcrawl.py. What make does is the following: it compares the modification time of webcrawl.py with that of the directory data/raw. If webcrawl.py was changed more recently, it executes the commands listed under the target. Otherwise, it will tell you that the target is up to date.

Finally, we add the target data/raw as a dependency to the data target.

[sourcecode language="text" title="Listing 38. Adding a dependency to a target. " ]
## Make Dataset
data: requirements data/raw
	$(PYTHON_INTERPRETER) src/data/make_dataset.py data/raw data/processed
[/sourcecode]

The data target comes with the Data Science template. It will be implemented next time, when we deal with data preprocessing and transformations.

Exercises

  • Git Tag: ex-13   beginner

    The data/raw directory might not exist right after cloning the repository. Edit the data/raw target in the Makefile so that the directory is created if necessary.

  • Git Tag: ex-14   intermediate

    Add a new target logging_config.ini. Executing this target should copy the file logging_config_template.ini to logging_config.ini. Furthermore, add a new phony target, i.e. a target that does not correspond to a file name, called setup that does not execute any additional actions but depends on the targets logging_config.ini and create_environment.

Conclusion

We have come quite a long way in this blog post. In more detail, you have learnt how to:

  • Write code in a test-driven manner using pytest,
  • Set up logging using the builtin logging module,
  • Implement multi-threading using the builtin concurrent.futures module,
  • Use the requests library to issue a GET request,
  • Make use of the BeautifulSoup library to parse HTML, and
  • Read a Makefile and use make targets.

Hopefully, I was able to demonstrate how to use cookiecutter templates in general and, more specifically, how to use the Data Science template.

Further reading

I have linked to the documentation of the libraries we have used throughout. However, if you want to take an even deeper dive into some of these topics, I suggest the following.

  • Automate the Boring Stuff with Python by Al Sweigart is a beginner-friendly introduction to automation scripts with Python. It gives you step-by-step instructions for your scripts as well as further projects to work on. In particular, I would recommend the eleventh chapter Web Scraping for strengthening your understanding of web crawling and working with requests, BeautifulSoup and related libraries I have not mentioned in this blog post.
  • Effective Python by Brett Slatkin gives you an overview of best practices across a range of topics. In particular, I would recommend the fifth chapter, Concurrency and Parallelism, if you would like to strengthen your understanding of multi-threading and multi-processing.
  • The examples on how to work with make on the Cookiecutter Data Science page are helpful learning resources.

A Machine Learning Project: Classifying SCPs, Part 0, Overview

Tutorials on Machine Learning tend to emphasize the process of training models, interpreting their scores and tuning hyper-parameters. While these are certainly important tasks, most tutorials and documentation pages are not concerned with how the data is obtained; usually, a toy data set provided by the Machine Learning library is used.

In a series of at least three blog posts I would like to document a Machine Learning project of mine from start to finish: we will start from scratch, write a web crawler to obtain data, polish and transform it, and finally apply the Machine Learning techniques mentioned above.

This post will give you an overview of the techniques and tools that will be used. Feel free to jump directly to the follow-up posts and return to this one whenever you need instructions on a certain tool.

The task: Classifying SCPs

The heart of the website http://scp-wiki.wikidot.com/ consists of articles about fictional objects with abnormal behaviour, so-called SCPs (short for “Secure, Contain, Protect”). The fictional SCP foundation aims to contain these objects and, if possible, research their potential uses.

An example SCP article (scp0_example.png).

Each article follows a fixed format starting with the Item #. The second paragraph, the Object Class, is our prediction target. The following paragraphs, consisting of Special Containment Procedures, a Description, and, optionally, further information and addenda, are the data we will use to train a model upon.

A deeper look at the Object Class

As explained by the Object Classes guide, an Object Class is a “rough indicator for how difficult an object is to contain.” The five primary classes are Safe (easy to contain with next to no resources), Euclid (requires some resources to contain), Keter (difficult to contain even with a lot of resources), Thaumiel (used to contain other SCPs) and Neutralized (not anomalous anymore). While there are many Safe-class and Euclid-class SCPs, Thaumiel-class and neutralized SCPs seem to be very rare. Chances are that a Machine Learning model won’t be able to learn what makes up a Thaumiel-class or a neutralized SCP. Consequently, we will concentrate on classifying SCPs as Safe, Euclid or Keter for starters and ignore the other Object Classes for now.

The tools

Of course, without an appropriate tool set this task would be hopeless. As the programming language Python provides us with a plethora of Machine Learning frameworks (Tensorflow, PyTorch, Scikit-Learn, …), using it will be a safe choice. But how do we organize our project? What libraries do we use? Fortunately, we are not the first human beings wondering about these issues. In the following, I will give you an overview of the tools we will use and try to explain my decision for them.

Project template: Data Science Cookiecutter

The Data Science Cookiecutter project template is our framework. It gives us an organized folder structure, separating data, data exploration, documentation and utility code. Furthermore, the template defines several make targets and provides instructions on how to extend them. By the way, if you have never heard of Cookiecutter templates before, you will surely find other useful templates in the Cookiecutter documentation. After installing cookiecutter via pip, using templates is a breeze: if you want to start a new Data Science project based on the template mentioned before, simply run the command

cookiecutter https://github.com/drivendata/cookiecutter-data-science

Environment and dependency management: Anaconda

Each Data Science project comes with its own demands and dependencies. To manage those, we will use the Anaconda distribution. While the associated command line tool, conda, may serve as a package manager and comes with several Data Science libraries pre-compiled, you can also use the standard python package manager pip from within a conda environment.
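
In case you have never set up a conda environment before, the following commands show the typical workflow; the environment name is just an example.

[sourcecode language="bash" ]
# create and activate a project-specific environment
conda create --name scp-classification python=3.7
conda activate scp-classification

# pip works as usual inside the activated environment
pip install -r requirements.txt
[/sourcecode]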

Data exploration: Jupyter Notebooks

Jupyter Notebooks are a wonderful way to combine data exploration code together with its documentation. Sharing them gives others the opportunity to run the code themselves and check the results. Jupyter notebooks support markdown, \(\LaTeX\), can install Python libraries on-the-fly via Magic commands and can be converted to other formats such as pdf or html. If you have never heard of them but know how to use Python, you have definitely missed something! The Python Data Science Handbook gives you a nice introduction to IPython that forms the base of Jupyter notebooks.

Web crawler: Requests and Beautiful Soup

For the web crawling task, the requests library gives us a quick way to obtain data from web servers. To parse the HTML, we will use Beautiful Soup.
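
To give you an idea of how little code this takes, here is a minimal sketch (assuming the article URLs follow the scp-<number> pattern); the actual crawler is developed in the follow-up post.

[sourcecode language="python" ]
import requests
from bs4 import BeautifulSoup

response = requests.get("http://scp-wiki.wikidot.com/scp-173")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title)  # the title tag parsed from the HTML
[/sourcecode]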

Transformations & Machine Learning: Scikit-Learn

The open source Machine Learning library Scikit-Learn has everything we need and is easy to use. There are plenty of examples on their page, along with explanations of the underlying concepts (of course, some math background will come in handy if you want to understand them fully).
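
Just to give a taste of what this will look like later in the series (the concrete transformation and model are not decided yet, so treat this as a sketch), a typical text classification pipeline in Scikit-Learn only takes a few lines:

[sourcecode language="python" ]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# turn raw article text into word counts, then feed them into a classifier
pipeline = Pipeline([
    ("vectorize", CountVectorizer()),
    ("classify", MultinomialNB()),
])
# pipeline.fit(train_texts, train_labels); pipeline.predict(new_texts)
[/sourcecode]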

Quickfix: ‘Push rejected’ when deploying a Python app on Heroku

Heroku is a platform that enables you to deploy your web application in a quick and painless manner, unless you stumble upon a ‘Push rejected’ error with next to no hint of how to resolve it.

Problem description

I stumbled upon this when trying to deploy a Flask app, but I’m pretty sure this will also happen with Django apps. Here’s what happened: when deploying to Heroku, you first have to initialize a git repository. For Python apps in particular, you also have to provide a requirements.txt that lists the dependencies of your application. This is pretty standard. What was new to me is that you also have to provide a runtime.txt that tells Heroku which Python version your app was written for. As naive as I was, I merely checked the Python version in my Anaconda environment, put python-3.7.4 in the file, committed and pushed to Heroku via git push heroku master. Here’s what I got back:

remote: Compressing source files... done.
remote: Building source:
remote:
remote: -----> Python app detected
remote:  !     Python has released a security update! Please consider upgrading to python-3.7.3
remote:        Learn More: https://devcenter.heroku.com/articles/python-runtimes
remote: -----> Installing python-3.7.4
remote: -----> Installing pip
remote: -----> Installing requirements with pip
remote:  !     Push rejected, failed to compile Python app.
remote:
remote:  !     Push failed
An example of a non-helpful build log on heroku.com

That was not helpful at all. Especially the hint to upgrade to python-3.7.3 confused me because my proposed runtime was even more recent! The build log did not reveal anything new either.

Solution

After being confused for about half an hour, at first suspecting that Heroku could not cope with my requirements, I finally found out that Heroku does not support python-3.7.4 as a runtime! In fact, Heroku only supports three specific Python runtimes at the moment. Thus, the solution is to simply use python-3.7.3 as the runtime.
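
For reference, a matching runtime.txt then consists of just this single line:

[sourcecode language="text" ]
python-3.7.3
[/sourcecode]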

The Bayes classifier

The Bayes classifier is a Data Scientist’s unbeatable arch enemy. When faced with the challenge of finding an optimal solution to a classification problem, it is the ultimate (theoretical!) benchmark for evaluating the performance of a classifier. The problem with the Bayes classifier is that it uses information that is generally not available to us in the real world. Therefore, it is more of a theoretical tool, showing that an optimal classifier always exists. It also shows that, given enough information about a problem, a simple idea can lead to the optimal solution.

This blog post was inspired by exercise 7 of chapter 3.5 of the book Understanding Machine Learning. Since this blog post contains some hints for solving this exercise, it is a good idea to try and solve it yourself first. If you get stuck, feel free to return to this blog post.

Means and medians revisited

The Bayes classifier is related to the ordinary median; since we are about to find out why the Bayes classifier is always optimal (given that we know the distribution of our data precisely), let us first recall what optimization problem the median solves. Since the proof will be somewhat computation-heavy, let us do the same for the mean first. This may be seen as an add-on to the blog post Means and medians, where a general theme about these two invariants has already emerged: while both serve similar purposes, using the median in theoretical contexts is always more involved.

Theorem. Let \(X\) be a real valued random variable.

  • The mean minimizes the \(L^2\)-distance to \(X\), i.e., it satisfies \[ \mathbb{E}[X] = \mathop{\mathrm{argmin}}_{c \in \mathbf{R}} \mathbb{E}\left[(X-c)^2\right]^{1/2}. \]
  • The median minimizes the \(L^1\)-distance to \(X\), i.e., it satisfies \[ \mathbb{M}[X] = \mathop{\mathrm{argmin}}_{c \in \mathbf{R}} \mathbb{E}[|X-c|], \] where \(|\cdot|\) denotes the absolute value.

The theorem tells us that both the mean and the median are real numbers that are closest to the random variable; however, they are closest in different measures.
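
Before diving into the proof, a quick numerical sanity check (not part of the argument, just a sketch using numpy) illustrates the two minimizers:

[sourcecode language="python" ]
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=100_000)  # a skewed sample, so mean and median differ

candidates = np.linspace(0.0, 5.0, 2001)
l2 = [np.sqrt(np.mean((x - c) ** 2)) for c in candidates]
l1 = [np.mean(np.abs(x - c)) for c in candidates]

print(candidates[np.argmin(l2)], np.mean(x))    # both close to 1, the mean
print(candidates[np.argmin(l1)], np.median(x))  # both close to ln(2), the median
[/sourcecode]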

A short digression on different distance metrics

In case you are still wondering what the different notions of distance are supposed to mean, the following should have you covered. There is an intrinsic need for different distance metrics: while a plane may travel the direct path from one point of the world to another, a car has to stick to roads and a train is bound to rails. So what exactly is a distance? Or, more precisely, what are the essential features that all of the different distance metrics share? This is where mathematicians get excited: extracting similarities from a priori different entities and finding an appropriate abstraction that unites them is one of our main selling points. Here is what mathematicians came up with a long time ago:

A distance metric is a function assigning a real value (the distance) to any two points in space such that

  1. the distance from any point to itself is zero,
  2. the distance from any point \(a\) to another \(b\) is the same as the distance from \(b\) to \(a\) (distances should be symmetric), and
  3. the distance from any point \(a\) to another \(b\) is always smaller than or equal to the sum of the distances from \(a\) to yet another point \(c\) and from \(c\) to \(b\) (taking a detour never shortens the trip).

A related measure is length. Imagine a two-dimensional coordinate system. If we use the usual picture of vectors as arrows in that coordinate system, then we may assign a real value to each vector by measuring the length of that arrow. This yields the so-called Euclidean norm or \(2\)-norm. The formula for the length of the vector pointing from the origin to the point \((x,y)\) is \(\|(x,y)\|_2 := \sqrt{x^2 + y^2}\). We obtain a distance metric by postulating that this length is the distance from the point \((x,y)\) to the origin \((0,0)\). There are many other kinds of lengths; one of many examples is the \(1\)-norm defined by \(\|(x,y)\|_1 := |x| + |y|\). The distance metric induced by this norm is the distance when you are only allowed to move horizontally or vertically across two-dimensional space.

The difference between the 1-norm and the 2-norm.
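
For example, the vector \((3,4)\) has \(\|(3,4)\|_2 = \sqrt{3^2 + 4^2} = 5\) but \(\|(3,4)\|_1 = 3 + 4 = 7\): if you may only move horizontally and vertically, the trip is longer than along the direct diagonal.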

Let us turn to (discrete) random variables and how to make sense of the \(L^1\)- and \(L^2\)-distances. As you might suspect by now, they are related to the \(1\)- and \(2\)-norms, respectively. Let us assume that we are given a random variable \(X\) attaining two values \(x\) and \(y\). Let the probability that \(X = x\) be \(p\) so that \(P(X=y) = 1-p\). By the explicit formula for the mean, we see \[ \mathbb{E}[|X|] = p|x| + (1-p)|y| = \|(px, (1-p)y)\|_1\] and, likewise, \[ \mathbb{E}[X^2]^{1/2} = \sqrt{px^2 + (1-p)y^2} = \|(\sqrt{p}x, \sqrt{1-p}y)\|_2.\] This means that the \(L^1\)-distance is a probability-weighted version of the \(1\)-norm, and likewise the \(L^2\)-distance is a probability-weighted version of the \(2\)-norm. This might explain why the mean is simpler to use in theoretical contexts or why it might feel more natural (especially given its simple definition): it comes from minimizing a distance metric induced by the natural Euclidean distance.

The proof

For the mean

Let us tackle the mean first. There is a pretty straightforward proof that uses calculus: first notice that, because the square root is a monotonically increasing function, it suffices to show that the mean minimizes \(c \mapsto \mathbb{E}\left[(X-c)^2\right]\). By linearity of the expected value, this is a quadratic polynomial in \(c\), so it attains precisely one minimum. By calculus, computing that minimum is now just a matter of taking the derivative and solving a linear equation. However, I want to give you a proof that does not even need calculus. We merely need an equation that you might know as the third binomial formula: \(a^2 - b^2 = (a+b)(a-b)\). Here is how it goes. Getting rid of the square root is done by the same argument as before. Then, for any two real values \(c\) and \(d\), we compute

\begin{align*}
\mathbb{E}[(X-c)^2] - \mathbb{E}[(X-d)^2] &= \mathbb{E}[(X-c)^2 - (X-d)^2] \\
&= \mathbb{E}[(X-c + X-d)(X-c-X+d)] \\
&= \mathbb{E}[(2X-c-d)(d-c)] \\
&= (2\mathbb{E}[X]-c-d)(d-c).
\end{align*}

Here, we have used linearity of the expected value in the first step, the third binomial formula with \(a = X-c\) and \(b = X-d\) in the second and linearity of the expected value in the last as well as the fact that the expected value of a constant random variable (such as \(c\) and \(d\)) equals that constant. This chain of equations holds for any two real values. This means it particularly holds for \(d = \mathbb{E}[X]\). Let’s plug this in:

\begin{align*}
\mathbb{E}[(X-c)^2] - \mathbb{E}[(X-\mathbb{E}[X])^2] &= (2\mathbb{E}[X]-c-\mathbb{E}[X])(\mathbb{E}[X]-c) \\
&= (\mathbb{E}[X]-c)^2 \geq 0,
\end{align*}

as the square of any real number is non-negative. Reformulating, this means \[ \mathbb{E}[(X-c)^2] \geq \mathbb{E}[(X-\mathbb{E}[X])^2] \] holds for any real value \(c\) showing that \(\mathbb{E}[X]\) minimizes \(L^2\)-distance to \(X\).

For the median

While the idea for showing that the median minimizes \(L^1\)-distance is the same as in the proof above for the mean, be warned that this part is a lot more finicky. Because the absolute value is not differentiable, a straightforward proof using calculus is not available either. We will start like before by computing \(\mathbb{E}[|X-d|] - \mathbb{E}[|X-c|]\) for any two real values \(c\) and \(d\). One technicality I have to introduce to you is the so-called indicator function of a set \(A\), which I will denote by \(\chi_A\). This is a function that is \(1\) on \(A\) and \(0\) anywhere else. We will use it to deal with certain case distinctions in a concise way. For starters, let us assume \(d > c\) and compute

\begin{align*}
|X-d| - |X-c| &= \chi_{\{X \leq c\}}(|X-d| - |X-c|) + \chi_{\{X > c\}}(|X-d| - |X-c|) \\
&= \chi_{\{X \leq c\}}(d-X-c+X) + \chi_{\{X > c\}}(|X-d| - X + c) \\
&= \chi_{\{X \leq c\}}(d-c) + \chi_{\{d > X > c\}}(d-X - X + c) + \chi_{\{X \geq d\}}(X-d - X + c) \\
&= \chi_{\{X \leq c\}}(d-c) + \chi_{\{d > X > c\}}(d-2X + c) + \chi_{\{X \geq d\}}(-d+c) \\
&\geq \chi_{\{X \leq c\}}(d-c) + \chi_{\{d > X > c\}}(-d+c) + \chi_{\{X \geq d\}}(-d+c) \\
&= \chi_{\{X \leq c\}}(d-c) + \chi_{\{X > c\}}(-d+c).
\end{align*}

Here, we first split up the computation into the cases \(X \leq c\) and \(X > c\), getting rid of some of the absolute values. The next step was to split up the second case into the cases \(d > X > c\) and \(X \geq d\). In the case \(d > X > c\), we used \(X < d\) to estimate \(-2X > -2d\). After that, we simply merged the last two cases back together. Now, applying the expected value to this inequality and using monotonicity and linearity, we end up with

\begin{align*}
\mathbb{E}[|X-d|] - \mathbb{E}[|X-c|] &\geq \mathbb{E}[\chi_{\{X \leq c\}}](d-c) + \mathbb{E}[\chi_{\{X > c\}}](-d+c) \\
&= (d-c)\left(P(X \leq c) - P(X > c)\right) \\
&= (d-c)\left(2P(X \leq c) - 1\right).
\end{align*}

In the case \(d < c\), we may do a very similar computation by splitting up on \(X \geq c\) and \(X < c\). In this case, we end up with \[\mathbb{E}[|X-d|] - \mathbb{E}[|X-c|] \geq (c-d)\left(2P(X \geq c) - 1\right).\] What was this good for? If we plug in \(c = \mathbb{M}[X]\) in either case, we end up with \[\mathbb{E}[|X-d|] - \mathbb{E}[|X-\mathbb{M}[X]|] \geq 0\] because \(P(X \geq \mathbb{M}[X]) \geq 1/2\) and \(P(X \leq \mathbb{M}[X]) \geq 1/2\). This shows that the median minimizes the \(L^1\)-distance.

Classification problems

Let us use probability theory to model the following setup. We are given a bunch of data samples, each labeled (classified) as either positive (1) or negative (0). Given an unlabeled data sample, we are tasked with giving a good estimate of whether it should be labeled as positive or negative.

An example

We are given news articles and are tasked with finding out whether they are about sports or not. In this case, our data samples consist of word count vectors generated from those articles (this means that for each article we create a table recording how often each word appears in the article). Our positive data samples will be the sports articles.

Even for this simple example, we see that the labeling and the information in the data sample depend on each other: for instance, an article containing sports-related words (‘football’, ‘competition’, …) is much more likely to be, in fact, a sports article, while an article labeled as a sports article is much more likely to contain sports-related words than the average article.

A classical inference pipeline: the article is transformed into a word count vector, which is fed into a classifier.

Furthermore, this is a classical example of an inference pipeline. For data to become interpretable by a classifier, it has to be transformed appropriately first. In this case, the raw data might be given by the article, its headlines, and sub headlines. The raw data will then be transformed into a word count vector that in turn is fed into a classifier that predicts whether the article is about sports or not.

Modelling classification problems

Since we do not know anything else about the problem, we might as well assume that the data samples are given to us at random, adhering to a probability distribution \(P\). More precisely, given a set \(S\) as our sample value space, we think of the data samples as an \(S\)-valued random variable \(X\) and of the labeling as a \(\{0,1\}\)-valued random variable \(Y\). In general, \(X\) and \(Y\) will very much depend on each other, as we saw in the preceding example: the values of \(X\) could be word count vectors while \(Y\) could be the labeling of the underlying article as being sports-related or not. We could model the bunch of data samples as a set; however, it is possible that there are duplicate data samples, even with the same label. Since we want to study the underlying probability distribution, the occurrence of such duplicate data samples could carry precious information. Thus, we opt for modeling the given data samples as an element of \((S \times \{0,1\})^m\), with \(m\) being the number of data samples obtained. Given such a tuple, we are tasked with constructing a classifier, i.e. a function \[f\colon S \to \{0,1\}.\]

Loss functions

How do we know if we have actually found a good classifier? What could good mean in this context? The idea of a loss function is to penalize each false prediction of the classifier. The penalty could be a constant number, or it could depend on whether a positive or a negative sample has been falsely predicted. For the simple task at hand, the easiest choice is the empirical risk.

Definition. Given an \(m\)-tuple of labeled data samples \(D\), the empirical risk of a classifier \(f\) is given by \[ L_{\mathrm{ER},D}(f) := \frac{|\{(x,y) \in D \mid f(x) \neq y \}|}{m}. \]

In other words, the empirical risk of a classifier computes the fraction of falsely predicted data samples. This means that the classifier is penalized by \(1/m\) for each false prediction. While this might be a good start for evaluating the performance of a classifier, concentrating on minimizing empirical risk alone is prone to overfitting, i.e. producing classifiers that perform well on the data samples obtained but fail miserably on new data.
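
In code, the empirical risk is a one-liner. The sketch below assumes the labeled samples are given as a list of \((x, y)\) pairs and the classifier is an ordinary Python function.

[sourcecode language="python" ]
def empirical_risk(classifier, samples):
    # fraction of samples whose label is predicted incorrectly
    return sum(classifier(x) != y for x, y in samples) / len(samples)
[/sourcecode]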

What we should be more interested in, even though it will not be computable in real-world examples, is the true risk of a classifier.

Definition. If \(P\) is the probability distribution of the tuple of data samples and labelings \((X,Y)\), then the true risk of a classifier \(f\) is given by \[ L_P(f) := P(\{(x,y) \mid f(x) \neq y\}).\]

In other words, the true risk of a classifier penalizes false predictions of labeled data \((x,y)\) by the true probability of them occurring. In comparison, the empirical risk tries to approximate this true probability by the relative frequency of the labeled data sample in our sample set. If our sample set is large enough, this should be a good approximation; however, if we are out of luck, our sample set could contain labeled data samples with a much higher frequency than they usually appear with in the wild. This leads to some interesting approaches to dealing with classification problems by minimizing the true risk for all probability distributions simultaneously. These approaches allow for a slim margin of error depending on the size of the sample set. But this is for another blog post.

The Bayes classifier

What would we do if we knew the underlying probability distribution \(P\)? Given an unlabeled data sample \(x \in S\), we could compute the probability \(P(Y = 1 \mid X = x)\) of its label being positive. This is a conditional probability and may be read as the probability of \(Y\) being \(1\) given that \(X = x\). If this probability is at least \(1/2\), we should place our bet on the label being positive. Otherwise, it should be negative. This simple assignment is the Bayes classifier.

Definition. For the probability distribution \(P\) of \((X,Y)\), the Bayes classifier is the classifier given by \[ f_P\colon x \mapsto \begin{cases} 1 & P(Y=1 \mid X = x) \geq 1/2 \\ 0 & \text{else}\end{cases}.\]
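
As a toy illustration (the numbers are made up), suppose the sample space is a handful of keywords and we know the conditional probabilities \(P(Y = 1 \mid X = x)\) exactly. The Bayes classifier is then nothing but a threshold at \(1/2\):

[sourcecode language="python" ]
# made-up conditional probabilities P(Y = 1 | X = x)
p_positive = {"football": 0.9, "election": 0.2, "marathon": 0.7, "opera": 0.1}

def bayes_classifier(x):
    # predict the positive label whenever the conditional probability is at least 1/2
    return 1 if p_positive[x] >= 0.5 else 0

print([bayes_classifier(x) for x in ("football", "election", "marathon", "opera")])
# [1, 0, 1, 0]
[/sourcecode]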

An idea, no matter how simple, might yield a good result. In this case, in fact, the simple idea underlying the Bayes classifier yields the optimal result.

Theorem. The Bayes classifier minimizes the true risk. More precisely, given any other classifier \(f\colon S \to \{0,1\}\), the inequality \[ L_P(f) \geq L_P(f_P) \] holds.

Note that even though the Bayes classifier minimizes the true risk, its empirical risk could be arbitrarily high. Indeed: Consider a sample set consisting of one data sample that is labeled as positive. Assume that the probability of that data sample being labeled positively is smaller than \(1/2\). This one data sample is then mislabeled by the Bayes classifier yielding an empirical risk of \(1\).

The connection to the median

Before diving into the proof and to prepare for it, let me explain the connection between the Bayes classifier and the median of a random variable. In fact, the value \(f_P(x)\) for any unlabeled data sample \(x\) may be interpreted as the conditional median of the random variable \(Y\) given \(X = x\). Here’s what this means: The value of the Bayes classifier satisfies

\begin{align*}
P(Y \geq f_P(x) \mid X = x) &\geq 1/2 \text{ and} \\
P(Y \leq f_P(x) \mid X = x) &\geq 1/2.
\end{align*}

Indeed, if \(f_P(x) = 1\) then \[P(Y \geq f_P(x) \mid X = x) = P(Y = 1 \mid X = x) \geq 1/2\] and \[P(Y \leq f_P(x) \mid X = x) = P(Y \in \{0,1\} \mid X = x) = 1 \geq 1/2.\] The proof for \(f_P(x) = 0\) is similar.

A sketch for the proof

As the proof that the Bayes classifier minimizes the true risk uses conditional expectations, which I have not yet talked about, I will only give a sketch consisting of hints for experts here. Also, this is the solution of exercise 7 in chapter 3.5 of Understanding Machine Learning; feel free to read it if you are stuck, but I suggest getting your hands dirty and giving the exercise a good try first.

Ready? The first step is to rephrase the true risk as \[ L_P(f) = \mathbb{E}[|Y - f(X)|]. \] Then, we may use one of the central properties of conditional expectations: the expected value of a conditional expectation of a random variable is the same as the expected value of the random variable itself. This yields \[ L_P(f) = \mathbb{E}\left[\mathbb{E}[|Y - f(X)| \mid X]\right]. \] Now we are almost done. By monotonicity of the expected value, it suffices to minimize \(\mathbb{E}[|Y - f(X)| \mid X]\) pointwise. And this is precisely what the Bayes classifier does by the preceding paragraph: for each \(x \in S\), \(\mathbb{E}[|Y - f(x)| \mid X = x]\) is minimal if \(f(x)\) is the conditional median of \(Y\) given \(X = x\), which is the value \(f_P(x)\) of the Bayes classifier.

Further reading