Classifying SCPs, Part 2: Data transformation (TF-IDF) and preprocessing

After we have obtained data through the means described in the first part of this blog post series, it is time to deal with data transformations and data preprocessing. While humans can comprehend textual information in the form of articles, it is hard to digest for a Machine Learning algorithm. In this blog post, we will transform the textual information into vectors that assign a number to each word in the vocabulary of the set of articles: this is what TF-IDF (Term Frequency – Inverse Document Frequency) is all about.

In comparison to the web crawler post, this one is more mathematical in nature. Instead of evaluating technical approaches and executing them in a test-driven manner, we will have to understand the mathematical background behind the algorithm to put it to good use.

To make the transition as gentle as possible, let us do a warm-up that is closer to the technical spirit of the last blog post: We use the text files produced by the web crawler to extract the lengths of certain paragraphs and investigate how these help us determine the Object Class of an article.

After we have understood how TF-IDF works, we can use it to transform our articles into TF-IDF vectors. Consequently, we will already be able to extract keywords from each article.

Warm-up: Extracting the paragraph lengths

To structure the text from the articles, let us use a custom class.

Basic version of the Article class

class Article(object):
    def __init__(self, label, name, procedures, desc):
        self.label = label.strip()
        self.name = name.strip()
        self.procedures = procedures.strip()
        self.desc = desc.strip()

The logic that splits up the text from the text files into attributes of the class will be a classmethod that accepts a list of lines of text and returns a readily constructed Article instance.

class Article(object):
    # __init__ omitted...

    @classmethod
    def from_text(cls, lines):
        label, *text = lines
        text = "".join(text)
        name, rest = text.split("Special Containment Procedures:")
        procedures, desc = rest.split("Description:")
        return cls(label, name, procedures, desc)

Here’s a basic test that shows how to use the classmethod to obtain an Article instance.

from src.data.article_data import Article


def test_from_text():
    procedures = [
        "Special Containment Procedures: Something...",
        "Something part two...",
    ]
    description = "Description: Something else..."
    article = Article.from_text(["SAFE", "Some name   ", *procedures, description])
    assert article.label == "SAFE"
    assert article.name == "Some name"
    assert article.procedures == "Something...Something part two..."
    assert article.desc == "Something else..."

Validation of the label through a @property

As mentioned in the first part of this series, we are only concentrating on the labels SAFE, EUCLID and KETER. To account for this, we need to validate that the incoming label is one of those. We are a little more lenient and also accept labels that merely start with one of those three labels.

Let us write tests first to define the desired behavior.

import pytest
from src.data.article_data import Article


@pytest.fixture
def article():
    return Article("SAFE", "", "", "")


@pytest.mark.parametrize("label", ["SAFE", "EUCLID", "KETER"])
def test_set_regular_label(article, label):
    article.label = label
    assert article.label == label
    article.label = label + "SOMETHING"
    assert article.label == label


def test_set_unknown_label(article):
    with pytest.raises(ValueError) as excinfo:
        article.label = "unknown"
    assert "unknown" in str(excinfo)

In the tests above, we are using a fixture that gives us an initialized Article instance. Then, we are defining the regular behavior of the setter (we are expecting the label to accept the three main object classes as well as labels that start with those) and what happens when the setter encounters an unknown label (we are expecting a ValueError, enforced via the raises helper).

Because we have not written any validation for the label attribute yet, the tests fail. To account for these kinds of validations, Python has @property decorators that allow for custom getter and setter methods.

class Article(object):

    ALLOWED_LABELS = ("SAFE", "EUCLID", "KETER")

    # __init__ omitted...

    @property
    def label(self):
        return self._label

    @label.setter
    def label(self, orig_label):
        labels = [
            label for label in self.ALLOWED_LABELS if orig_label.startswith(label)
        ]
        if not labels:
            raise ValueError(f"Unknown label '{orig_label}'!")
        self._label = labels.pop()

The Python interpreter calls the method decorated with @label.setter as soon as it encounters the line self.label = label in the __init__ method. As a result, code that uses this class has to deal with ValueErrors when constructing instances.
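
For illustration, a hypothetical caller could guard the construction like this (the label and texts here are made up):

try:
    article = Article("THAUMIEL", "SCP-001", "...", "...")
except ValueError as error:
    # The setter raised, e.g.: "Unknown label 'THAUMIEL'!"
    print(f"Skipping article: {error}")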

Adding a to_dict method

While the Article class is responsible for extracting information from the articles, it is much easier to use a simple dictionary when persisting the extracted information. That is because the json library can serialize Python dictionaries directly; additionally, the pandas Data Science library is able to use dictionaries to construct its main object: a DataFrame. As a result, we need to write a to_dict method that turns an Article instance into a plain dictionary. Aside from the four attributes of the Article class, we also require the dictionary to contain the (character) lengths of the Procedures and the Description as well as the ratio of these two lengths.
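
To see why dictionaries are so convenient here, consider this tiny illustration (the record values are made up):

import json
import pandas as pd

records = [{"Label": "SAFE", "Procedures_Length": 412, "Description_Length": 1837}]
json.dumps(records)    # dictionaries serialize to JSON without any extra work
pd.DataFrame(records)  # ...and pandas builds a DataFrame directly from them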

def test_to_dict_trivial_article(article):
    d = article.to_dict()
    assert "Label" in d
    assert d["Label"] == "SAFE"
    assert "Name" in d
    assert "Procedures" in d
    assert "Description" in d
    assert "Procedures_Length" in d
    assert d["Procedures_Length"] == 0
    assert "Description_Length" in d
    assert d["Description_Length"] == 0
    assert "Procedures_Description_Ratio" in d
    assert d["Procedures_Description_Ratio"] == 0


def test_to_dict(article):
    article.name = "Test"
    article.procedures = "TestTest"
    article.desc = "Test"
    d = article.to_dict()
    assert "Label" in d
    assert d["Label"] == "SAFE"
    assert "Name" in d
    assert d["Name"] == "Test"
    assert "Procedures" in d
    assert d["Procedures"] == "TestTest"
    assert "Description" in d
    assert d["Description"] == "Test"
    assert "Procedures_Length" in d
    assert d["Procedures_Length"] == 8
    assert "Description_Length" in d
    assert d["Description_Length"] == 4
    assert "Procedures_Description_Ratio" in d
    assert d["Procedures_Description_Ratio"] == 2

The implementation is straightforward: we simply build and return a dictionary literal.

    def to_dict(self):
        return {
            "Label": self.label,
            "Name": self.name,
            "Procedures": self.procedures,
            "Description": self.desc,
            "Procedures_Length": len(self.procedures),
            "Description_Length": len(self.desc),
            "Procedures_Description_Ratio": len(self.procedures) / len(self.desc)
            if len(self.desc) > 0
            else 0,
        }

Using the Article class to process the txt files

Finally, we want to use the Article class to process the text files. More precisely, we would like to aggregate the articles into a pandas DataFrame. This object has a to_json method that allows us to persist it for later introspection.

First, let us write a test to pin down our expectations.

import pandas as pd
from click.testing import CliRunner
from src.data.make_dataset import main

TEST_DATA = {
    "scp-002.txt": """EUCLID\n
Item #: 002\n
Special Containment Procedures: Something something...\n
Description: Something else...\n
""",
    "scp-003.txt": """UNKNOWN\n
Item #: 003\n
Special Containment Procedures: Something something...\n
Description: Something else...\n
""",
}


def test_main():
    runner = CliRunner()
    with runner.isolated_filesystem():
        for filename, text in TEST_DATA.items():
            with open(filename, "w") as f:
                f.write(text)
        result = runner.invoke(main, [".", "."])
        assert result.exit_code == 0
        df = pd.read_json("data.json")
        assert len(df.index) == 1
        data = df.loc[0]
        assert "Label" in data
        assert data["Label"] == "EUCLID"
        assert "Name" in data
        assert data["Name"] == "Item #: 002"
        assert "Procedures" in data
        assert data["Procedures"] == "Something something..."
        assert "Description" in data
        assert data["Description"] == "Something else..."

Here, we are using the dictionary TEST_DATA to write two files with two mock articles. The first is a regular article with a valid object class, the second one is an article we do not wish to process. As a result, we expect that only one article is present in the processed data. Note that we are using pandas’ read_json method to obtain a DataFrame and, in turn, we are using DataFrame methods to ensure that only one article is present and that the article data has been split up correctly.

To make this test pass we have to implement the following strategy:

  1. Start with an empty DataFrame.
  2. Parse each text file in the data/raw folder and turn it into an Article.
  3. Use Article’s to_dict method to append the data to the DataFrame.
  4. Persist the DataFrame to a json file in the data/processed folder.

Here’s a sketch implementation that uses the glob method of pathlib’s Path objects to iterate over the text files.

import pandas as pd
from pathlib import Path
from src.data.article_data import Article


df = pd.DataFrame({})
for file in Path("data/raw").glob("scp-*.txt"):
    with file.open() as f:
        article = Article.from_text(f.readlines())
    df = df.append(article.to_dict(), ignore_index=True)
df.to_json("data/processed/data.json")

From a software design perspective, this code leaves a lot to be desired. First, there are no log messages that will come in handy when things are off. Second, the paths are hard-coded and should be replaced by function parameters. Third and last, the Article’s classmethod from_text throws a ValueError each time it encounters an article with an unknown object class. We have to deal with this kind of situation without letting the entire script fail.

Here’s a revision of the sketch.

import click
import logging.config
import pandas as pd
from pathlib import Path
from src.data.article_data import Article


PROJECT_DIR = Path(__file__).resolve().parents[2]
logging.config.fileConfig(PROJECT_DIR / "logging_config.ini")
logger = logging.getLogger(__name__)


@click.command()
@click.argument("input_filepath", type=click.Path(exists=True))
@click.argument("output_filepath", type=click.Path())
def main(input_filepath, output_filepath):
    """ Runs data processing scripts to turn raw data from (../raw) into
        cleaned data ready to be analyzed (saved in ../processed).
    """
    logger.info("making final data set from raw data")
    df = pd.DataFrame({})
    for file in Path(input_filepath).glob("scp-*.txt"):
        logger.info("File: %s", file)
        with file.open() as f:
            try:
                article = Article.from_text(f.readlines())
            except ValueError as e:
                logger.warning("ValueError in file %s: %s", file, e)
                continue
        df = df.append(article.to_dict(), ignore_index=True)
    logger.info("DataFrame extracted. Writing to data.json in %s", output_filepath)
    df.to_json(Path(output_filepath) / "data.json")
    logger.info("Done.")


if __name__ == "__main__":
    main()

Note that we are emitting log warning messages whenever we encounter an unknown label but still continue with the processing.

Exercises

Just like in the last blog post, you can rapidly tackle an exercise by using git tags. For instance, if you want to tackle the first exercise, issue the command git checkout ex-15 and start coding. If you want to compare your solution to mine, issue git diff sol-ex-15 when you have finished.

Git tag: ex-15   beginner

Add a __repr__ method to the Article class.

Git tag: ex-16   beginner

Add another test for the from_text method. The input shall be the same as in the test_from_text method except that you will leave out the name (“Some name ”). Consequently, assert that the name of the resulting Article instance will be an empty string.

Git tag: ex-17   intermediate

Unfortunately, there are some SCP articles that slightly diverge from the usual naming convention for their parts. For instance, SCP-524 has a Special Containment Procedure (singular!), SCP-2944 has Secure Containment Procedures, and SCP-931 consists of haiku paragraphs. While we could certainly be a little more thorough when parsing them, I will ignore them for the rest of this blog post series (I have encountered 130 warnings when parsing the first 3000 SCP articles, so fewer than 5% of the articles are parsed incorrectly). However, if you want to, feel free to optimize the parsing procedure. For starters, allow for the “Description” part to start with either “Description:” or “Summary:”. Do not forget to write tests!

Git tag: ex-18   intermediate

Raise RuntimeErrors whenever the Containment Procedures or the Description cannot be extracted. Catch these RuntimeErrors in make_dataset.py, log the error and continue with the for loop without adding the article to the DataFrame. Finally, add another test article in test_make_dataset.py with an unexpected beginning of the description and tests to test_article_data.py to make sure these RuntimeErrors are actually raised.

Quick analysis of the lengths

After we have extracted the (character) lengths of the two parts of the SCP articles, let us analyze them. We will use pandas to load the json file and compute some basic statistical measures.

Open Jupyter Lab (either by opening a terminal and issuing the command jupyter lab, or by opening Anaconda, switching to the environment for the SCP project and opening Jupyter Lab there), navigate to the notebooks folder of the SCP project and click the "+" icon above the folder breadcrumbs to fire up a new launcher.

[Screenshot: the Jupyter Lab launcher]

In the Launcher tab that opens, choose a Python 3 Notebook. Now you are all set up to experiment with data interactively. The following is a transcript of my Jupyter notebook session.

Computing statistics of the lengths

We want to check that all the transformations we have done so far are sane so that we can work with a cleaned up dataset.

  import pandas as pd

  df = pd.read_json("../data/processed/data.json")
  df.head()
Table 1: Out[1]
  Description Description_Length Label Name Procedures Procedures_Description_Ratio Procedures_Length
0 SCP-1256 is a 24-page pamphlet entitled ’Bees … 1837 SAFE Item #: SCP-1256 Mobile Task Force Zeta-4 (’Beekeepers’) is cur… 0.224279 412
1 SCP-2987 is a modified MSI brand external hard… 2187 SAFE Item #: SCP-2987 SCP-2987 is to be kept on floor 17 of Site-88…. 0.203475 445
2 SCP-2039 collectively refers to two distinct f… 5399 EUCLID Item #: SCP-2039 Presently, Foundation efforts at Research Faci… 0.368772 1991
3 SCP-1530 is a two-story abandoned house locate… 3893 EUCLID Item #: SCP-1530 SCP-1530 is currently contained 120 meters fro… 0.201387 784
4 SCP-1524 is the sole remaining specimen of a s… 3211 EUCLID Item #: SCP-1524 Both of SCP-1524’s individual components are t… 0.530364 1703

Let’s look at some statistics of the extracted text lengths and the ratio.

  df.describe()
Table 2: Out[2]
  Description_Length Procedures_Description_Ratio Procedures_Length
count 2700.000000 2700.000000 2700.000000
mean 3208.542222 0.286840 777.595556
std 1658.345674 0.293568 519.808074
min 61.000000 0.000000 0.000000
25% 2104.750000 0.145726 414.750000
50% 2887.000000 0.229935 656.500000
75% 3957.000000 0.353646 994.250000
max 31618.000000 7.377049 7922.000000

Whereas count, mean, min and max are self-explanatory, std stands for standard deviation. The rows with percentages are the 25%-, 50%-, and 75%-quantiles, respectively. They were defined in my Blog post on means and medians. Here’s a short refresher: The 25%-quantile is a value such that 25% of the data is smaller than or equal to it and the other 75% of the data is greater than or equal to it. The 50%-quantile is also known as the median.
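
If you want to double-check these numbers yourself, pandas can compute the quantiles directly, for example:

  df["Procedures_Length"].quantile([0.25, 0.5, 0.75])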

The minimum of 61 characters in Description_Length looks reasonable but a Containment Procedure with 0 characters? This has to be investigated. Before we do so, let us look at the same statistics but grouped by each label.

  df.groupby("Label").describe().stack()
Table 3: Out[3]
    Description_Length Procedures_Description_Ratio Procedures_Length
Label        
EUCLID count 1274.000000 1274.000000 1274.000000
  mean 3244.361852 0.308139 855.422292
  std 1701.660229 0.273383 529.896660
  min 428.000000 0.011165 148.000000
  25% 2179.250000 0.169438 497.250000
  50% 2935.500000 0.259065 727.000000
  75% 3977.750000 0.371186 1075.750000
  max 31618.000000 6.051948 7922.000000
KETER count 314.000000 314.000000 314.000000
  mean 3380.487261 0.401208 1128.343949
  std 1694.007237 0.328462 605.260134
  min 233.000000 0.000000 0.000000
  25% 2243.000000 0.218239 683.250000
  50% 3197.500000 0.332694 1028.000000
  75% 4192.250000 0.486212 1449.750000
  max 10141.000000 3.781726 3449.000000
SAFE count 1112.000000 1112.000000 1112.000000
  mean 3118.951439 0.230143 589.388489
  std 1592.721215 0.293088 392.807626
  min 61.000000 0.010626 64.000000
  25% 2003.000000 0.118879 326.000000
  50% 2791.500000 0.178353 488.500000
  75% 3860.750000 0.277565 730.500000
  max 12331.000000 7.377049 3680.000000

This is where it starts to get interesting! As safe SCPs are much easier to contain than euclid ones, which in turn are easier to contain than keter SCPs, we expect the Containment Procedures to be easy to describe for safe SCPs and to need more elaborate descriptions for keter ones. This is reflected in the mean length of the Containment Procedures (589 for safe, 855 for euclid and 1128 for keter).

Let us turn to the problematic cases of zero lengths.

  df.loc[(df["Procedures_Length"] == 0) | (df["Description_Length"] == 0)]
Table 4: Out[4]
  Description Description_Length Label Name Procedures Procedures_Description_Ratio Procedures_Length
1340 SCP-1994 is the general designation for a set … 1376 KETER Item #: SCP-1994   0.0 0

Thankfully, this is a single outlier. Investigating the article on the SCP Foundation web page and inspecting the HTML reveals that the label “Special Containment Procedures” sits in its own p element, which is why we were not able to crawl this article correctly.

Let us ignore the outlier.

  df = df.loc[df["Procedures_Length"] > 0]

Finally, let us compute correlations between our features and the target. The correlation coefficient may be computed for number-valued random variables only. Fortunately, the nominal labels safe, euclid, and keter carry ordinal information. That is to say, we can order them by their containment complexity. To make this even more explicit, let us assign numbers to the three labels. A safe label will be converted to -1, a euclid label to 0 and a keter label to 1, so that the order of the containment complexity is reflected by \(\mathrm{safe} < \mathrm{euclid} < \mathrm{keter}\). However, the magnitude of this conversion is still open for discussion. We could also have chosen \(10^{100}\) for keter and this would have influenced the correlation coefficients. But let’s stick to our simple way of converting for now.

  COMPLEXITY = {
      "SAFE": -1,
      "EUCLID": 0,
      "KETER": 1
  }

  def compute_complexity(label):
      return COMPLEXITY[label]

  df["Complexity"] = df["Label"].apply(compute_complexity)
  df.corr()
Table 5: Out[6]
  Description_Length Procedures_Description_Ratio Procedures_Length Complexity
Description_Length 1.000000 -0.293831 0.220675 0.052532
Procedures_Description_Ratio -0.293831 1.000000 0.577548 0.188953
Procedures_Length 0.220675 0.577548 1.000000 0.344329
Complexity 0.052532 0.188953 0.344329 1.000000

As it turns out, Complexity and Procedures_Length are positively correlated, which is precisely what we have observed through the statistics that we have grouped by label. We also see that Description_Length is only very weakly correlated with Complexity: That is to say, there is no reason why, say, a safe SCP should not have a long description or why a keter SCP could not be described in a short manner.

Mathematical background behind TF-IDF

Before we get to enjoy the ease of use of sklearn’s Transformer API to apply the TF-IDF transformation, let’s try and get some understanding of it first. The easiest way to turn words into numbers is to count them. As simple as this sounds, this idea is the cornerstone of the TF-IDF transformation.

Word count vectors

Let me make this more precise. Assume that we have articles \(A_1, \dotsc, A_n\). The vocabulary \(\mathcal{V} = \mathcal{V}_{A_1, \dotsc, A_n}\) is the set of unique words occurring in those articles. The set of all of our articles will also be called the document. For any article \(A\), the word count function is \[ \mathrm{wc}_{A}\colon \mathcal{V} \to \mathbb{N}, \, w \mapsto \text{Number of times \(w\) occurs in \(A\)}. \]

Once we have fixed the vocabulary, it is possible to turn the word count functions into word count vectors. First, we merely need to decide on an order of the vocabulary—the alphabetic ordering is a canonical choice! As soon as we have ordered the vocabulary, we may write it as \(\mathcal{V} = \{w_1, \dotsc, w_m\}\), where \(m\) is the total number of words in the vocabulary. Finally, we declare that the word count vector of the article \(A\) is \[ v_A = \bigl(\mathrm{wc}_A(w_1), \dotsc, \mathrm{wc}_A(w_m)\bigr). \]
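
As a minimal sketch (the function name and the toy text are mine, not part of the project code), a word count vector can be computed with collections.Counter:

from collections import Counter

def word_count_vector(text, vocabulary):
    # Count how often each vocabulary word occurs in the text,
    # using the alphabetic ordering of the vocabulary.
    counts = Counter(text.lower().split())
    return [counts[word] for word in sorted(vocabulary)]

vocabulary = {"the", "cat", "sat", "on", "mat"}
word_count_vector("The cat sat on the mat", vocabulary)
# [1, 1, 1, 1, 2]  ->  cat, mat, on, sat, the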

Normalizing word counts

Instead of only talking about single words, we can more generally deal with terms. These can also be sequences of words that appear in our documents. Depending on how long those sequences are, they may carry information about the context that cannot be inferred from single words alone. Consequently, the word count vectors from before may more generally be called term count vectors (even though this does not seem to be standard language).

In general, vectors have the advantage of being comparable by using distance measures such as the Euclidean distance. Depending on the variety of articles and the precise application, however, there might be a problem. Let me illustrate this with a simple artificial example. Take the sentence “I am sure I am right about this” and transform it into a word count vector. Using alphabetic ordering, you should end up with the following.

word about am I right sure this
count 1 2 2 1 1 1

Let us add a second “article” to our document. The text consists of that same sentence twice. Thus, we obtain

word about am I right sure this
count 2 4 4 2 2 2

as its word count vector.

If you want to consider these two articles as similar or even the same, you will need to normalize the vectors. This means that before comparing two word count vectors you will divide them by their length. In this case, this approach will lead to the two articles being seen as the same. However, there are also reasons for wanting to tell these two articles apart: Even if they deal with the same point, the second one puts stronger emphasis on it by repetition.
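
Here is a quick numerical check of that claim, assuming we normalize with the Euclidean norm (a sketch using numpy, not part of the project code):

import numpy as np

v1 = np.array([1, 2, 2, 1, 1, 1])  # "I am sure I am right about this"
v2 = 2 * v1                        # the same sentence twice
# After dividing both vectors by their lengths, they coincide.
np.allclose(v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2))  # True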

To sum it up, depending on the application you might want to think about normalization of your word count vectors. In any case, the resulting word count vectors will be called term frequencies (even if you did not normalize) from now on. This concludes the first half of the TF-IDF transformation.

Inverse document frequency

Term frequency vectors suffer from another problem: Words such as “a”, “the”, and “is” occur in almost every English article and carry little meaning of their own, yet they still influence the similarity of articles. There are two ways to deal with this problem (in fact, they are often used in conjunction). The first is to simply ignore those words: Create a list of so-called stop words that will be ignored when building the vocabulary. The second way is to penalize words occurring in almost every article and boost rare words. This is precisely what the inverse document frequency is doing.

Before we arrive at the precise definition, let us look at it from another angle. A word count is a measure that is local to a single article. This means that it does not depend on other articles in our document. If a single word count is high in that article, this might mean that this is an important word that potentially helps characterize the article. However, if this word count is equally high in all the other articles, then this word does not help us tell this article apart from the others (if everything is special then nothing is). Thus, there is the need for a trade-off between local measures (such as a single word count of a certain article) and global measures.

Inverse document frequency is such a global measure. To define it, let us concentrate on a single word \(w\) in our vocabulary. The document frequency \(\mathrm{df}(w)\) is the number of articles that \(w\) appears in. Consequently, the inverse document frequency is \(1/\mathrm{df}(w)\).

Now we are able to describe the TF-IDF transformation as a whole: For any article \(A\) in the document, multiply each word count in its word count vector with the inverse document frequency. In formulae: \[ \mathrm{TFIDF}(A) = \left(\frac{\mathrm{wc}_A(w_1)}{\mathrm{df}(w_1)}, \dotsc, \frac{\mathrm{wc}_A(w_m)}{\mathrm{df}(w_m)}\right). \]
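
The following sketch implements exactly this simplified definition on a toy document (the real articles will be handled by sklearn below):

from collections import Counter

articles = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog barked",
]
vocabulary = sorted({word for article in articles for word in article.split()})

# Document frequency: in how many articles does each word appear?
doc_freq = {w: sum(w in article.split() for article in articles) for w in vocabulary}

def tfidf(article):
    wc = Counter(article.split())
    return [wc[w] / doc_freq[w] for w in vocabulary]

tfidf(articles[0])
# e.g. "the" occurs twice in this article and appears in two articles overall,
# so its entry is 2/2 = 1.0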

Applying the TF-IDF transformation (Transcript of jupyter notebook session)

Before we apply the TF-IDF transformation, we have to put aside some test data for evaluating our model later. Otherwise, the transformation (and any Machine Learning model built on top of it) would have access to statistics of the entire dataset and could indirectly pick up statistics of the test dataset. However, the entire purpose of the train-test split is to evaluate the model on data it has not seen before.

  import pandas as pd

  df = pd.read_json("../data/processed/data.json")
  df = df.loc[df["Procedures_Length"] > 0, [
      "Label",
      "Procedures",
      "Description",
      "Procedures_Length",
      "Description_Length",
      "Procedures_Description_Ratio"
  ]]

Making a train-test-split

With sklearn, splitting a DataFrame reduces to calling the train_test_split function from the model_selection module. The test_size argument determines the relative size of the test set.

  from sklearn.model_selection import train_test_split

  X, y = df.drop(columns=["Label"]), df["Label"]
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Note that we split up our target column Label from the rest so that it will not be included in the following transformations.

Fitting TfidfVectorizers

Since we have two text columns (Procedures and Description), it is best to fit two TfidfVectorizers so that the information contained in each of them is preserved separately. The rest of the features should be scaled, as certain models run into numerical problems when two features live on very different scales (that is to say, one feature usually is very large, e.g. \(\gg 10^6\), while another only attains values between 0 and 1). To do all of this in one go, sklearn provides us with a ColumnTransformer that takes a list of tuples, each consisting of a name, a transformer, and the column that the transformer should be applied to. Additionally, the ColumnTransformer’s remainder keyword argument may be another transformer that will be applied to the remaining columns. Here’s how to use it:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.compose import ColumnTransformer
  from sklearn.preprocessing import StandardScaler


  columnwise_tfidf = ColumnTransformer(
      [
          (
              "procedures",
              TfidfVectorizer(),
              "Procedures"
          ),
          (
              "desc",
              TfidfVectorizer(),
              "Description"
          )
      ],
      remainder=StandardScaler(),
      n_jobs=-1,
  )

First, the first item in each tuple is a name for the transformation for later reference. Second, the TfidfVectorizer with default arguments constructs the TF-IDF vectors in almost the same way as explained in the mathematical background above. One difference is that the document frequency of each word is increased by one to prevent zero divisions; in addition, sklearn uses a logarithmically scaled inverse document frequency and normalizes the resulting vectors by default. Third and last, the StandardScaler scales the remaining features such that they have zero mean and unit standard deviation.
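
If you are curious about the exact formula, here is a small check against the fitted vectorizer’s idf_ attribute (the toy documents are made up; with the default smooth_idf=True, sklearn uses a smoothed, logarithmic inverse document frequency):

  import numpy as np
  from sklearn.feature_extraction.text import TfidfVectorizer

  docs = ["the cat sat", "the dog sat on the mat", "the cat ran"]
  vec = TfidfVectorizer()  # smooth_idf=True and norm="l2" are the defaults
  vec.fit(docs)

  # Smoothed idf: ln((1 + n_docs) / (1 + df(t))) + 1
  n_docs = len(docs)
  doc_freq_cat = sum("cat" in doc.split() for doc in docs)
  manual_idf = np.log((1 + n_docs) / (1 + doc_freq_cat)) + 1
  np.isclose(manual_idf, vec.idf_[vec.vocabulary_["cat"]])  # True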

Applying this ColumnTransformer to our train set follows the usual sklearn API. Each Transformer has fit and transform methods. Here, the first is used solely on the train set to fit the Transformer. Afterwards, the second may be used to transform both the train and test set.

  columnwise_tfidf.fit(X_train)
  X_train_transformed = columnwise_tfidf.transform(X_train)

Conveniently, most transformers have a fit_transform method that combines these two steps into one:

  X_train_transformed = columnwise_tfidf.fit_transform(X_train)

Extracting keywords

Let us use the fitted transformers to extract keywords from articles. First, we will extract the vocabulary as determined by the TfidfVectorizers. To distinguish between the words from the Procedures and the Description, we will prepend each of them with a prefix.

  def vocabulary():
      return (
          [f"proc__{name}" for name in columnwise_tfidf.named_transformers_["procedures"].get_feature_names()]
          + [f"desc__{name}" for name in columnwise_tfidf.named_transformers_["desc"].get_feature_names()]
      )

Note that the names we have provided for the TfidfVectorizers earlier now come into play.

Second, let’s write a function accepting an article and returning a DataFrame containing the words with the highest frequencies.

  def extract_keywords(article, topn=10):
      article_transformed = columnwise_tfidf.transform(article).toarray()[0]
      frequencies = list(zip(vocabulary(), article_transformed))
      frequencies.sort(key=lambda x: -x[1])
      return pd.DataFrame(frequencies[:topn])

Finally, let’s extract keywords from one of the most iconic SCP articles: The one for SCP-682. This is one of the best examples of Keter class SCPs.

  scp_682 = df.loc[df["Description"].str.startswith("SCP-682")].drop(columns=["Label"])
  extract_keywords(scp_682)
Table 6: Out[8]
  0 1
0 proc__682 0.767357
1 desc__kia 0.738121
2 desc__682 0.523255
3 desc__agent 0.171312
4 desc__personnel 0.156161
5 proc__speak 0.153737
6 proc__acid 0.144138
7 proc__to 0.133515
8 desc__pvt 0.110179
9 proc__scp 0.107281

This does not look too promising. First, maybe numbers should be ignored. Second, words like “to” and “of” appear in almost every English article. “speak” might also not be telling us much. This will only get worse if we look at the top 30 keywords.

  extract_keywords(scp_682, topn=30)
Table 7: Out[9]
  0 1
0 proc__682 0.767357
1 desc__kia 0.738121
2 desc__682 0.523255
3 desc__agent 0.171312
4 desc__personnel 0.156161
5 proc__speak 0.153737
6 proc__acid 0.144138
7 proc__to 0.133515
8 desc__pvt 0.110179
9 proc__scp 0.107281
10 desc__handled 0.106319
11 proc__attempts 0.098297
12 proc__reacted 0.095920
13 desc__occurrence 0.095232
14 proc__incapacitation 0.091120
15 proc__of 0.090828
16 proc__fear 0.087715
17 proc__rage 0.087715
18 proc__hydrochloric 0.085073
19 proc__massive 0.085073
20 proc__frequent 0.082915
21 proc__provoking 0.082915
22 proc__breach 0.082463
23 desc__scp 0.081648
24 proc__should 0.080923
25 proc__lining 0.079510
26 proc__called 0.078116
27 proc__incapacitated 0.078116
28 proc__force 0.078011
29 proc__destroying 0.076869

Fine-tuning the TfidfVectorizer

Fortunately, TfidfVectorizer has a lot of options to fine-tune its behavior. First and maybe most importantly, we can enforce that certain words should be ignored via the stop_words keyword argument. It either expects the string “english”, in which case it uses a list constructed by the sklearn developers (with its own set of disadvantages), or it expects a list of strings containing the words that shall be ignored. Second, we can specify a regex pattern via the token_pattern keyword argument. This pattern will be used when parsing the articles to build up the vocabulary. The default pattern matches single words made up of letters and numbers; we will modify it to only match words consisting of letters.

  columnwise_tfidf = ColumnTransformer(
      [
          (
              "procedures",
              TfidfVectorizer(
                  stop_words="english",
                  strip_accents='unicode',
                  token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b',
              ),
              "Procedures"
          ),
          (
              "desc",
              TfidfVectorizer(
                  stop_words="english",
                  strip_accents='unicode',
                  token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b'
              ),
              "Description"
          )
      ],
      remainder=StandardScaler()
  )

  columnwise_tfidf.fit(X_train)

Listing 1. Out[10].

  ColumnTransformer(n_jobs=None,
                    remainder=StandardScaler(copy=True, with_mean=True,
                                             with_std=True),
                    sparse_threshold=0.3, transformer_weights=None,
                    transformers=[('procedures',
                                   TfidfVectorizer(analyzer='word', binary=False,
                                                   decode_error='strict',
                                                   dtype=<class 'numpy.float64'>,
                                                   encoding='utf-8',
                                                   input='content',
                                                   lowercase=True, max_df=1.0,
                                                   max_features=None, min_df=1...
                                                   dtype=<class 'numpy.float64'>,
                                                   encoding='utf-8',
                                                   input='content',
                                                   lowercase=True, max_df=1.0,
                                                   max_features=None, min_df=1,
                                                   ngram_range=(1, 1), norm='l2',
                                                   preprocessor=None,
                                                   smooth_idf=True,
                                                   stop_words='english',
                                                   strip_accents='unicode',
                                                   sublinear_tf=False,
                                                   token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b',
                                                   tokenizer=None, use_idf=True,
                                                   vocabulary=None),
                                   'Description')],
                    verbose=False)

  extract_keywords(scp_682, topn=30)
Table 8: Out[11]
  0 1
0 desc__kia 0.890278
1 proc__speak 0.272335
2 proc__acid 0.255331
3 desc__agent 0.206627
4 proc__scp 0.190041
5 desc__personnel 0.188352
6 proc__attempts 0.174127
7 proc__reacted 0.169915
8 proc__incapacitation 0.161413
9 proc__fear 0.155381
10 proc__rage 0.155381
11 proc__hydrochloric 0.150702
12 proc__massive 0.150702
13 proc__frequent 0.146879
14 proc__provoking 0.146879
15 proc__breach 0.146078
16 proc__lining 0.140847
17 proc__called 0.138377
18 proc__incapacitated 0.138377
19 proc__force 0.138192
20 proc__destroying 0.136168
21 proc__containment 0.135959
22 desc__pvt 0.132891
23 proc__difficulty 0.132345
24 proc__submerged 0.132345
25 proc__best 0.130666
26 desc__handled 0.128236
27 proc__chamber 0.126861
28 proc__plate 0.125041
29 proc__development 0.123843

This looks much better. A few remarks:

  • I had to google for the two abbreviations “kia” and “pvt”. The first is the abbreviation for “killed in action” while the second stands for the military rank “Private”.
  • On second thought, “speak” may contain the information that the SCP object is able to speak and thus might hint at it being sapient. As sapient SCPs are probably more likely to be of class euclid or keter, this could be valuable information for a Machine Learning model.
  • One could start building a custom list of stop words more suitable for parsing SCP articles; in the list above, the words “best” and “called” as well as “scp” could be ignored (see the sketch below for how such a list could be passed). I will postpone this to the next part of this series of posts: Because some models give insight into their learning process, we can use them to see whether their decisions are based on filler words.
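
As a sketch of how such a custom list could be passed (the extra words here are just the candidates mentioned above; sklearn ships its English list as ENGLISH_STOP_WORDS):

  from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

  # Candidate extra stop words for SCP articles; the exact list is up for experimentation.
  scp_stop_words = {"scp", "best", "called"}

  custom_vectorizer = TfidfVectorizer(
      stop_words=list(ENGLISH_STOP_WORDS | scp_stop_words),
      strip_accents="unicode",
      token_pattern="(?u)\\b[a-zA-Z][a-zA-Z]+\\b",
  )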

Conclusion

In this blog post, we have learned how to use Jupyter Notebooks and the pandas library to extract basic statistics from SCP articles. Furthermore, we have used a basic TF-IDF transformation to extract keywords from SCP articles.

Quickfix: jupyter nbconvert with clear-output flag not working

Jupyter comes with a command line utility, jupyter nbconvert, that can transform a Jupyter notebook into different formats. Furthermore, it has options to tweak the output. For instance, the --execute flag executes the notebook before transforming it, while the --clear-output flag is supposed to remove all outputs from the notebook. Thus, if you want to execute the notebook notebook.ipynb and transform it into a pdf file afterwards, you can issue the command

jupyter nbconvert notebook.ipynb --execute --to pdf

I stumbled upon the following problem when I tried to create a make target for the SCP project. The make target should do the following: It should clear the output of a specified notebook, execute it and then export it as a pdf file.

Problem description

Contrary to its purpose, the --clear-output flag does not remove the output from any notebook. Suppose your Jupyter notebook is notebook.ipynb. Then,

jupyter nbconvert --clear-output notebook.ipynb

merely saves the file notebook.ipynb again. The output remains in the notebook.

Solution

Unfortunately, this still seems to be an open issue. However, there is a more specific version of the command available that does exactly what we want:

jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace notebook.ipynb

It is not clear to me what the current status of the issue is; in fact, there are recent commits in other projects referencing the issue and the issue itself is labeled as upstream. Apparently, a dependency (traitlets) is causing this bug.

Classifying SCPs, Part 1: Building a web crawler

This first blog post in the series of our Data Science project of classifying SCPs is concerned with the starting point of any data-related problem, really: How do we obtain data? Fortunately, the data we want to build a Machine Learning model upon is readily available in the form of articles with standardized URLs. Thus, the task of obtaining the data comes down to writing a web crawler that parses the HTML underlying the articles and extracts the label (i.e., the Object Class) as well as the text for us.

In the following, I would like to describe my approach to this problem. Also, I will leave behind some exercises for you if you have not dealt with this kind of problem before and want to put some thought into it.

If you want to dive into the code right away and play around with the data, I have created a GitHub repository with the project code.

The post should be accessible to beginning programmers with basic knowledge of Python.

About exercises

As a mathematician, I think that you can only truly learn and understand a concept when you try it out yourself: Think of the simplest circumstance where the concept is applicable and work out the nitty-gritty details. In the process, you will have definitely learnt some of the pitfalls and specialities; others will only appear with time, when you apply the new concept to more complex environments. Likewise, I think that programming concepts can only be truly grasped when they are applied. The exercises I have created might help you with this. Some are easy refactorings solvable with standard Python constructs; others will require you to use another library and read (part of) its documentation. Both are invaluable skills as a programmer: Solving problems quickly with the programming language of your choice and trying to incorporate new libraries by reading their documentation.

How to solve the exercises

In case you want to do the exercises and quickly get to the point where you are able to tackle one, I suggest cloning the git repository and using the tags I have created precisely for this purpose. The exercise tags take the form ex-<number>, e.g. ex-1 for the first exercise. Thus, you can simply check out the commit tagged with the exercise you want to tackle and start coding. For instance, if you want to tackle the first exercise, git checkout ex-1 will get you there. After you’re done, you can compare your solution with mine by issuing git diff sol-ex-1.

Note that my solution is merely a suggestion. If you have found another one that you think might be more appropriate for the problem, feel free to leave a comment or open up an issue on github.

Also, I have provided difficulty tags (beginner, intermediate and expert) for the exercises. The beginner difficulty features exercises that hopefully will help you learn programming language features; this may consist of reading about a programming construct in the python docs and changing a keyword argument in a function. Intermediate difficulty signifies that you need to read about a certain feature in a library before being able to solve them. Finally, expert level exercises will require even more reading about library features as well as use some advanced concepts that cannot be fully explained in the text (this blog post contains one expert exercise that requires you to do some research about mocking in Python).

Do not hesitate to try out intermediate or expert level exercises even if you still feel like a beginner. Even if you are not able to solve them completely, there is much to learn from them.

Setting up our Data Science project

Before we start this project proper, we first have to lay out our project structure. As announced in the overview blog post, we will use a Cookiecutter template for this purpose. First things first: If you have not installed cookiecutter yet, a simple

pip install cookiecutter

will do. It is a good idea to install cookiecutter globally. After installing cookiecutter, we will use the Data Science cookiecutter template by simply issuing the command

cookiecutter https://github.com/drivendata/cookiecutter-data-science

You will be asked a few questions about the project. If you’re not sure how to answer, hitting enter will provide a sensible default (for instance, we don’t care about setting up S3 for now).

project_name [project_name]: SCP
repo_name [scp]:
author_name [Your name (or your organization/company/team)]: Paul Grillenberger
description [A short description of the project.]: Classifying SCP articles
Select open_source_license:
1 - MIT
2 - BSD-3-Clause
3 - No license file
Choose from 1, 2, 3 (1, 2, 3) [1]:
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]:
aws_profile [default]:
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 (1, 2) [1]:

After this, a folder with the project data files has been created. It is good practice to put it under version control and create an initial commit immediately, for example via

  git init && git add -A && git commit -m "Initial Commit"

To deal with dependencies for our project, we need to create a new conda environment. Fortunately, the provided Makefile works with conda out of the box! Simply issue make create_environment to, you guessed it, create an environment. Afterwards, you need to use conda activate <environment> to switch to the freshly created environment. To install the requirements, a simple make requirements will do (note that I’ve added some additional requirements in the repository, so be sure to add those as well if you’re starting from scratch). Now we are all set up.
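
In short, the environment setup boils down to the following commands (replace <environment> with the name of your project environment):

make create_environment
conda activate <environment>
make requirements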

Our main points of interest in this folder structure will be the folders src/data/ (for data-related source code) and data/ (where the actual raw and processed data will be placed). Explore the repository and read the README.md to get a feeling for the structure. When you’re ready, we can start to work on our project.

Rough introspection

To get an idea of what we are going to crawl, let us take a look at the SCP Foundation website in our browser. We are interested in the first one thousand SCPs, so we take a look at the first, say, five to get a rough idea of what each page looks like.

[Screenshot: the list of all SCP articles]

I know that number one is a special SCP that strays far from the format. The other four look similar in terms of their format, though.

[Screenshot: the article of SCP-001, which is actually a collection of articles that are competing suggestions]
[Screenshot: an example of a regular SCP article, in this case SCP-005]

It’s best to skip SCP-001 because of its exceptional nature. For the others, we will take a deeper look at the HTML structure now (because that’s what the web crawler libraries see).

Detailed introspection

In Chrome on my Mac, I hit Command-Option-I to fire up the web developer console. Switching to the tab “Elements” yields the HTML source code of the current page. Hovering over a line of code shows what it corresponds to on the rendered browser page. Using this, we quickly find out that the content is inside a div element with the id page-content. However, most of its children are wrapped up in p elements with no distinguishing attributes. A typical layout seems to look like this:

<div id="page-content">
  <div style="text-align: right;">...</div>
  <div class="scp-image-block block-right" style="width:300px;">
    <img src="http://scp-wiki.wdfiles.com/local--files/scp-005/SCP-005.jpg" style="width:300px;" alt="SCP-005.jpg" class="image"/>
    <div class="scp-image-caption" style="width:300px;">
      <p>A close up of SCP-005</p>
    </div>
  </div>
  <p><strong>Item #:</strong> ...</p>
  <p><strong>Object Class:</strong> ...</p>
  <p><strong>Special Containment Procedures:</strong> ...</p>
  <p><strong>Description:</strong> ...</p>
  <p><strong>Addendum / Additional notes / ...:</strong> ...</p>
  <div class="footer-wikiwalk-nav">
    <div style="text-align: center;">
      <p></p>
    </div>
  </div>
</div>

Writing the web crawler

The detailed introspection suggests the following approach: For each page, find all direct child p elements. Then, get rid of the HTML. The line starting with “Object Class” contains the target label. The following text contains the data we want to predict upon.

Constructing the URL from the SCP number

Let’s say that we want to crawl the article text from each SCP whose numbers are between 2 and 2500. Then the first task is to write a small function accepting a number and spitting out the URL of the corresponding SCP article. Taking a look at the URLs for SCPs #1, #11, #111, and #1111, we see that the URL format is

http://scp-wiki.net/scp-<number>

where the number is padded with leading zeros so that it takes up at least three digits. Because I like to proceed test-driven, I create two files in the folder src/data/: a file webcrawl.py for the actual source code and a file test_webcrawl.py for tests. In webcrawl.py, let us create a prototype of our function:

def construct_url(scp_number):
    pass

In test_webcrawl.py, we create a prototype test to get us started:

from .webcrawl import construct_url


def test_construct_url():
    assert False

From the command line, issue the command pytest. As expected, pytest complains that one of our tests fails (in this case, of course, for trivial reasons):

    def test_construct_url():
>       assert False
E       assert False

src/data/test_webcrawl.py:5: AssertionError

 src/data/test_webcrawl.py ⨯                                  100% ██████████

Results (0.14s):
       1 failed
     - src/data/test_webcrawl.py:4 test_construct_url

Okay, this means that our setup works. Now let us put some real assertions depending on our yet-to-write functions in there:

def test_construct_url():
    assert "http://scp-wiki.net/scp-001" == construct_url(1)
    assert "http://scp-wiki.net/scp-011" == construct_url(11)
    assert "http://scp-wiki.net/scp-111" == construct_url(111)
    assert "http://scp-wiki.net/scp-1111" == construct_url(1111)

This time, pytest complains because our function does not do what we expect it to do yet:

    def test_construct_url():
>       assert "http://scp-wiki.net/scp-001" == construct_url(1)
E       AssertionError: assert 'http://scp-wiki.net/scp-001' == None
E        +  where None = construct_url(1)

src/data/test_webcrawl.py:5: AssertionError

 src/data/test_webcrawl.py ⨯                                  100% ██████████

Results (0.09s):
       1 failed
         - src/data/test_webcrawl.py:4 test_construct_url

In test-driven development, this means we are in phase “RED” now: We have written a test that tells us exactly when we have established our desired functionality. Our target is to get to phase “GREEN” as quickly as possible. That means we can finally write some code. To pad a given integer with zeros to at least three digits, we can use elementary Python string formatting:

BASE_URL = "http://scp-wiki.net/"
SCP_ROUTE_TEMPLATE = "scp-{number:03d}"


def construct_url(scp_number):
    return BASE_URL + SCP_ROUTE_TEMPLATE.format(number=scp_number)

Running pytest afterwards tells us that our one test has passed. We are in phase “GREEN” now. We can now safely refactor our code until we are satisfied with it. Whenever we make changes and let the tests run, we can be confident that our code still works as expected. Sometimes, this is called the “REFACTOR” phase of TDD. I will leave this phase to you in the following exercises.

Exercises

  • Git Tag: ex-1   beginner

    Get rid of the global variables BASE_URL and SCP_ROUTE_TEMPLATE and use f-Strings to refactor construct_url. Be sure to let the tests run afterwards to see if you still get the desired outcome.

  • Git Tag: ex-2   beginner intermediate

    In my opinion, it is perfectly fine to violate the DRY (Don’t repeat yourself) principle when writing tests to keep them simple. However, pytest provides some decorators that help us generate test cases when we simply want to check on certain function outputs with varying inputs. Use the pytest.mark.parametrize decorator to basically turn our test into a one-liner.

Filtering for the page content

After having constructed the URL, the logical next step would be to use it and request the data from the server. Fortunately, the requests library solves this issue for us. A simple call to requests.get will do. Even so, we do not need all of the information that a call to requests.get returns (we do not need the header data from the response, nor the HTML head of the page…). Thus, our task will be to use the BeautifulSoup library to extract everything within the div element with the id “page-content”. To test if we obtain the correct data, let us first write a main function that will serve as the entry point to our script.

import argparse
import requests

# construct_url as before...

def crawl_for(scp_number):
    url = construct_url(scp_number)
    response = requests.get(url)
    content = response.text
    return content

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--number",
        type=int,
        dest="scp_number",
        default=2,
        help="Number of the SCP article to obtain.",
    )
    args = parser.parse_args()
    print(crawl_for(args.scp_number))

A call to the get function of the requests library returns a Response object whose content can be retrieved via the text attribute. If the script is called via the command line, we use the argparse module to parse command line arguments. In this case, we accept an optional argument --number that defaults to 2. If you call webcrawl.py from the command line now, the whole HTML of SCP #2 should be printed out. However, as mentioned in the introduction, we are only interested in the children of a certain div element.
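
For example, assuming you run the script from the project root, fetching the article of SCP #5 would look like this:

python src/data/webcrawl.py --number 5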

To go on in a test-driven manner, we wrap a prototype function around the response.text and write a test for it.

def filter_for_page_content(page):
    pass


def crawl_for(scp_number):
    url = construct_url(scp_number)
    response = requests.get(url)
    content = filter_for_page_content(response.text)
    return content

In test_webcrawl.py, we import the new function and add a test for it:

from .webcrawl import construct_url, filter_for_page_content


TEST_PAGE = """
    <html>
    <head>
    <title>Some scp</title>
    </head>
    <body>
    <div id="page-content">
        <p>Some paragraph.</p>
    </div></body></html>
    """

# test_construct_url omitted...

def test_filter_for_page_content():
    expected = """
<div id="page-content">
<p>Some paragraph.</p>
</div>
    """.strip()
    actual = str(filter_for_page_content(TEST_PAGE))
    assert expected == actual

Of course, this is only a basic test, but it is sufficient for our purposes. More concretely, we want to make sure that the function extracts precisely the content from the HTML that we care about, namely the div element with the id page-content and its children. Because we have not written any code yet, pytest should signal that we are in phase RED again. Now BeautifulSoup enters the picture. The main entry point to web crawling is the BeautifulSoup object of the bs4 module. Its constructor takes the raw content. The resulting instance has a find method that can be used to find the first element with a certain name and attributes – and that’s precisely the functionality we need! The function implementation comes down to this:

# Other imports...
from bs4 import BeautifulSoup

# construct_url omitted...

def filter_for_page_content(page):
    return BeautifulSoup(page).find(name="div", attrs={"id": "page-content"})

Our tests should pass again. However, pytest gives us a few warnings that we will deal with in the exercises.

Exercises

  • Git Tag: ex-3   beginner

    Add the features keyword argument to the constructor of the BeautifulSoup object and assign the value "html.parser" to it to shut down the warnings. Read the doc-string of the BeautifulSoup object to find out why this may be useful. Note that you might still encounter another warning concerning the import of “ABCs” from collections instead of from collections.abc. At the time of this writing, this seems to be an issue with the BeautifulSoup library itself that we can do nothing about.

  • Git Tag: ex-4   intermediate

    Use the click library instead of argparse to parse the command line arguments. In the repository, the file src/data/make_dataset.py contains a good template if you get stuck. Note that you may have to move the print statement to the crawl_for function and use the echo() function instead.

Splitting the filtered content into the label and the text

After having completed two straightforward tasks, let us come to the heart of our problem. We have extracted the main part of an SCP article and want to split it into the object class of the underlying SCP and the article text. Before we think about a solution to this problem, let us implement a prototype function.

def split_into_label_and_text(raw_text):
    pass

Then, let us write a test. Because this might not be straightforward, let me spell out my thoughts here. The typical input of the split_into_label_and_text function is a BeautifulSoup object containing all children of the div element with the id page-content. In particular, this BeautifulSoup object might contain a div element containing an image and it contains a div element containing the footer with the links to the previous and the next SCP article. What I want the function to do is the following:

  1. It should return a tuple. The first element should be the label (i.e. the object class), the others should be the p elements containing the Object number, the Special Containment Procedures, the description, and, if present, any addenda.
  2. The label should be in upper case.
  3. The image and the footer should not be returned by the function.

    Having worked out these requirements, a simple test case is not hard to pull off. We can use the typical SCP HTML structure that we have worked out in the detailed introspection as a template and boil it down a little. Here’s what I came up with.

    from bs4 import BeautifulSoup
    # The other imports and function remain untouched...
    
    def test_split():
        test_content = BeautifulSoup(
            """
            <div class="image-content">
                <p>Some caption</p>
            </div>
            <p><strong>Item #:</strong> SCP-xxx</p>
            <p><strong>Object Class:</strong> Safe</p>
            <p><strong>Special Containment Procedures:</strong> ...</p>
            <p><strong>Description:</strong> ...</p>
            <p>Other...</p>
            <div class="footer">
                <p>Links to other SCPs...</p>
            </div>
            """,
            features="html.parser",
        )
        actual_label, actual_content = split_into_label_and_text(test_content)
        expected_label = "SAFE"
        expected_content = [
            "<p><strong>Item #:</strong> SCP-xxx</p>",
            "<p><strong>Special Containment Procedures:</strong> ...</p>",
            "<p><strong>Description:</strong> ...</p>",
            "<p>Other...</p>",
        ]
        assert expected_label == actual_label
        assert expected_content == [str(p) for p in actual_content]
    

    Note that the test_content contains both a div element containing an image and another div element mocking footer data. As you can see in the list expected_content, I do not want these to be returned by the function. As expected, this test will fail, simply because the returned None value cannot be unpacked into an actual_label and an actual_content.

    Unfortunately, BeautifulSoup cannot help us directly to implement this function because the object class is inside a p element without any distinguishing properties. The only safe way to obtain it is to search the text for the first occurrence of the string “Object Class”. Here’s my implementation.

    def split_into_label_and_text(raw_text):
        paragraphs = raw_text.find_all("p")
        obj_class_p = next(p for p in paragraphs if "Object Class" in p.get_text())
        paragraphs.remove(obj_class_p)
        label = obj_class_p.contents[-1].strip().upper()
        return label, paragraphs
    

    A lot is happening in those five lines so let me guide you through them step by step.

    1. We use the find_all method to obtain a list of all p elements.
    2. The expression p for p in paragraphs if "Object Class" in p.get_text() is a generator expression that lazily yields the elements of the paragraphs list satisfying the condition. The built-in function next() advances the generator once and thus gives us the first p element containing the string “Object Class”.
    3. We remove the p element containing the object class from the list.
    4. Finally, to obtain the label in upper case, we use the contents attribute – a list of the children of this p element – and take its last element (index -1). Because the string “Object Class” itself sits inside a strong element, the last child is the label text. The strip and upper methods are ordinary string methods.
    5. We return a tuple.

    This implementation still lets the tests fail. The reason is that we return all p elements as the second tuple element, even the mocked image caption and the footer data. The solution is to only look for the direct children that are p elements. This will be implemented in the exercises.

Exercises

  • Git Tag: ex-5   beginner

    Use the recursive argument of the find_all method to let the tests pass.

  • Git Tag: ex-6   beginner

    Update the crawl_for function so that it uses the freshly implemented split_into_label_and_text function and prints out the label and the paragraphs.

Writing the results to a text file

After we have obtained the label and the text of the article, we merely have to persist this data so that it can be analyzed and transformed later. The easiest way is to write each article to a text file, with the first line being the label.

Additionally, we will add a new click.argument to our command line script that allows us to specify a file path where the articles should be saved. If you have not done the refactoring exercises yet, the following code samples will contain spoilers.

Here’s how it could go.

@click.command()
@click.argument("filepath", type=click.Path(exists=True))
@click.option("--number", default=2, help="Number of the SCP article to obtain.")
def crawl_for(number, filepath):
    url = construct_url(number)
    response = requests.get(url)
    content = filter_for_page_content(response.text)
    label, paragraphs = split_into_label_and_text(content)
    with open(os.path.join(filepath, f"scp-{number:03d}.txt"), "w") as f:
        f.write(label + "\n")
        for paragraph in paragraphs:
            f.write(paragraph.get_text() + "\n")

Now everything falls into place. If you are in the root directory of the project repository, a simple python src/data/webcrawl.py data/raw/ will write the contents of the article about SCP-002 into the text file data/raw/scp-002.txt. Because we do not want to type in this command a thousand times, it remains to refactor the crawl_for function to accept a range of numbers of SCP articles whose contents should be crawled.

Exercises

If you are a fan of test-driven development, you might have wondered why I simply added the last lines of code without providing a test. Also, you might wonder why there are no tests for the crawl_for function. The reason is that both depend on external resources or libraries: The last lines of code depend on a writable directory, and the crawl_for function uses the requests library to fetch data from the internet. There are solutions for these kinds of problems, but they might distract a little from the main task, so I have decided to put them into exercises (ex-8 and ex-9).

  • Git Tag: ex-7   beginner

    Remove the --number option and supply two options --lower and --upper. These should be the lower and upper bounds of the numbers of SCP articles the command line script should fetch. Remember to provide sensible default values as well as a help text.

  • Git Tag: ex-8   intermediate

    Refactor the last three lines of the crawl_for function into another function that accepts a file object (in the crawl_for function, this is the f variable). Test this function by providing an appropriately prepared StringIO object.

  • Git Tag: ex-9   expert

    Add a test for the crawl_for function by substituting the calls of the get function of the requests library and the open built-in function with appropriate Mock objects.

A short digress on logging

If you let the web crawler run with default arguments (I chose to crawl every SCP between 2 and 1000), the script will fail miserably after a few seconds.

~/Data Science Projects/scp [tags/sol-ex-9] λ python src/data/webcrawl.py data/raw/
Traceback (most recent call last):
  File "src/data/webcrawl.py", line 47, in <module>
    crawl_for()
  File "/Users/paul/anaconda/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/paul/anaconda/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/paul/anaconda/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/paul/anaconda/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "src/data/webcrawl.py", line 41, in crawl_for
    label, paragraphs = split_into_label_and_text(content)
  File "src/data/webcrawl.py", line 20, in split_into_label_and_text
    obj_class_p = next(p for p in paragraphs if "Object Class" in p.get_text())
StopIteration

There seems to be a problem with finding the Object Class, but it is not immediately apparent what that problem could be. Looking into the folder data/raw, we see that only seven SCP articles have been written to text files. Thus, the problem occurred when crawling the eighth article. Taking a look at the article page of SCP-008, we see that it sits behind a javascript wall; the article itself is only rendered when a certain link is clicked. Thus, our program stops working because it cannot find the p element with the Object Class in it.

[Image: The article of SCP-008 sits behind a javascript link]
[Image: Hitting the javascript link reveals the proper article]

This is an example of a problem that occurs in programming all the time. You have a certain expectation and build your program with that expectation in mind; however, your expectation differs ever-so-slightly from reality and all of a sudden your program stops working. Tests will help you make sure that your program works according to your expectations but they won’t help you when your expectations are off. Additionally, analyzing why your program stopped working can become a huge hassle.

Fortunately, there is a general tool that will help you make your program much more communicative and your analysis much easier: Logging.

Test for the expected, log for the unexpected.

Logging not only helps you in cases where something is wrong. It may also improve the user experience a little. From the moment your script starts, the user does not get any feedback until an exception is thrown. However, your user might be interested in which SCP article the program is currently dealing with, for instance to gauge how long it will take to finish.

A short introduction to the python logging module

The built-in logging module helps us with both needs. A Python logger has three vital parts:

  1. A name. This can be used to configure a logger from within a configuration file. Usually, a logger is named after the module it is constructed in. This implies a hierarchical structure: If you wish to activate loggers for a certain module, the loggers in submodules get activated as well – assuming they adhere to this naming convention.
  2. A logging level. This determines which messages the logger filters out and which messages it lets through. In Python, the levels are critical, error, warning, info, debug, and notset, ordered from letting through only the most severe messages to letting through everything. Usually, you will use the debug level to leave behind breadcrumbs for, you guessed it, debugging – be it for your future self or a colleague; the info level for general information about where your program is and what its state is; the warning level for program states that are unusual and might hint at a problem arising in the near future; and the error and critical levels for exceptions and critical conditions that prevent your program from working successfully.
  3. One or more handlers. A handler defines how to deal with logging messages – should they be printed out to the console, or should they be written to a log file? A handler has its own logging level and thus is able to filter out specific logging messages. Also, it defines a logging format (through a formatter) that can give additional information.

We will define a logger with the module name (the magic name attribute) and three handlers: One handler (console_handler) that logs info messages to the console (thus, the user will at least experience where the program is at), and two FileHandlers that log debug messages and warnings to two separate files. The warning log will help us quickly identify that something went wrong while the debug log will give us detailed information why.

import logging


logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
std_formatter = logging.Formatter(
    "[%(asctime)s][%(levelname)-5s][%(threadName)s] - %(message)s"
)
warn_file_handler = logging.FileHandler("warnings.log")
warn_file_handler.setFormatter(std_formatter)
warn_file_handler.setLevel(logging.WARN)
debug_file_handler = logging.FileHandler("debug.log")
debug_file_handler.setFormatter(std_formatter)
debug_file_handler.setLevel(logging.DEBUG)
console_handler = logging.StreamHandler()
console_handler.setFormatter(std_formatter)
console_handler.setLevel(logging.INFO)
logger.addHandler(console_handler)
logger.addHandler(warn_file_handler)
logger.addHandler(debug_file_handler)
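
To illustrate how these handlers route messages (my own illustration, not code from the repository): given the configuration above, a debug message only reaches debug.log, an info message additionally appears on the console, and a warning ends up in every destination.

logger.debug("Only written to debug.log")
logger.info("Written to debug.log and printed to the console")
logger.warning("Written to debug.log, warnings.log, and printed to the console")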

The logger definition itself is pretty useless until we emit logging messages. I have decided to put a few info messages at the beginning and the end of our script as well as some debug messages that log the content of intermediate results. Furthermore, I added a warning message if the label we have identified is not one of SAFE, EUCLID or KETER.

# click decorators omitted...
def crawl_for(lower, upper, filepath):
    logger.debug(
        "Called with lower = %s, upper = %s, filepath = %s", lower, upper, filepath
    )
    for number in range(lower, upper):
        logger.info("Crawling number %d", number)
        url = construct_url(number)
        logger.debug("URL: %s", url)
        response = requests.get(url)
        logger.debug("Response: %s", response.text)
        content = filter_for_page_content(response.text)
        logger.debug("Content: %s", content)
        label, paragraphs = split_into_label_and_text(content)
        logger.info("Identified label %s", label)
        logger.debug("Paragraphs: %s", paragraphs)
        if label not in ("SAFE", "EUCLID", "KETER"):
            logger.warning("Unknown label %s for number %d", label, number)
        with open(os.path.join(filepath, f"scp-{number:03d}.txt"), "w") as f:
            write_to(f, label, paragraphs)


if __name__ == "__main__":
    logger.info("Start webcrawling...")
    crawl_for()
    logger.info("End webcrawling...")

Letting the script run now, we get a nice status message for each number and immediately see that something fails when crawling the article for SCP-008:

[2019-11-13 15:03:19,796][INFO ][MainThread] - Crawling number 7
[2019-11-13 15:03:20,398][INFO ][MainThread] - Identified label EUCLID
[2019-11-13 15:03:20,398][INFO ][MainThread] - Crawling number 8
Traceback (most recent call last):
  File "src/data/webcrawl.py", line 78, in <module>
    crawl_for()
(Rest of traceback omitted)

For the moment, I will simply catch the exception, emit an error logging message and continue the for loop.

# Outer for loop omitted...
        logger.debug("Content: %s", content)
        try:
            label, paragraphs = split_into_label_and_text(content)
        except Exception:
            logger.exception("Exception when splitting for number %d", number)
            continue
        logger.info("Identified label %s", label)

As a closing remark for this section, I would like to mention that logging ultimately comes down to personal preference, be it your own or that of your team. Arguably, a plethora of logging calls may pollute your code and obfuscate the meaning. It can be hard to find the right balance – that only comes with experience.

Exercises

  • Git Tag: ex-10   beginner

    Add a debug message to the split_into_label_and_text function that logs the content of the paragraphs variable.

  • Git Tag: ex-11   intermediate expert

    While the logging configuration is necessary, it also pollutes the program: A reader of the source code has to wade through 15 lines detailing the logging configuration when they might simply want to find out how the internal logic of the program works. Therefore, use fileConfig to move the logging configuration into a file. Here are a few hints that help you avoid some pitfalls I stumbled upon when building my configuration file:

    • First, the section on the file format in the docs contains a few examples that should help you on your journey.
    • Second, the script will be called from the root directory of the project. As a result, it is feasible to also put the logging config into the root directory.
    • Third, since we will be calling the script directly via the command line, the __name__ attribute will equal "__main__". I suggest configuring the root logger with the three handlers and a debug log level and one additional logger as follows.

      [logger_main]
      level=DEBUG
      propagate=1
      handlers=
      qualname=__main__
      

      The propagate flag causes every log message to be passed on to parent loggers. Because the root logger, as its name suggests, is a parent of every logger, the handlers of the root logger will deal with the log messages emitted by the main logger – even though the main logger does not define any handlers itself.

    • Finally, it is possible that we will modify the logging config in the future. Also, your logging preferences might differ from mine. Consequently, I only committed a template file logging_config_template.ini to version control and put logging_config.ini into .gitignore.

    Note that it is also possible to configure logging with a yaml file, parse it with the pyyaml library, and feed the resulting dictionary into dictConfig. The plain fileConfig is older and does not support every feature that dictConfig does, so this seems to be the current best practice for configuring logging via a file.
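
    For illustration, here is a minimal sketch of how such a yaml-based configuration could be loaded (the file name logging_config.yml is a hypothetical example, not a file in the repository):

      import logging.config

      import yaml

      # Read the yaml file and hand the resulting dictionary to dictConfig.
      with open("logging_config.yml") as f:
          logging.config.dictConfig(yaml.safe_load(f))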

Speeding it up with threading

We could let the script run now and it would fetch the first one thousand SCP articles for us; however, it will take some time. On my machine, each crawl for a single SCP article takes about 600 ms. That is to say, a thousand crawls will take about 600 seconds, which is 10 minutes. Analysing the timestamps in the debug.log, it seems that most of the time is spent waiting for the GET request to deliver the data.

Multi-threading and multi-processing in Python

Coming from Java, a language that naturally supports concurrency, I was surprised to learn that Python distinguishes between multi-threading and multi-processing. This is due to the global interpreter lock, which assures “that only one thread executes Python bytecode at a time.” To clarify, multi-threading in Python means concurrency within a single process (where, due to the GIL, only one thread executes Python bytecode at a time), while multi-processing distributes the work over several processes and thus over multiple processor cores. As a rule of thumb, multi-threading is useful when you want to make IO-heavy tasks (such as waiting for request responses and reading from or writing to files) concurrent. For computation-heavy tasks (such as solving equations, training Machine Learning models, and so on), stick to multi-processing.
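
To make this rule of thumb a bit more tangible, here is a small, purely illustrative sketch; the functions fetch and count_primes are hypothetical and not part of the project code. Both executor flavours share the same interface, so switching between them is easy:

import requests
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fetch(url):
    # IO-bound: most of the time is spent waiting for the network.
    return requests.get(url).status_code


def count_primes(limit):
    # CPU-bound: pure computation, benefits from multiple processes.
    return sum(all(n % d for d in range(2, n)) for n in range(2, limit))


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=5) as executor:  # threads for IO-heavy work
        print(list(executor.map(fetch, ["https://example.com"] * 5)))
    with ProcessPoolExecutor() as executor:  # processes for CPU-heavy work
        print(list(executor.map(count_primes, [20_000] * 4)))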

Implementing multi-threading via ThreadPoolExecutor

Using the tools in the concurrent.futures module, turning a for loop concurrent can be done using an easy-to-follow pattern. I would like to call it the Concurrency Refactoring Pattern or CRP for short.

  1. Refactor everything in the for loop into a function accepting the variable that is iterated over as its argument.
  2. Replace the for loop with a with statement initialising a ThreadPoolExecutor.
  3. Replace the call of the new function with a call to the map method of the initialised executor with the new function and the iterable that was iterated over in the for loop as its arguments.

To make this pattern clearer, here are code samples illustrating the CRP.

# Before: step 1 is already done – the loop body lives in the function do_something.
for x in it:
    do_something(x)

# Step 2: replace the for loop with a with statement initialising a ThreadPoolExecutor.
with ThreadPoolExecutor(max_workers=64) as executor:
    do_something(x)

# Step 3: replace the function call with executor.map over the original iterable.
with ThreadPoolExecutor(max_workers=64) as executor:
    executor.map(do_something, it)

Even though the amount of code does not differ that much, a lot is happening in these final two lines. First, the ThreadPoolExecutor is initialised with a maximum of 64 workers (threads). Think of this ThreadPoolExecutor as a manager that gives the workers something to do. In addition, it handles the case where there is not enough work for the number of workers we requested (imagine we only want to obtain 60 SCP articles but initialised the ThreadPoolExecutor with 64 max_workers – in this case, only 60 threads would be started). Second and last, the map method initiates the distribution of work among the workers. It accepts a function and an iterable as its arguments; the iterable is consumed to obtain the arguments that the workers feed into the function.

In our case, the situation is slightly more complicated because our function depends on two arguments: the filepath and the number. Even though the filepath does not change in the for loop, we still have to create an iterable with the same length as the range we are iterating over. Here’s how it will turn out.

# other imports unchanged...
from concurrent.futures import ThreadPoolExecutor

# other functions unchanged ...
def crawl(filepath, number):
    logger.info("Crawling number %d", number)
    url = construct_url(number)
    logger.debug("URL: %s", url)
    response = requests.get(url)
    logger.debug("Response: %s", response.text)
    content = filter_for_page_content(response.text)
    logger.debug("Content: %s", content)
    try:
        label, paragraphs = split_into_label_and_text(content)
    except Exception:
        logger.exception("Exception when splitting for number %d", number)
        return
    logger.info("Identified label %s", label)
    logger.debug("Paragraphs: %s", paragraphs)
    if label not in ("SAFE", "EUCLID", "KETER"):
        logger.warning("Unknown label %s for number %d", label, number)
    with open(os.path.join(filepath, f"scp-{number:03d}.txt"), "w") as f:
        write_to(f, label, paragraphs)


# click decorators ommitted...
def crawl_for(lower, upper, filepath):
    logger.debug(
        "Called with lower = %s, upper = %s, filepath = %s", lower, upper, filepath
    )
    with ThreadPoolExecutor(max_workers=64) as executor:
        executor.map(
            crawl, (filepath for _ in range(lower, upper)), range(lower, upper)
        )

As you can see, you should supply the map method with as many iterables as the mapped function has arguments.
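
As an aside, if constructing a dummy iterable for the constant filepath argument feels clumsy, one alternative – a sketch of my own, not what the repository does – is to bind the constant argument with functools.partial and map over the numbers only:

from functools import partial

# partial(crawl, filepath) yields a callable that only needs the number argument.
with ThreadPoolExecutor(max_workers=64) as executor:
    executor.map(partial(crawl, filepath), range(lower, upper))

Another option is itertools.repeat(filepath), which works because executor.map stops as soon as the shortest of its iterables is exhausted.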

Exercises

  • Git Tag: ex-12   beginner

    Turn the max_workers number 64 into a click option.

Clean up

After having run the web crawler, watching the 64 threads delightfully crunching through the pages and punching them into text files, it is time to take a look at the results. There are quite a few warnings and errors logged into our warnings.log. Let’s take a look at them and see if we have to modify the web crawler and re-run it once more.

Errors: SCP pages we were not able to crawl correctly

Using the warnings.log, we can estimate how many errors occurred.

grep "\[ERROR\]" warnings.log | wc -l

Here, it pays off that we incorporated the log level into our log format. Note that we have to escape the square brackets with backslashes because they have a special meaning in regular expressions. In my run, I got 12 error log messages. Taking a closer look at them, we see that there are quite a few SCPs that have a javascript wall in front of them. For instance, we already know about SCP-008. Others have a slightly different HTML structure: SCP-285 has another div element wrapped around the p elements with the content we are interested in. I plan on ignoring all of them for the moment.

Warnings: Unknown labels

Using the warnings.log, we can estimate how many of the crawled SCPs have been assigned an unexpected label. A quick grep combined with the word count utility comes to the rescue:

grep "Unknown label" warning.log | wc -l

For my run with default bounds, this yields 56 unknown labels. Closer inspection shows that the majority are not truly unknown labels but known labels with further information attached. For instance, SCP-417 is classified as Euclid, but the author wanted to note that it could potentially be Keter. Furthermore, there are a few SCPs that apparently have been assigned a finer classification. For example, SCP-66 is classified as “euclid-impetus” and SCP-625 as “euclid-flecto”. Because the majority of the SCPs are not classified this way, I plan on only using the coarse label. The truly unexpected labels are the following:

  • None (48)
  • Thaumiel (179, 378)
  • Neutralized (356, 407, 541, 696, 821, 818)
  • Scarf (586)

A few of the neutralized ones have a previously assigned label, such as SCP-818. I could take the former label into account, but since we are only talking about a handful of data points here, I plan on ignoring them altogether. The “Scarf” one is interesting. Apparently, the underlying SCP causes writers to make typos when writing about it. I suppose that the real label should be “Safe”. The SCP belonging to the “None” label seems to be a placeholder. There are also a few (expected) labels with a leading colon, for instance for SCP-75. Apparently, this is caused by the colon not being inside the strong element. This can be fixed without too much hassle, so let’s do it right now.

  • Fixing the “leading colon label” bug

    First, let’s write a test reproducing the bug by copying our test_split function and moving the colon behind “Object Class” out of the strong element:

        def test_split_with_leading_colon(self):
            test_content = BeautifulSoup(
                """
                <div class="image-content">
                    <p>Some caption</p>
                </div>
                <p><strong>Item #:</strong> SCP-xxx</p>
                <p><strong>Object Class</strong>: Keter</p>
                <p><strong>Special Containment Procedures:</strong> ...</p>
                <p><strong>Description:</strong> ...</p>
                <p>Other...</p>
                <div class="footer">
                    <p>Links to other SCPs...</p>
                </div>
                """,
                features="html.parser",
            )
            actual_label, actual_content = split_into_label_and_text(test_content)
            expected_label = "KETER"
            expected_content = [
                "<p><strong>Item #:</strong> SCP-xxx</p>",
                "<p><strong>Special Containment Procedures:</strong> ...</p>",
                "<p><strong>Description:</strong> ...</p>",
                "<p>Other...</p>",
            ]
            self.assertEqual(expected_label, actual_label)
            self.assertEqual(expected_content, [str(p) for p in actual_content])
    

    To make the tests a little more diverse, I also changed the label from “Safe” to “Keter”. Running the tests should get you precisely one failure:

    AssertionError: 'KETER' != ': KETER'
    - KETER
    + : KETER
    

    The easy way to fix it would be to simply do a string replace on the label inside the split_into_label_and_text function:

    label = obj_class_p.contents[-1].strip().upper().replace(": ", "")
    

    Our tests should be green again. This reduced the unexpected label warnings to 41. We could also make the web crawler deal with labels such as “euclid-impetus” and only write the coarser label to the text file. However, I plan on leaving that to the data transformation blog post.

Preparing for the next step: Editing make targets

The Data Science cookiecutter template defines several make targets that will be useful in the next blog post. Using the make command line utility allows us to execute quite complex command line scripts such as our web crawler with a simple API. Also, it lets us define dependencies such as “only run this task if this source code file changed.”

The make utility is configured via a Makefile. One is already present in the project and for instance defines a clean target that deletes all compiled Python code (that is, it deletes all __pycache__ directories and files ending in .pyc or .pyo). This clean target is executed via make clean. In the Makefile, let’s also add that log files should be cleaned up.

## Delete all compiled Python files and log files
clean:
	find . -type f -name "*.py[co]" -delete
	find . -type f -name "*.log" -delete
	find . -type d -name "__pycache__" -delete

Now, whenever you execute make clean, all log files will be deleted. Furthermore, we will add a new target (under “PROJECT RULES”) that will execute the web crawler.

data/raw: src/data/webcrawl.py
	$(PYTHON_INTERPRETER) src/data/webcrawl.py data/raw

Note that this target has a dependency: It depends on the file src/data/webcrawl.py. What make does is the following: It checks whether webcrawl.py has been modified more recently than the directory data/raw. If so, it executes the commands listed below the target. Otherwise, it will tell you that the target is up-to-date.
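
For example, if you run the target twice in a row without modifying webcrawl.py in between, the second invocation will not re-run the crawler but print a message along the following lines (the exact wording depends on your make version):

make data/raw
make: 'data/raw' is up to date.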

Finally, we add the target data/raw as a dependency to the data target.

## Make Dataset
data: requirements data/raw
	$(PYTHON_INTERPRETER) src/data/make_dataset.py data/raw data/processed

The data target comes from the Data Science cookiecutter template. It will be implemented next time, when we deal with data preprocessing and transformations.

Exercises

  • Git Tag: ex-13   beginner

    The data/raw directory does not necessarily exist after cloning the repository. Edit the data/raw target in the Makefile so that the directory is created if it is missing.

  • Git Tag: ex-14   intermediate

    Add a new target logging_config.ini. Executing this target should copy the file logging_config_template.ini to logging_config.ini. Furthermore, add a new phony target, i.e. a target that does not correspond to a file name, called setup that does not execute any additional actions but depends on the targets logging_config.ini and create_environment.

Conclusion

We have come quite a long way in this blog post. In more detail, you have learnt how to:

  • Write code in a test-driven manner using pytest,
  • Set up logging using the builtin logging module,
  • Implement multi-threading using the builtin concurrent.futures module,
  • Use the requests library to issue a GET request,
  • Make use of the BeautifulSoup library to parse HTML, and
  • Read a Makefile and use make targets.

Hopefully, I was able to demonstrate how to use cookiecutter templates in general and, more specifically, how to use the Data Science template.

Further reading

I have linked to the documentation of the libraries we have used throughout. However, if you want to take an even deeper dive into some of these topics, I suggest the following.

  • Automate the Boring Stuff with Python by Al Sweigart is a beginner-friendly introduction to automation scripts with Python. It gives you step-by-step instructions for your scripts as well as further projects to work on. In particular, I would recommend the eleventh chapter Web Scraping for strengthening your understanding of web crawling and working with requests, BeautifulSoup and related libraries I have not mentioned in this blog post.
  • Effective Python by Brett Slatkin gives you an overview of best practices across a range of topics. In particular, I would recommend the fifth chapter, Concurrency and Parallelism, if you would like to strengthen your understanding of multi-threading and multi-processing.
  • The examples on how to work with make on the Cookiecutter Data Science page are helpful learning resources.