After we have obtained data through the means described in the first part of this blog post series, it is time to deal with data transformations and data preprocessing. While humans can comprehend textual information in the form of articles, it is hard to digest for a Machine Learning algorithm. In this blog post, we will transform the textual information into vectors that assign a number to each word in the vocabulary of the set of articles: this is what TF-IDF (Term Frequency - Inverse Document Frequency) is all about.

In comparison to the web crawler post, this one is more mathematical in nature. Instead of evaluating technical approaches and executing them in a test-driven manner, we will have to understand the mathematical background behind the algorithm to put it to good use.

To make the transition as gentle as possible, let us do a warm-up that is closer to the technical spirit of the last blog post: We will use the text files produced by the web crawler to extract the lengths of certain paragraphs and investigate in what way these help us determine the Object Class of an article.

After we have understood how TF-IDF works, we can use it to transform our articles into TF-IDF vectors. As a first application, this will already allow us to extract keywords from each article.

Warm-up: Extracting the paragraph lengths

To structure the text from the articles, let us use a custom class.

Basic version of the Article class

class Article(object):
    def __init__(self, label, name, procedures, desc):
        self.label = label.strip()
        self.name = name.strip()
        self.procedures = procedures.strip()
        self.desc = desc.strip()
src/data/article_data.py

The logic that splits up the text from the text files into attributes of the class will be a classmethod that accepts a list of lines of text and returns a readily constructed Article instance.

class Article(object):
    # __init__ omitted...

    @classmethod
    def from_text(cls, lines):
        label, *text = lines
        text = "".join(text)
        name, rest = text.split("Special Containment Procedures:")
        procedures, desc = rest.split("Description:")
        return cls(label, name, procedures, desc)
src/data/article_data.py

Here's a basic test that shows how to use the classmethod to obtain an Article instance.

from src.data.article_data import Article


def test_from_text():
    procedures = [
        "Special Containment Procedures: Something...",
        "Something part two...",
    ]
    description = "Description: Something else..."
    article = Article.from_text(["SAFE", "Some name   ", *procedures, description])
    assert article.label == "SAFE"
    assert article.name == "Some name"
    assert article.procedures == "Something...Something part two..."
    assert article.desc == "Something else..."
tests/data/test_article_data.py

Validation of the label through a @property

As mentioned in the last part of this series, we are only concentrating on the labels SAFE, EUCLID and KETER. To account for this, we need to validate that the incoming label is one of those. We are a little more lenient and also accept labels that only start with one of those three labels.

Let us write tests first to define the desired behavior.

import pytest
from src.data.article_data import Article


@pytest.fixture
def article():
    return Article("SAFE", "", "", "")


@pytest.mark.parametrize("label", ["SAFE", "EUCLID", "KETER"])
def test_set_regular_label(article, label):
    article.label = label
    assert article.label == label
    article.label = label + "SOMETHING"
    assert article.label == label


def test_set_unknown_label(article):
    with pytest.raises(ValueError) as excinfo:
        article.label = "unknown"
    assert "unknown" in str(excinfo)
tests/data/test_article_data.py

In the tests above, we are using a fixture that gives us an initialized Article instance. Then, we are defining the regular behavior of the setter (we are expecting the label to accept the three main object classes as well as labels that start with those) and what happens when the setter encounters an unknown label (we are expecting a ValueError, enforced via the raises helper).

Because we have not written any validation for the label attribute yet, the tests fail. To implement this kind of validation, Python offers the @property decorator, which allows for custom getter and setter methods.

class Article(object):

    ALLOWED_LABELS = ("SAFE", "EUCLID", "KETER")

    # __init__ omitted...

    @property
    def label(self):
        return self._label

    @label.setter
    def label(self, orig_label):
        labels = [
            label for label in self.ALLOWED_LABELS if orig_label.startswith(label)
        ]
        if not labels:
            raise ValueError(f"Unknown label '{orig_label}'!")
        self._label = labels.pop()
src/data/article_data.py

The Python interpreter calls the method decorated with @label.setter as soon as it encounters the line self.label = label.strip() in the __init__ method. As a result, code that uses this class has to deal with ValueErrors when constructing instances.
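For illustration, here is a minimal, hypothetical usage sketch (not part of the project code) showing how calling code might guard against an invalid label; "THAUMIEL" simply stands for any object class we do not handle:

from src.data.article_data import Article

try:
    article = Article("THAUMIEL", "Some name", "Procedures...", "Description...")
except ValueError as error:
    # The label setter rejected the object class, e.g. "Unknown label 'THAUMIEL'!"
    print(f"Skipping article: {error}")
Hypothetical usage sketch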

Adding a to_dict method

While the Article class is responsible for extracting information from the articles, it is much easier to use a simple dictionary when persisting extracted information. That is because the json library can serialize Python dictionaries directly; additionally, the pandas Data Science library is able to use dictionaries to construct its main object: the DataFrame. As a result, we need to write a to_dict method that turns an Article instance into a plain dictionary. Aside from the four attributes of the Article class, we also require the dictionary to contain the (character) lengths of the Procedures and the Description as well as the ratio of these two lengths.

def test_to_dict_trivial_article(article):
    d = article.to_dict()
    assert "Label" in d
    assert d["Label"] == "SAFE"
    assert "Name" in d
    assert "Procedures" in d
    assert "Description" in d
    assert "Procedures_Length" in d
    assert d["Procedures_Length"] == 0
    assert "Description_Length" in d
    assert d["Description_Length"] == 0
    assert "Procedures_Description_Ratio" in d
    assert d["Procedures_Description_Ratio"] == 0


def test_to_dict(article):
    article.name = "Test"
    article.procedures = "TestTest"
    article.desc = "Test"
    d = article.to_dict()
    assert "Label" in d
    assert d["Label"] == "SAFE"
    assert "Name" in d
    assert d["Name"] == "Test"
    assert "Procedures" in d
    assert d["Procedures"] == "TestTest"
    assert "Description" in d
    assert d["Description"] == "Test"
    assert "Procedures_Length" in d
    assert d["Procedures_Length"] == 8
    assert "Description_Length" in d
    assert d["Description_Length"] == 4
    assert "Procedures_Description_Ratio" in d
    assert d["Procedures_Description_Ratio"] == 2
tests/data/test_article_data.py

The implementation is straightforward and returns a plain dictionary literal.

    def to_dict(self):
        return {
            "Label": self.label,
            "Name": self.name,
            "Procedures": self.procedures,
            "Description": self.desc,
            "Procedures_Length": len(self.procedures),
            "Description_Length": len(self.desc),
            "Procedures_Description_Ratio": len(self.procedures) / len(self.desc)
            if len(self.desc) > 0
            else 0,
        }
src/data/article_data.py

Using the Article class to process the txt files

Finally, we want to use the Article class to process the text files. More precisely, we would like to aggregate the articles into a pandas DataFrame. This object has a to_json method that allows us to persist it for later introspection.

First, let us write a test to pin down our expectations.

import pandas as pd
from click.testing import CliRunner
from src.data.make_dataset import main

TEST_DATA = {
    "scp-002.txt": """EUCLID\n
Item #: 002\n
Special Containment Procedures: Something something...\n
Description: Something else...\n
""",
    "scp-003.txt": """UNKNOWN\n
Item #: 003\n
Special Containment Procedures: Something something...\n
Description: Something else...\n
""",
}


def test_main():
    runner = CliRunner()
    with runner.isolated_filesystem():
        for filename, text in TEST_DATA.items():
            with open(filename, "w") as f:
                f.write(text)
        result = runner.invoke(main, [".", "."])
        assert result.exit_code == 0
        df = pd.read_json("data.json")
        assert len(df.index) == 1
        data = df.loc[0]
        assert "Label" in data
        assert data["Label"] == "EUCLID"
        assert "Name" in data
        assert data["Name"] == "Item #: 002"
        assert "Procedures" in data
        assert data["Procedures"] == "Something something..."
        assert "Description" in data
        assert data["Description"] == "Something else..."
tests/data/test_make_dataset.py

Here, we are using the dictionary TEST_DATA to write two files with two mock articles. The first is a regular article with a valid object class, the second one is an article we do not wish to process. As a result, we expect that only one article is present in the processed data. Note that we are using pandas' read_json method to obtain a DataFrame and, in turn, we are using DataFrame methods to ensure that only one article is present and that the article data has been split up correctly.

To make this test pass we have to implement the following strategy:

  1. Start with an empty DataFrame.
  2. Parse each text file in the data/raw folder and turn it into an Article.
  3. Use Article's to_dict method to append the data to the DataFrame.
  4. Persist the DataFrame to a json file in the data/processed folder.

Here's a sketch implementation that uses pathlib's Path.glob method to iterate through the text files.

import pandas as pd
from pathlib import Path
from src.data.article_data import Article


df = pd.DataFrame({})
for file in Path("data/raw").glob("scp-*.txt"):
    with file.open() as f:
        article = Article.from_text(f.readlines())
    df = df.append(article.to_dict(), ignore_index=True)
df.to_json("data/processed/data.json")
First sketch of the processing

From a software design perspective, this code leaves a lot to be desired. First, there are no log messages, which would come in handy when things go wrong. Second, the paths are hard-coded and should be replaced by function parameters. Third and last, the Article classmethod from_text raises a ValueError (via the label setter) each time it encounters an article with an unknown object class. We have to deal with this kind of situation without letting the entire script fail.

Here's a revision of the sketch.

import click
import logging.config
import pandas as pd
from pathlib import Path
from src.data.article_data import Article


PROJECT_DIR = Path(__file__).resolve().parents[2]
logging.config.fileConfig(PROJECT_DIR / "logging_config.ini")
logger = logging.getLogger(__name__)


@click.command()
@click.argument("input_filepath", type=click.Path(exists=True))
@click.argument("output_filepath", type=click.Path())
def main(input_filepath, output_filepath):
    """ Runs data processing scripts to turn raw data from (../raw) into
        cleaned data ready to be analyzed (saved in ../processed).
    """
    logger.info("making final data set from raw data")
    df = pd.DataFrame({})
    for file in Path(input_filepath).glob("scp-*.txt"):
        logger.info("File: %s", file)
        with file.open() as f:
            try:
                article = Article.from_text(f.readlines())
            except ValueError as e:
                logger.warning("ValueError in file %s: %s", file, e)
                continue
        df = df.append(article.to_dict(), ignore_index=True)
    logger.info("DataFrame extracted. Writing to data.json in %s", output_filepath)
    df.to_json(Path(output_filepath) / "data.json")
    logger.info("Done.")


if __name__ == "__main__":
    main()
src/data/make_dataset.py

Note that we are emitting log warning messages whenever we encounter an unknown label but still continue with the processing.

Exercises

Just like in the last blog post, you can rapidly tackle an exercise by using git tags. For instance, if you want to tackle the first exercise, issue the command git checkout ex-15 and start coding. If you want to compare your solution to mine, issue git diff sol-ex-15 when you have finished.

Git tag: ex-15   beginner

Add a __repr__ method to the Article class.

Git tag: ex-16   beginner

Add another test for the from_text method. The input shall be the same as in the test_from_text method except that you will leave out the name ("Some name "). Consequently, assert that the name of the resulting Article instance will be an empty string.

Git tag: ex-17   intermediate

Unfortunately, there are some SCP articles that slightly diverge from the usual naming convention for their parts. For instance, SCP-524 has a Special Containment Procedure (singular!), SCP-2944 has Secure Containment Procedures, and SCP-931 consists of haiku paragraphs. While we could certainly be a little more thorough when parsing them, I will ignore them for the rest of this blog post series (I encountered 130 warnings when parsing the first 3000 SCP articles, so fewer than 5% of the articles are parsed incorrectly). However, if you want to, feel free to optimize the parsing procedure. For starters, allow for the "Description" part to start with either "Description:" or "Summary:". Do not forget to write tests!

Git tag: ex-18   intermediate

Raise RuntimeErrors whenever the Containment Procedures or the Description cannot be extracted. Catch these RuntimeErrors in make_dataset.py, log the error and continue with the for loop without adding the article to the DataFrame. Finally, add another test article in test_make_dataset.py with an unexpected beginning of the description and tests to test_article_data.py to make sure these RuntimeErrors are actually raised.

Quick analysis of the lengths

After we have extracted the (character) lengths of the two parts of the SCP articles, let us analyze them. We will use pandas to load the json file and compute some basic statistical measures.

Open JupyterLab (either by opening a terminal and issuing the command jupyter lab or by opening Anaconda, switching to the environment for the SCP project and opening Jupyter Lab there), navigate to the notebooks folder of the SCP project and click the "+" icon above the folder breadcrumbs to fire up a new launcher.

[Image: the JupyterLab launcher (/images/scp2_jupyterlab_launcher.png)]

In the Launcher tab that opens, choose a Python 3 Notebook. Now you are all set up to experiment with data interactively. The following is a transcript of my Jupyter notebook session.

Computing statistics of the lengths

We want to check that all the transformations we have done so far are sane so that we can work with a cleaned up dataset.

import pandas as pd

df = pd.read_json("../data/processed/data.json")
df.head()
In[1]
     | Description                                      | Description_Length | Label  | Name             | Procedures                                       | Procedures_Description_Ratio | Procedures_Length
0    | SCP-1256 is a 24-page pamphlet entitled 'Bees … | 1837               | SAFE   | Item #: SCP-1256 | Mobile Task Force Zeta-4 ('Beekeepers') is cur…  | 0.224279                     | 412
1    | SCP-2987 is a modified MSI brand external hard… | 2187               | SAFE   | Item #: SCP-2987 | SCP-2987 is to be kept on floor 17 of Site-88…   | 0.203475                     | 445
2    | SCP-2039 collectively refers to two distinct f… | 5399               | EUCLID | Item #: SCP-2039 | Presently, Foundation efforts at Research Faci…  | 0.368772                     | 1991
3    | SCP-1530 is a two-story abandoned house locate… | 3893               | EUCLID | Item #: SCP-1530 | SCP-1530 is currently contained 120 meters fro…  | 0.201387                     | 784
4    | SCP-1524 is the sole remaining specimen of a s… | 3211               | EUCLID | Item #: SCP-1524 | Both of SCP-1524's individual components are t…  | 0.530364                     | 1703
Out[1]

Let's look at some statistics of the extracted text lengths and the ratio.

df.describe()
In[2]
        Description_Length  Procedures_Description_Ratio  Procedures_Length
count          2700.000000                   2700.000000        2700.000000
mean           3208.542222                      0.286840         777.595556
std            1658.345674                      0.293568         519.808074
min              61.000000                      0.000000           0.000000
25%            2104.750000                      0.145726         414.750000
50%            2887.000000                      0.229935         656.500000
75%            3957.000000                      0.353646         994.250000
max           31618.000000                      7.377049        7922.000000
Out[2]

Whereas count, mean, min and max are self-explanatory, std stands for standard deviation. The rows with percentages are the 25%-, 50%-, and 75%-quantiles, respectively. They were defined in my Blog post on means and medians. Here's a short refresher: The 25%-quantile is a value such that 25% of the data is smaller than or equal to it and the other 75% of the data is greater than or equal to it. The 50%-quantile is also known as the median.

The minimum of 61 characters in Description_Length looks reasonable but a Containment Procedure with 0 characters? This has to be investigated. Before we do so, let us look at the same statistics but grouped by each label.

df.groupby("Label").describe().stack()
In[3]
               Description_Length  Procedures_Description_Ratio  Procedures_Length
Label
EUCLID  count         1274.000000                   1274.000000        1274.000000
        mean          3244.361852                      0.308139         855.422292
        std           1701.660229                      0.273383         529.896660
        min            428.000000                      0.011165         148.000000
        25%           2179.250000                      0.169438         497.250000
        50%           2935.500000                      0.259065         727.000000
        75%           3977.750000                      0.371186        1075.750000
        max          31618.000000                      6.051948        7922.000000
KETER   count          314.000000                    314.000000         314.000000
        mean          3380.487261                      0.401208        1128.343949
        std           1694.007237                      0.328462         605.260134
        min            233.000000                      0.000000           0.000000
        25%           2243.000000                      0.218239         683.250000
        50%           3197.500000                      0.332694        1028.000000
        75%           4192.250000                      0.486212        1449.750000
        max          10141.000000                      3.781726        3449.000000
SAFE    count         1112.000000                   1112.000000        1112.000000
        mean          3118.951439                      0.230143         589.388489
        std           1592.721215                      0.293088         392.807626
        min             61.000000                      0.010626          64.000000
        25%           2003.000000                      0.118879         326.000000
        50%           2791.500000                      0.178353         488.500000
        75%           3860.750000                      0.277565         730.500000
        max          12331.000000                      7.377049        3680.000000
Out[3]

This is where it starts to get interesting! As safe SCPs are much easier to contain than euclid ones which in turn are easier to contain than keter SCPs, we expect that the Containment Procedures are easier to describe for safe ones and need more elaborate descriptions for keter ones. On average, this is reflected in the mean length of the Containment Procedures (roughly 589 characters for safe, 855 for euclid and 1128 for keter).

Let us turn to the problematic cases of zero lengths.

df.loc[(df["Procedures_Length"] == 0) | (df["Description_Length"] == 0)]
In[4]
     | Description                                      | Description_Length | Label | Name             | Procedures | Procedures_Description_Ratio | Procedures_Length
1340 | SCP-1994 is the general designation for a set … | 1376               | KETER | Item #: SCP-1994 |            | 0.0                          | 0
Out[4]

Thankfully, this is a single outlier. Investigating the article on the SCP Foundation web page and inspecting the HTML reveals that the heading "Special Containment Procedures" sits in its own p element, so we were not able to crawl this article correctly.

Let us ignore the outlier.

df = df.loc[df["Procedures_Length"] > 0]
In[5]

Finally, let us compute correlations between our features and the target. The correlation coefficient may be computed for numeric random variables only. Fortunately, the nominal labels safe, euclid, and keter carry ordinal information. That is to say, we can order them by their containment complexity. To make this even more explicit, let us assign numbers to the three labels. A safe label will be converted to -1, a euclid label to 0 and a keter label to 1 so that the order of the containment complexity is reflected by $\mathrm{safe} < \mathrm{euclid} < \mathrm{keter}$. However, the magnitude of this conversion is still open for discussion. We could also have chosen $10^{100}$ for keter and this would have influenced the correlation coefficients. But let's stick to our simple way of converting for now.

COMPLEXITY = {
    "SAFE": -1,
    "EUCLID": 0,
    "KETER": 1
}

def compute_complexity(label):
    return COMPLEXITY[label]

df["Complexity"] = df["Label"].apply(compute_complexity)
df.corr()
In[6]
                              Description_Length  Procedures_Description_Ratio  Procedures_Length  Complexity
Description_Length                      1.000000                     -0.293831           0.220675    0.052532
Procedures_Description_Ratio           -0.293831                      1.000000           0.577548    0.188953
Procedures_Length                       0.220675                      0.577548           1.000000    0.344329
Complexity                              0.052532                      0.188953           0.344329    1.000000
Out[6]

As it turns out, Complexity and Procedures_Length are positively correlated, which is precisely what we have observed in the statistics grouped by label. We also see that Description_Length is only very weakly correlated with Complexity: That is to say, there is no reason why, say, a safe SCP should not have a long description or why a keter SCP could not be described briefly.

Mathematical background behind TF-IDF

Before we get to enjoy the ease of use of sklearn's Transformer API to apply the TF-IDF transformation, let's try to get some understanding of it first. The easiest way to turn words into numbers is to count them. As simple as this sounds, this idea is the cornerstone of the TF-IDF transformation.

Word count vectors

Let me make this more precise. Assume that we have articles \(A_1, \dotsc, A_n\). The vocabulary \(\mathcal{V} = \mathcal{V}_{A_1, \dotsc, A_n}\) is the set of unique words occurring in those articles. The set of all of our articles will also be called the document. For any article \(A\), the word count function is \[ \mathrm{wc}_{A}\colon \mathcal{V} \to \mathbb{N}, \, w \mapsto \text{Number of times \(w\) occurs in \(A\)}. \]

Once we have fixed the vocabulary, it is possible to turn the word count functions into word count vectors. First, we merely need to decide on an order of the vocabulary—the alphabetic ordering is a canonical choice! As soon as we have ordered the vocabulary, we may write it as \(\mathcal{V} = \{w_1, \dotsc, w_m\}\), where \(m\) is the total number of words in the vocabulary. Finally, we declare that the word count vector of the article \(A\) is \[ v_A = \bigl(\mathrm{wc}_A(w_1), \dotsc, \mathrm{wc}_A(w_m)\bigr). \]
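To make the definition concrete, here is a minimal toy sketch (plain Python, not part of the project code) that builds word count vectors for two tiny example articles:

from collections import Counter

articles = [
    "the cat sat on the mat",
    "the dog chased the cat",
]
tokenized = [article.split() for article in articles]

# The vocabulary is the set of unique words, in alphabetic order.
vocabulary = sorted({word for words in tokenized for word in words})


def word_count_vector(words):
    counts = Counter(words)
    return [counts[word] for word in vocabulary]


vectors = [word_count_vector(words) for words in tokenized]
# vocabulary: ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
# vectors:    [[1, 0, 0, 1, 1, 1, 2], [1, 1, 1, 0, 0, 0, 2]]
Toy sketch of word count vectors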

Normalizing word counts

Instead of only talking about single words, we can more generally deal with terms. These can also be sequences of words that appear in our documents. Depending on how long those sequences are, they may carry information about the context that cannot be inferred from single words alone. Consequently, the word count vectors from before may more generally be called term count vectors (even though this does not seem to be standard language).
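As an aside, sklearn's TfidfVectorizer (which we will meet below) covers such multi-word terms via its ngram_range parameter; for instance, a vectorizer configured as follows would build its vocabulary from single words as well as pairs of consecutive words:

from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(1, 2): use unigrams and bigrams as terms.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))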

In general, vectors have the advantage of being comparable by using distance measures such as the Euclidean distance. Depending on the variety of articles and the precise application, however, there might be a problem. To make this concrete, let me illustrate it with a simple artificial example. Take the sentence "I am sure I am right about this" and transform it into a word count vector. Using alphabetic ordering, you should end up with the following.

word   | about | am | I | right | sure | this
count  | 1     | 2  | 2 | 1     | 1    | 1

Let us add a second "article" to our document. The text consists of that same sentence twice. Thus, we obtain

word   | about | am | I | right | sure | this
count  | 2     | 4  | 4 | 2     | 2    | 2

as its word count vector.

If you want to consider these two articles as similar or even the same, you will need to normalize the vectors. This means that before comparing two word count vectors, you divide each of them by its (Euclidean) length. In this case, this approach will lead to the two articles being seen as the same. However, there are also reasons for wanting to tell these two articles apart: Even if they deal with the same point, the second one puts stronger emphasis on it by repetition.
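Here is a short numerical sketch (using numpy, not part of the project code) confirming that the two word count vectors from above become indistinguishable after normalization:

import numpy as np

v1 = np.array([1, 2, 2, 1, 1, 1])  # "I am sure I am right about this"
v2 = np.array([2, 4, 4, 2, 2, 2])  # the same sentence twice

# Divide each vector by its Euclidean length.
v1_normalized = v1 / np.linalg.norm(v1)
v2_normalized = v2 / np.linalg.norm(v2)

print(np.allclose(v1_normalized, v2_normalized))  # True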

To sum it up, depending on the application you might want to think about normalization of your word count vectors. In any case, the resulting word count vectors will be called term frequencies (even if you did not normalize) from now on. This concludes the first half of the TF-IDF transformation.

Inverse document frequency

Term frequency vectors suffer from another problem: Words such as "a", "the", and "is" occur in almost every English article and do not carry much meaning, yet they influence the similarity of articles. There are two ways to deal with this problem (in fact, they are often used in conjunction). The first is to simply ignore those words: Create a list of so-called stop words that will be ignored when building the vocabulary. The second way is to penalize words occurring in almost every article and boost rare words. This is precisely what the inverse document frequency is doing.

Before we arrive at the precise definition, let us look at it from another angle. A word count is a measure that is local to a single article. This means that it does not depend on other articles in our document. If a single word count is high in that article, this might mean that it is an important word that helps characterize the article. However, if this word count is equally high in all the other articles, then the word does not help us tell this article apart from the others (if everything is special then nothing is). Thus, there is the need for a trade-off between local measures (such as a single word count of a certain article) and global measures.

Inverse document frequency is such a global measure. To define it, let us concentrate on a single word \(w\) in our vocabulary. The document frequency \(\mathrm{df}(w)\) is the number of articles that \(w\) appears in. Consequently, the inverse document frequency is \(1/\mathrm{df}(w)\).

Now we are able to describe the TF-IDF transformation as a whole: For any article \(A\) in the document, multiply each word count in its word count vector with the inverse document frequency. In formulae: \[ \mathrm{TFIDF}(A) = \left(\frac{\mathrm{wc}_A(w_1)}{\mathrm{df}(w_1)}, \dotsc, \frac{\mathrm{wc}_A(w_m)}{\mathrm{df}(w_m)}\right). \]
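As a sanity check of this definition, here is a toy implementation (plain Python, not the project code and not what sklearn does internally) that computes exactly the vectors given by the formula above:

from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird watched the dog",
]
tokenized = [document.split() for document in documents]
vocabulary = sorted({word for words in tokenized for word in words})

# Document frequency: the number of articles each word appears in.
doc_freq = {word: sum(word in words for words in tokenized) for word in vocabulary}


def tfidf(words):
    counts = Counter(words)
    return [counts[word] / doc_freq[word] for word in vocabulary]


vectors = [tfidf(words) for words in tokenized]
Toy implementation of the TF-IDF formula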

Applying the TF-IDF transformation (transcript of a Jupyter notebook session)

Before we apply the TF-IDF transformation, it is essential to set aside some test data for evaluating our model later. Otherwise, a future Machine Learning model would have access to statistics of the entire dataset and could indirectly pick up information about the test dataset. However, the entire purpose of the train-test split is to evaluate the model on data it has not seen before.

import pandas as pd

df = pd.read_json("../data/processed/data.json")
df = df.loc[df["Procedures_Length"] > 0, [
    "Label",
    "Procedures",
    "Description",
    "Procedures_Length",
    "Description_Length",
    "Procedures_Description_Ratio"
]]
In[1]

Making a train-test-split

With sklearn, splitting a DataFrame reduces to calling the train_test_split function from the model_selection module. The test_size argument determines the relative size of the test set.

from sklearn.model_selection import train_test_split

X, y = df.drop(columns=["Label"]), df["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
In[2]

Note that we split up our target column Label from the rest so that it will not be included in the following transformations.

Fitting TfidfVectorizers

Since we have two text columns (Procedures and Description), it is best to fit two TfidfVectorizers so that the information contained in each of them is preserved separately. The rest of the features should be scaled, as certain models encounter numerical problems when two features are on very different scales (that is to say, one feature usually is very large, e.g. $\gg 10^6$, while another only attains values between 0 and 1). To do all of this in one go, sklearn provides us with a ColumnTransformer that takes a list of triples, each consisting of a name for the step, a transformer, and the column the transformer should be applied to. Additionally, the ColumnTransformer's remainder keyword argument may be another transformer that will be applied to the remaining columns. Here's how to use it:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler


columnwise_tfidf = ColumnTransformer(
    [
        (
            "procedures",
            TfidfVectorizer(),
            "Procedures"
        ),
        (
            "desc",
            TfidfVectorizer(),
            "Description"
        )
    ],
    remainder=StandardScaler(),
    n_jobs=-1,
)
In[3]

First, the first item in each tuple is a name for the transformation, used for later reference. Second, a TfidfVectorizer with default arguments constructs the TF-IDF vectors in a way similar to the one explained above. The main differences are that sklearn dampens the inverse document frequency with a logarithm and smooths it by adding one to the document frequencies to prevent zero divisions. Third and last, the StandardScaler scales the remaining features such that they have zero mean and unit standard deviation.
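For reference, the sklearn documentation states the following smoothed idf weight for the default settings, where \(n\) denotes the total number of articles; the resulting TF-IDF vectors are additionally normalized to unit Euclidean length (norm="l2") by default: \[ \mathrm{idf}(w) = \ln\left(\frac{1 + n}{1 + \mathrm{df}(w)}\right) + 1. \]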

Applying this ColumnTransformer to our train set follows the usual sklearn API. Each Transformer has fit and transform methods. Here, the first is used solely on the train set to fit the Transformer. Afterwards, the second may be used to transform both the train and test set.

columnwise_tfidf.fit(X_train)
X_train_transformed = columnwise_tfidf.transform(X_train)
In[4]

Conveniently, most transformers have a fit_transform method that combines these two steps into one:

X_train_transformed = columnwise_tfidf.fit_transform(X_train)
In[5]

Extracting keywords

Let us use the fitted transformers to extract keywords from articles. First, we will extract the vocabulary as determined by the TfidfVectorizers. To distinguish between the words from the Procedures and the Description, we will prepend each of them with a prefix.

def vocabulary():
    return (
        [f"proc__{name}" for name in columnwise_tfidf.named_transformers_["procedures"].get_feature_names()]
        + [f"desc__{name}" for name in columnwise_tfidf.named_transformers_["desc"].get_feature_names()]
    )
In[6]

Note that the names we have provided for the TfidfVectorizers earlier now come into play.

Second, let's write a function accepting an article and returning a DataFrame containing the words with the highest TF-IDF weights.

def extract_keywords(article, topn=10):
    article_transformed = columnwise_tfidf.transform(article).toarray()[0]
    frequencies = list(zip(vocabulary(), article_transformed))
    frequencies.sort(key=lambda x: -x[1])
    return pd.DataFrame(frequencies[:topn])
In[7]

Finally, let's extract keywords from one of the most iconic SCP articles: The one for SCP-682. This is one of the best examples of Keter class SCPs.

scp_682 = df.loc[df["Description"].str.startswith("SCP-682")].drop(columns=["Label"])
extract_keywords(scp_682)
In[8]
    0                     1
0   proc__682             0.767357
1   desc__kia             0.738121
2   desc__682             0.523255
3   desc__agent           0.171312
4   desc__personnel       0.156161
5   proc__speak           0.153737
6   proc__acid            0.144138
7   proc__to              0.133515
8   desc__pvt             0.110179
9   proc__scp             0.107281
Out[8]

This does not look too promising. First, numbers should probably be ignored. Then, there are words such as "to" and "of" that appear in almost every English article. "speak" might also not be telling us much. This will only get worse if we look at the top 30 keywords.

extract_keywords(scp_682, topn=30)
In[9]
    0                     1
0   proc__682             0.767357
1   desc__kia             0.738121
2   desc__682             0.523255
3   desc__agent           0.171312
4   desc__personnel       0.156161
5   proc__speak           0.153737
6   proc__acid            0.144138
7   proc__to              0.133515
8   desc__pvt             0.110179
9   proc__scp             0.107281
10  desc__handled         0.106319
11  proc__attempts        0.098297
12  proc__reacted         0.095920
13  desc__occurrence      0.095232
14  proc__incapacitation  0.091120
15  proc__of              0.090828
16  proc__fear            0.087715
17  proc__rage            0.087715
18  proc__hydrochloric    0.085073
19  proc__massive         0.085073
20  proc__frequent        0.082915
21  proc__provoking       0.082915
22  proc__breach          0.082463
23  desc__scp             0.081648
24  proc__should          0.080923
25  proc__lining          0.079510
26  proc__called          0.078116
27  proc__incapacitated   0.078116
28  proc__force           0.078011
29  proc__destroying      0.076869
Out[9]

Fine-tuning the TfidfVectorizer

Fortunately, TfidfVectorizer has a lot of options to fine-tune its behavior. First and maybe most importantly, we can enforce that certain words should be ignored via the stop_words keyword argument. It either expects the string "english", in which case a list compiled by the sklearn developers is used (with its own set of disadvantages), or a list of strings containing the words that shall be ignored. Second, we can specify a regex pattern via the token_pattern keyword argument. This pattern is used when parsing the articles to build up the vocabulary. The default pattern matches words consisting of letters and numbers; we will modify it to only match words consisting of letters.

columnwise_tfidf = ColumnTransformer(
    [
        (
            "procedures",
            TfidfVectorizer(
                stop_words="english",
                strip_accents='unicode',
                token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b',
            ),
            "Procedures"
        ),
        (
            "desc",
            TfidfVectorizer(
                stop_words="english",
                strip_accents='unicode',
                token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b'
            ),
            "Description"
        )
    ],
    remainder=StandardScaler()
)

columnwise_tfidf.fit(X_train)
In[10]
  ColumnTransformer(n_jobs=None,
                    remainder=StandardScaler(copy=True, with_mean=True,
                                             with_std=True),
                    sparse_threshold=0.3, transformer_weights=None,
                    transformers=[('procedures',
                                   TfidfVectorizer(analyzer='word', binary=False,
                                                   decode_error='strict',
                                                   dtype=<class 'numpy.float64'>,
                                                   encoding='utf-8',
                                                   input='content',
                                                   lowercase=True, max_df=1.0,
                                                   max_features=None, min_df=1...
                                                   dtype=<class 'numpy.float64'>,
                                                   encoding='utf-8',
                                                   input='content',
                                                   lowercase=True, max_df=1.0,
                                                   max_features=None, min_df=1,
                                                   ngram_range=(1, 1), norm='l2',
                                                   preprocessor=None,
                                                   smooth_idf=True,
                                                   stop_words='english',
                                                   strip_accents='unicode',
                                                   sublinear_tf=False,
                                                   token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b',
                                                   tokenizer=None, use_idf=True,
                                                   vocabulary=None),
                                   'Description')],
                    verbose=False)
Out[10]
extract_keywords(scp_682, topn=30)
In[11]
    0                     1
0   desc__kia             0.890278
1   proc__speak           0.272335
2   proc__acid            0.255331
3   desc__agent           0.206627
4   proc__scp             0.190041
5   desc__personnel       0.188352
6   proc__attempts        0.174127
7   proc__reacted         0.169915
8   proc__incapacitation  0.161413
9   proc__fear            0.155381
10  proc__rage            0.155381
11  proc__hydrochloric    0.150702
12  proc__massive         0.150702
13  proc__frequent        0.146879
14  proc__provoking       0.146879
15  proc__breach          0.146078
16  proc__lining          0.140847
17  proc__called          0.138377
18  proc__incapacitated   0.138377
19  proc__force           0.138192
20  proc__destroying      0.136168
21  proc__containment     0.135959
22  desc__pvt             0.132891
23  proc__difficulty      0.132345
24  proc__submerged       0.132345
25  proc__best            0.130666
26  desc__handled         0.128236
27  proc__chamber         0.126861
28  proc__plate           0.125041
29  proc__development     0.123843
Out[11]

This looks much better. A few remarks:

  • I had to google for the two abbreviations "kia" and "pvt". The first is the abbreviation for "killed in action" while the second stands for the military rank "Private".
  • On second thought, "speak" may contain the information that the SCP object is able to speak and thus might hint at it being sapient. As sapient SCPs are probably more likely to be of class euclid or keter, this could be valuable information for a Machine Learning model.
  • One could start building a custom list of stop words more suitable for parsing SCP articles. In the list above, the words "best" and "called" as well as "scp" could be ignored; a minimal sketch of how such a list could be passed follows below. I will postpone a thorough treatment to the next part of this series of posts: because some models give some insight into their learning process, we can use them to see whether their decisions are based on filler words.
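The following sketch (not yet part of the project code) shows one way to pass such a custom list, extending sklearn's built-in English stop word list; the added words are only an illustration and would need proper evaluation:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Extend the built-in English stop word list with (illustrative) SCP-specific filler words.
SCP_STOP_WORDS = list(ENGLISH_STOP_WORDS.union({"scp", "best", "called"}))

vectorizer = TfidfVectorizer(
    stop_words=SCP_STOP_WORDS,
    strip_accents="unicode",
    token_pattern="(?u)\\b[a-zA-Z][a-zA-Z]+\\b",
)
Sketch of a custom stop word list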

Conclusion

In this blog post, we have learned how to use Jupyter Notebooks and the pandas library to extract basic statistics from SCP articles. Furthermore, we have used a basic TF-IDF transformation to extract keywords from SCP articles.