Classifying SCPs, Part 2: Data transformation (TF-IDF) and preprocessing

After we have obtained data through the means described in the first part of this blog post series, it is time to deal with data transformations and data preprocessing. While humans can comprehend textual information in the form of articles, it is hard to digest for a Machine Learning algorithm. In this blog post, we will transform the textual information into certain vectors assigning a number to each word in the vocabulary of the set of articles: This is what the TF-IDF (Term Frequency - Inverse Document Frequency) is all about.

In comparison to the web crawler post, this one is more mathematical in nature. Instead of evaluating technical approaches and executing them in a test-driven manner, we will have to understand the mathematical background behind the algorithm to put it to good use.

To make the transition as gentle as possible, let us do a warm-up that is closer to the technical spirit of the last blog post: We use the text files produced by the web crawler from the last blog post to extract certain lengths of paragraphs and research in what way these help us to determine the Object Class of the article.

After we have understood how TF-IDF works, we can use it to transform our articles into TF-IDF vectors. Consequently, we will already be able to extract keywords from each article.

Warm-up: Extracting the paragraph lengths

To structure the text from the articles, let us use a custom class.

Basic version of the `Article` class

1class Article(object):
2    def __init__(self, label, name, procedures, desc):
3        self.label = label.strip()
4        self.name = name.strip()
5        self.procedures = procedures.strip()
6        self.desc = desc.strip()

src/data/article_data.py

The logic that splits up the text from the text files into attributes of the class will be a classmethod that accepts a list of lines of text and returns a readily constructed Article instance.

 1class Article(object):
 2    # __init__ omitted...
 3
 4    @classmethod
 5    def from_text(cls, lines):
 6        label, *text = lines
 7        text = "".join(text)
 8        name, rest = text.split("Special Containment Procedures:")
 9        procedures, desc = rest.split("Description:")
10        return cls(label, name, procedures, desc)

src/data/article_data.py

Here's a basic test that shows how to use the classmethod to obtain an Article instance.

 1from src.data.article_data import Article
 2
 3
 4def test_from_text():
 5    procedures = [
 6        "Special Containment Procedures: Something...",
 7        "Something part two...",
 8    ]
 9    description = "Description: Something else..."
10    article = Article.from_text(["SAFE", "Some name   ", *procedures, description])
11    assert article.label == "SAFE"
12    assert article.name == "Some name"
13    assert article.procedures == "Something...Something part two..."
14    assert article.desc == "Something else..."

tests/data/test_article_data.py

Validation of the label through a `@property`

As mentioned in the last part of this series, we are only concentrating on the labels SAFE, EUCLID and KETER. To account for this, we need to validate that the incoming label is one of those. We are a little more lenient and also accept labels that only start with one of those three labels.

Let us write tests first to define the desired behavior.

 1import pytest
 2from src.data.article_data import Article
 3
 4
 5@pytest.fixture
 6def article():
 7    return Article("SAFE", "", "", "")
 8
 9
10@pytest.mark.parametrize("label", ["SAFE", "EUCLID", "KETER"])
11def test_set_regular_label(article, label):
12    article.label = label
13    assert article.label == label
14    article.label = label + "SOMETHING"
15    assert article.label == label
16
17
18def test_set_unknown_label(article):
19    with pytest.raises(ValueError) as excinfo:
20        article.label = "unknown"
21    assert "unknown" in str(excinfo)

tests/data/test_article_data.py

In the tests above, we are using a fixture that gives us an initialized Article instance. Then, we are defining the regular behavior of the setter (we are expecting the label to accept the three main object classes as well as labels that start with those) and what happens when the setter encounters an unknown label (we are expecting a ValueError, enforced via the raises helper).

Because we have not written any validation for the label attribute yet, the tests fail. To account for these kinds of validations, Python has @property decorators that allow for custom getter and setter methods.

 1class Article(object):
 2
 3    ALLOWED_LABELS = ("SAFE", "EUCLID", "KETER")
 4
 5    # __init__ omitted...
 6
 7    @property
 8    def label(self):
 9        return self._label
10
11    @label.setter
12    def label(self, orig_label):
13        labels = [
14            label for label in self.ALLOWED_LABELS if orig_label.startswith(label)
15        ]
16        if not labels:
17            raise ValueError(f"Unknown label '{orig_label}'!")
18        self._label = labels.pop()

src/data/article_data.py

The Python interpreter calls the method decorated with @label.setter as soon as it encounters the line self.label = label in the __init__ method. As a result, code that uses this class has to deal with ValueErrors when constructing instances.

Adding a `to_dict` method

While the Article class is responsible for extracting information from the articles, it is much easier to use a simple dictionary when persisting extracted information. That is because the json library can serialize Python dictionaries directly; additionally, the pandas Data Science library is able to use dictionaries to construct their main object: a DataFrame. As a result, we need to write a to_dict method that turns an Article instance into a plain dictionary. Aside from the four attributes of the Article class, we also require the dictionary to contain the (character) lengths of the Procedures and the Descriptions as well as the Ratio of these two lengths.

 1def test_to_dict_trivial_article(article):
 2    d = article.to_dict()
 3    assert "Label" in d
 4    assert d["Label"] == "SAFE"
 5    assert "Name" in d
 6    assert "Procedures" in d
 7    assert "Description" in d
 8    assert "Procedures_Length" in d
 9    assert d["Procedures_Length"] == 0
10    assert "Description_Length" in d
11    assert d["Description_Length"] == 0
12    assert "Procedures_Description_Ratio" in d
13    assert d["Procedures_Description_Ratio"] == 0
14
15
16def test_to_dict(article):
17    article.name = "Test"
18    article.procedures = "TestTest"
19    article.desc = "Test"
20    d = article.to_dict()
21    assert "Label" in d
22    assert d["Label"] == "SAFE"
23    assert "Name" in d
24    assert d["Name"] == "Test"
25    assert "Procedures" in d
26    assert d["Procedures"] == "TestTest"
27    assert "Description" in d
28    assert d["Description"] == "Test"
29    assert "Procedures_Length" in d
30    assert d["Procedures_Length"] == 8
31    assert "Description_Length" in d
32    assert d["Description_Length"] == 4
33    assert "Procedures_Description_Ratio" in d
34    assert d["Procedures_Description_Ratio"] == 2

tests/data/test_article_data.py

The implementation is straightforward and uses a dictionary comprehension.

 1    def to_dict(self):
 2        return {
 3            "Label": self.label,
 4            "Name": self.name,
 5            "Procedures": self.procedures,
 6            "Description": self.desc,
 7            "Procedures_Length": len(self.procedures),
 8            "Description_Length": len(self.desc),
 9            "Procedures_Description_Ratio": len(self.procedures) / len(self.desc)
10            if len(self.desc) > 0
11            else 0,
12        }

src/data/article_data.py

Using the `Article` class to process the `txt` files

Finally, we want to use the Article class to process the text files. More precisely, we would like to aggregate the articles into a pandas DataFrame. This object has a to_json method that allows us to persist it for later introspection.

First, let us write a test to pin down our expectations.

 1import pandas as pd
 2from click.testing import CliRunner
 3from src.data.make_dataset import main
 4
 5TEST_DATA = {
 6    "scp-002.txt": """EUCLID\n
 7Item #: 002\n
 8Special Containment Procedures: Something something...\n
 9Description: Something else...\n
10""",
11    "scp-003.txt": """UNKNOWN\n
12Item #: 003\n
13Special Containment Procedures: Something something...\n
14Description: Something else...\n
15""",
16}
17
18
19def test_main():
20    runner = CliRunner()
21    with runner.isolated_filesystem():
22        for filename, text in TEST_DATA.items():
23            with open(filename, "w") as f:
24                f.write(text)
25        result = runner.invoke(main, [".", "."])
26        assert result.exit_code == 0
27        df = pd.read_json("data.json")
28        assert len(df.index) == 1
29        data = df.loc[0]
30        assert "Label" in data
31        assert data["Label"] == "EUCLID"
32        assert "Name" in data
33        assert data["Name"] == "Item #: 002"
34        assert "Procedures" in data
35        assert data["Procedures"] == "Something something..."
36        assert "Description" in data
37        assert data["Description"] == "Something else..."

tests/data/test_make_dataset.py

Here, we are using the dictionary TEST_DATA to write two files with two mock articles. The first is a regular article with a valid object class, the second one is an article we do not wish to process. As a result, we expect that only one article is present in the processed data. Note that we are using pandas' read_json method to obtain a DataFrame and, in turn, we are using DataFrame methods to assure that only one article is present and that the article data has been split up correctly.

To make this test pass we have to implement the following strategy:

Start with an empty DataFrame.
Parse each text file in the data/raw folder and turn it into an Article.
Use Article's to_dict method to append the data to the DataFrame.
Persist the DataFrame to a json file in the data/processed folder.

Here's a sketch implementation that uses the pathlib library's glob function to iterate through the text files.

 1import pandas as pd
 2from pathlib import Path
 3from src.data.article_data import Article
 4
 5
 6df = pd.DataFrame({})
 7for file in Path("data/raw").glob("scp-*.txt"):
 8    with file.open() as f:
 9        article = Article.from_text(f.readlines())
10    df = df.append(article.to_dict(), ignore_index=True)
11df.to_json("data/processed/data.json")

First sketch of the processing

From a software design perspective, this code leaves a lot to be desired. First, there are no log messages that will come in handy when things are off. Second, the paths are hard-coded and should be replaced by function parameters. Third and last, the Article's classmethod from_text throws a ValueError each time it encounters an article with an unknown object class. We have to deal with this kind of situation without letting the entire script fail.

Here's a revision of the sketch.

 1import click
 2import logging.config
 3import pandas as pd
 4from pathlib import Path
 5from src.data.article_data import Article
 6
 7
 8PROJECT_DIR = Path(__file__).resolve().parents[2]
 9logging.config.fileConfig(PROJECT_DIR / "logging_config.ini")
10logger = logging.getLogger(__name__)
11
12
13@click.command()
14@click.argument("input_filepath", type=click.Path(exists=True))
15@click.argument("output_filepath", type=click.Path())
16def main(input_filepath, output_filepath):
17    """ Runs data processing scripts to turn raw data from (../raw) into
18        cleaned data ready to be analyzed (saved in ../processed).
19    """
20    logger.info("making final data set from raw data")
21    df = pd.DataFrame({})
22    for file in Path(input_filepath).glob("scp-*.txt"):
23        logger.info("File: %s", file)
24        with file.open() as f:
25            try:
26                article = Article.from_text(f.readlines())
27            except ValueError as e:
28                logger.warning("ValueError in file %s: %s", file, e)
29                continue
30        df = df.append(article.to_dict(), ignore_index=True)
31    logger.info("DataFrame extracted. Writing to data.json in %s", output_filepath)
32    df.to_json(Path(output_filepath) / "data.json")
33    logger.info("Done.")
34
35
36if __name__ == "__main__":
37    main()

src/data/make_dataset.py

Note that we are emitting log warning messages whenever we encounter an unknown label but still continue with the processing.

Exercises

Just like in the last blog post, you can rapidly tackle an exercise by using git tags. For instance, if you want to tackle the first exercise, issue the command git tag ex-15 and start coding. If you want to compare your solution to mine, issue git diff sol-ex-15 when you have finished.

Git tag: `ex-15` beginner

Add a __repr__ method to the Article class.

Git tag: `ex-16` beginner

Add another test for the from_text method. The input shall be the same as in the test_from_text method except that you will leave out the name ("Some name "). Consequently, assert that the name of the resulting Article instance will be an empty string.

Git tag: `ex-17` intermediate

Unfortunately, there are some SCP articles that slightly diverge from the usual naming convention for their parts. For instance, SCP-524 has a Special Containment Procedure (singular!), SCP-2944 has Secure Containment Procedures, and SCP-931 consists of haiku paragraphs. While we could certainly be a little more thorough when parsing them, I will ignore them for the rest of this blog post series (I have encountered 130 warnings when parsing the first 3000 SCP articles which is less than 0,5% of incorrectly parsed articles). However, if you want to, feel free to optimize the parsing procedure. For starters, allow for the "Description" part to start with either "Description:" or "Summary:". Do not forget to write tests!

Git tag: `ex-18` intermediate

Raise RuntimeErrors whenever the Containment Procedures or the Description cannot be extracted. Catch these RuntimeErrors in make_dataset.py, log the error and continue with the for loop without adding the article to the DataFrame. Finally, add another test article in test_make_dataset.py with an unexpected beginning of the description and tests to test_article_data.py to make sure these RuntimeErrors are actually raised.

Quick analysis of the lengths

After we have extracted the (character) lengths of the two parts of the SCP articles, let us analyze them. We will use pandas to load the json file and compute some basic statistical measures.

Open jupyter lab (either by opening a terminal and issuing the command jupyter lab or by opening Anaconda, switching to the environment for the SCP project and open Jupyter Lab there), navigate to the notebooks folder of the SCP project and click the "+" icon above the folder breadcrumbs to fire up a new launcher.

/images/scp2_jupyterlab_launcher.png

In the opening Launcher tab, choose a Python 3 Notebook. Now you are all set up to experiment with data interactively. The following is a transcript of my Jupyter notebook session.

Computing statistics of the lengths

We want to check that all the transformations we have done so far are sane so that we can work with a cleaned up dataset.

1  import pandas as pd
2
3  df = pd.read_json("../data/processed/data.json")
4  df.head()

In[1]

	Description	Description_Length	Label	Name	Procedures	Procedures_Description_Ratio	Procedures_Length
0	SCP-1256 is a 24-page pamphlet entitled 'Bees …	1837	SAFE	Item #: SCP-1256	Mobile Task Force Zeta-4 ('Beekeepers') is cur…	0.224279	412
1	SCP-2987 is a modified MSI brand external hard…	2187	SAFE	Item #: SCP-2987	SCP-2987 is to be kept on floor 17 of Site-88….	0.203475	445
2	SCP-2039 collectively refers to two distinct f…	5399	EUCLID	Item #: SCP-2039	Presently, Foundation efforts at Research Faci…	0.368772	1991
3	SCP-1530 is a two-story abandoned house locate…	3893	EUCLID	Item #: SCP-1530	SCP-1530 is currently contained 120 meters fro…	0.201387	784
4	SCP-1524 is the sole remaining specimen of a s…	3211	EUCLID	Item #: SCP-1524	Both of SCP-1524's individual components are t…	0.530364	1703

Out[1]

Let's look at some statistics of the extracted text lengths and the ratio.

1  df.describe()

In[2]

	Description_Length	Procedures_Description_Ratio	Procedures_Length
count	2700.000000	2700.000000	2700.000000
mean	3208.542222	0.286840	777.595556
std	1658.345674	0.293568	519.808074
min	61.000000	0.000000	0.000000
25%	2104.750000	0.145726	414.750000
50%	2887.000000	0.229935	656.500000
75%	3957.000000	0.353646	994.250000
max	31618.000000	7.377049	7922.000000

Out[2]

Whereas count, mean, min and max are self-explanatory, std stands for standard deviation. The rows with percentages are the 25%-, 50%-, and 75%-quantiles, respectively. They were defined in my Blog post on means and medians. Here's a short refresher: The 25%-quantile is a value such that 25% of the data is smaller than or equal to it and the other 75% of the data is greater than or equal to it. The 50%-quantile is also known as the median.

The minimum of 61 characters in Description_Length looks reasonable but a Containment Procedure with 0 characters? This has to be investigated. Before we do so, let us look at the same statistics but grouped by each label.

1  df.groupby("Label").describe().stack()

In[3]

		Description_Length	Procedures_Description_Ratio	Procedures_Length
Label
EUCLID	count	1274.000000	1274.000000	1274.000000
	mean	3244.361852	0.308139	855.422292
	std	1701.660229	0.273383	529.896660
	min	428.000000	0.011165	148.000000
	25%	2179.250000	0.169438	497.250000
	50%	2935.500000	0.259065	727.000000
	75%	3977.750000	0.371186	1075.750000
	max	31618.000000	6.051948	7922.000000
KETER	count	314.000000	314.000000	314.000000
	mean	3380.487261	0.401208	1128.343949
	std	1694.007237	0.328462	605.260134
	min	233.000000	0.000000	0.000000
	25%	2243.000000	0.218239	683.250000
	50%	3197.500000	0.332694	1028.000000
	75%	4192.250000	0.486212	1449.750000
	max	10141.000000	3.781726	3449.000000
SAFE	count	1112.000000	1112.000000	1112.000000
	mean	3118.951439	0.230143	589.388489
	std	1592.721215	0.293088	392.807626
	min	61.000000	0.010626	64.000000
	25%	2003.000000	0.118879	326.000000
	50%	2791.500000	0.178353	488.500000
	75%	3860.750000	0.277565	730.500000
	max	12331.000000	7.377049	3680.000000

Out[3]

This is where it starts to get interesting! As safe SCPs are much easier to contain than euclid ones which in turn are easier to contain than keter SCPs, we expect that the Containment Procedures are easier to describe for safe ones and need more elaborate descriptions for keter ones. On average, this is reflected in the mean length of Containment Procedures (579 for safe, 833 for euclid and 1108 for keter).

Let us turn to the problematic cases of zero lengths.

1  df.loc[(df["Procedures_Length"] == 0) | (df["Description_Length"] == 0)]

In[4]

	Description	Description_Length	Label	Name	Procedures	Procedures_Description_Ratio	Procedures_Length
1340	SCP-1994 is the general designation for a set …	1376	KETER	Item #: SCP-1994		0.0	0

Out[4]

Thankfully, this is a single outlier. Investigating the article on the SCP Foundation web page and inspecting the html yields that the label "Special Containment Procedures" sits in its own p element so that we were not able to crawl this article correctly.

Let us ignore the outlier.

1  df = df.loc[df["Procedures_Length"] > 0]

In[5]

Finally, let us compute correlations between our features and the target. The correlation coefficient may be computed for number-valued random variables only. Fortunately, the nominal labels safe, euclid, and keter, carry ordinal information. That is to say, we can order them by their containment complexity. To make this even more explicit, let us assign numbers to the three labels. A safe label will be converted to -1, a euclid label to 0 and a keter label to 1 so that the order of the containment complexity is reflected by $\mathrm{safe} < \mathrm{euclid} < \mathrm{keter}$. However, the magnitude of this conversion is still open for discussion. We could also have choosen $10^{100}$ for keter and this would have influenced the correlation coefficients. But let's stick to our simple way of converting for now.

 1  COMPLEXITY = {
 2      "SAFE": -1,
 3      "EUCLID": 0,
 4      "KETER": 1
 5  }
 6
 7  def compute_complexity(label):
 8      return COMPLEXITY[label]
 9
10  df["Complexity"] = df["Label"].apply(compute_complexity)
11  df.corr()

In[6]

	Description_Length	Procedures_Description_Ratio	Procedures_Length	Complexity
Description_Length	1.000000	-0.293831	0.220675	0.052532
Procedures_Description_Ratio	-0.293831	1.000000	0.577548	0.188953
Procedures_Length	0.220675	0.577548	1.000000	0.344329
Complexity	0.052532	0.188953	0.344329	1.000000

Out[6]

As it turns out, Complexity and Procedures_Length are positively correlated which is precisely what we have observed through the statistics that we have grouped by label. We also see that Description_Length is only very weakly correlated with Complexity: That is to say that there is no reason why, say, a safe SCP should not have a long description or why a keter SCP could not be described in a short manner.

Mathematical background behind TF-IDF

Before we get to enjoy the ease of use of sklearn's Transformer API to apply the TF-IDF transformation, let's try and get some understanding of it first. The easiest way to turn words into numbers is to count them. As simple as this sounds, this idea is the corner stone of the TF-IDF transformation.

Word count vectors

Let me make this more precise. Assume that we have articles $A_1, \dotsc, A_n$. The vocabulary $\mathcal{V} = \mathcal{V}_{A_1, \dotsc, A_n}$ is the set of unique words occurring in those articles. The set of all of our articles will also be called the document. For any article $A$, the word count function is \[ \mathrm{wc}_{A}\colon \mathcal{V} \to \mathbb{N}, \, w \mapsto \text{Number of times $w$ occurs in $A$}. \]

Once we have fixed the vocabulary, it is possible to turn the word count functions into word count vectors. First, we merely need to decide on an order of the vocabulary—the alphabetic ordering is a canonical choice! As soon as we have ordered the vocabulary, we may write it as $\mathcal{V} = \{w_1, \dotsc, w_m\}$, where $m$ is the total number of words in the vocabulary. Finally, we declare that the word count vector of the article $A$ is \[ v_A = \bigl(\mathrm{wc}_A(w_1), \dotsc, \mathrm{wc}_A(w_m)\bigr). \]

Normalizing word counts

Instead of only talking about single words, we can more generally deal with terms. These can also be sequences of words that appear in our documents. Depending on how long those sequences are, they may carry information about the context that cannot be inferred from single words alone. Consequently, the word count vectors from before may more generally be called term count vectors (even though this does not seem to be standard language).

In general, vectors have the advantage of being comparable by using distance measures such as the Euclidian distance. Depending on the variety of articles and the precise application, however, there might be a problem. To make this concrete, let me illustrate this by a simple artificial example. Take the sentence "I am sure I am right about this" and transform it into a word count vector. Using alphabetic ordering, you should end up with the following.

word	about	am	I	right	sure	this
count	1	2	2	1	1	1

Let us add a second "article" to our document. The text consists of that same sentence twice. Thus, we obtain

word	about	am	I	right	sure	this
count	2	4	4	2	2	2

as its word count vector.

If you want to consider these two articles as similar or even the same, you will need to normalize the vectors. This means that before comparing two word count vectors you will divide them by their length. In this case, this approach will lead to the two articles being seen as the same. However, there are also reasons for wanting to tell these two articles apart: Even if they deal with the same point, the second one puts stronger emphasis on it by repetition.

To sum it up, depending on the application you might want to think about normalization of your word count vectors. In any case, the resulting word count vectors will be called term frequencies (even if you did not normalize) from now on. This concludes the first half of the TF-IDF transformation.

Inverse document frequency

Term frequency vectors suffer from another problem: Words that certainly occur in almost every English article such as "a", "the", and "is" but do not carry meaning will influence the similarity of articles. There are two ways to deal with this problem (in fact, they are often used in conjunction). The first is to simply ignore those words: Create a list of so-called stop words that will be ignored when building the vocabulary. The second way is to penalize words occurring in almost every article and boost rare words. This is precisely what inverse document frequency is doing.

Before we arrive at the precise definition, let us look at it from another angle. A word count is a measure that is local to a single article. This means that it does not depend on other articles in our document. If a single word count is high in that article, this might mean that this is an important word that potentially helps characterizing the article. However, if this word count is equally high in all the other articles then this word does not help us telling this article apart from the others (if everything is special then nothing is). Thus, there is the need for a trade-off between local measures (such as a single word count of a certain article) and global measures.

Inverse document frequency is such a global measure. To define it, let us concentrate on a single word $w$ in our vocabulary. The document frequency $\mathrm{df}(w)$ is the number of articles that $w$ appears in. Consequently, the inverse document frequency is $1/\mathrm{df}(w)$.

Now we are able to describe the TF-IDF transformation as a whole: For any article $A$ in the document, multiply each word count in its word count vector with the inverse document frequency. In formulae: \[ \mathrm{TFIDF}(A) = \left(\frac{\mathrm{wc}_A(w_1)}{\mathrm{df}(w_1)}, \dotsc, \frac{\mathrm{wc}_A(w_m)}{\mathrm{df}(w_m)}\right). \]

Applying the TF-IDF transformation (Transcript of jupyter notebook session)

Before we apply the TF-IDF transformation, it is obligatory to put aside some test data for evaluating our model later. Otherwise, a future Machine Learning model would have access to statistics of the entire dataset and may deduce statistics of the test dataset afterwards. However, the entire purpose of the train-test-split is to evaluate the model on data it has not seen before.

 1  import pandas as pd
 2
 3  df = pd.read_json("../data/processed/data.json")
 4  df = df.loc[df["Procedures_Length"] > 0, [
 5      "Label",
 6      "Procedures",
 7      "Description",
 8      "Procedures_Length",
 9      "Description_Length",
10      "Procedures_Description_Ratio"
11  ]]

In[1]

Making a train-test-split

With sklearn, splitting a DataFrame reduces to calling the train_test_split function from the model_selection module. The test_size argument determines the relative size of the test set.

1  from sklearn.model_selection import train_test_split
2
3  X, y = df.drop(columns=["Label"]), df["Label"]
4  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In[2]

Note that we split up our target column Label from the rest so that it will not be included in the following transformations.

Fitting `TfidfVectorizers`

Since we have two text columns (Procedures and Description), it is best to fit two =TfidfVectorizer=s so that all information contained in those two separately will be preserved. The rest of the features should be scaled as certain models encounter numerical problems when two features are on very different scales (that is to say one feature usually is very large, e.g. $\gg 10^6$, while another only attains values between 0 and 1). To do all of this in one go, sklearn provides us with a ColumnTransformer that takes a list of tuples consisting of a column name and a transformer that should transform the corresponding column. Additionally, the ColumnTransformer's remainder keyword argument may be another transformer that will be applied to the remaining columns. Here's how to use it:

 1  from sklearn.feature_extraction.text import TfidfVectorizer
 2  from sklearn.compose import ColumnTransformer
 3  from sklearn.preprocessing import StandardScaler
 4
 5
 6  columnwise_tfidf = ColumnTransformer(
 7      [
 8          (
 9              "procedures",
10              TfidfVectorizer(),
11              "Procedures"
12          ),
13          (
14              "desc",
15              TfidfVectorizer(),
16              "Description"
17          )
18      ],
19      remainder=StandardScaler(),
20      n_jobs=-1,
21  )

In[3]

First, the first item in the tuple is a name for the transformation for later reference. Second, the TfidfVectorizer with standard arguments constructs the TF-IDF vectors in almost the same way that I explained it in the Blog Post accompanying this part of the project. The only difference is that the document frequency of each word is increased by one to prevent zero divisions. Third and last, the StandardScaler scales the remaining features such that they have zero mean and unit standard deviation.

Applying this ColumnTransformer to our train set follows the usual sklearn API. Each Transformer has fit and transform methods. Here, the first is used solely on the train set to fit the Transformer. Afterwards, the second may be used to transform both the train and test set.

1  columnwise_tfidf.fit(X_train)
2  X_train_transformed = columnwise_tfidf.transform(X_train)

In[4]

Conveniently, most transformers have a fit_transform method that combines these two steps into one:

1  X_train_transformed = columnwise_tfidf.fit_transform(X_train)

In[5]

Extracting keywords

Let us use the fitted transformers to extract keywords from articles. First, we will extract the vocabulary as determined by the =TfidfVectorizer=s. To distinguish between the words from the Procedures and the Description, we will prepend each of them with a prefix.

1  def vocabulary():
2      return (
3          [f"proc__{name}" for name in columnwise_tfidf.named_transformers_["procedures"].get_feature_names()]
4          + [f"desc__{name}" for name in columnwise_tfidf.named_transformers_["desc"].get_feature_names()]
5      )

In[6]

Note that the names we have provided for the =TfidfVectorizer=s earlier now come into play.

Second, let's write a function accepting an article and returning a DataFrame containing the words with the highest frequencies.

1  def extract_keywords(article, topn=10):
2      article_transformed = columnwise_tfidf.transform(article).toarray()[0]
3      frequencies = list(zip(vocabulary(), article_transformed))
4      frequencies.sort(key=lambda x: -x[1])
5      return pd.DataFrame(frequencies[:topn])

In[7]

Finally, let's extract keywords from one of the most iconic SCP articles: The one for SCP-682. This is one of the best examples of Keter class SCPs.

1  scp_682 = df.loc[df["Description"].str.startswith("SCP-682")].drop(columns=["Label"])
2  extract_keywords(scp_682)

In[8]

	0	1
0	proc__682	0.767357
1	desc__kia	0.738121
2	desc__682	0.523255
3	desc__agent	0.171312
4	desc__personnel	0.156161
5	proc__speak	0.153737
6	proc__acid	0.144138
7	proc__to	0.133515
8	desc__pvt	0.110179
9	proc__scp	0.107281

Out[8]

This does not look too promising. First, maybe numbers should be ignored. Then, there are words "to", "of" appearing in almost every article in english. "speak" might also not be telling much. This will only get worse if we look at the top 30 keywords.

1  extract_keywords(scp_682, topn=30)

In[9]

	0	1
0	proc__682	0.767357
1	desc__kia	0.738121
2	desc__682	0.523255
3	desc__agent	0.171312
4	desc__personnel	0.156161
5	proc__speak	0.153737
6	proc__acid	0.144138
7	proc__to	0.133515
8	desc__pvt	0.110179
9	proc__scp	0.107281
10	desc__handled	0.106319
11	proc__attempts	0.098297
12	proc__reacted	0.095920
13	desc__occurrence	0.095232
14	proc__incapacitation	0.091120
15	proc__of	0.090828
16	proc__fear	0.087715
17	proc__rage	0.087715
18	proc__hydrochloric	0.085073
19	proc__massive	0.085073
20	proc__frequent	0.082915
21	proc__provoking	0.082915
22	proc__breach	0.082463
23	desc__scp	0.081648
24	proc__should	0.080923
25	proc__lining	0.079510
26	proc__called	0.078116
27	proc__incapacitated	0.078116
28	proc__force	0.078011
29	proc__destroying	0.076869

Out[9]

Fine-tuning the `TfidfVectorizer`

Fortunately, TfidfVectorizer has a lot of options to fine-tune its behavior. First and maybe most importantly, we can enforce that certain words should be ignored via the stop_words keyword argument. It either expects the string "english" and then uses a list constructed by the sklearn developers (with its own set of disadvantages) or it expects a list of strings containing the words that shall be ignored. Second, we can specify a regex pattern via the token_pattern keyword argument. This pattern will be used when parsing the articles to build up the vocabulary. The standard pattern includes single words containing letters and numbers; we will modify it to only parse for words containing letters.

 1  columnwise_tfidf = ColumnTransformer(
 2      [
 3          (
 4              "procedures",
 5              TfidfVectorizer(
 6                  stop_words="english",
 7                  strip_accents='unicode',
 8                  token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b',
 9              ),
10              "Procedures"
11          ),
12          (
13              "desc",
14              TfidfVectorizer(
15                  stop_words="english",
16                  strip_accents='unicode',
17                  token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b'
18              ),
19              "Description"
20          )
21      ],
22      remainder=StandardScaler()
23  )
24
25  columnwise_tfidf.fit(X_train)

In[10]

  ColumnTransformer(n_jobs=None,
                    remainder=StandardScaler(copy=True, with_mean=True,
                                             with_std=True),
                    sparse_threshold=0.3, transformer_weights=None,
                    transformers=[('procedures',
                                   TfidfVectorizer(analyzer='word', binary=False,
                                                   decode_error='strict',
                                                   dtype=<class 'numpy.float64'>,
                                                   encoding='utf-8',
                                                   input='content',
                                                   lowercase=True, max_df=1.0,
                                                   max_features=None, min_df=1...
                                                   dtype=<class 'numpy.float64'>,
                                                   encoding='utf-8',
                                                   input='content',
                                                   lowercase=True, max_df=1.0,
                                                   max_features=None, min_df=1,
                                                   ngram_range=(1, 1), norm='l2',
                                                   preprocessor=None,
                                                   smooth_idf=True,
                                                   stop_words='english',
                                                   strip_accents='unicode',
                                                   sublinear_tf=False,
                                                   token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b',
                                                   tokenizer=None, use_idf=True,
                                                   vocabulary=None),
                                   'Description')],
                    verbose=False)

Out[10]

1  extract_keywords(scp_682, topn=30)

In[11]

	0	1
0	desc__kia	0.890278
1	proc__speak	0.272335
2	proc__acid	0.255331
3	desc__agent	0.206627
4	proc__scp	0.190041
5	desc__personnel	0.188352
6	proc__attempts	0.174127
7	proc__reacted	0.169915
8	proc__incapacitation	0.161413
9	proc__fear	0.155381
10	proc__rage	0.155381
11	proc__hydrochloric	0.150702
12	proc__massive	0.150702
13	proc__frequent	0.146879
14	proc__provoking	0.146879
15	proc__breach	0.146078
16	proc__lining	0.140847
17	proc__called	0.138377
18	proc__incapacitated	0.138377
19	proc__force	0.138192
20	proc__destroying	0.136168
21	proc__containment	0.135959
22	desc__pvt	0.132891
23	proc__difficulty	0.132345
24	proc__submerged	0.132345
25	proc__best	0.130666
26	desc__handled	0.128236
27	proc__chamber	0.126861
28	proc__plate	0.125041
29	proc__development	0.123843

Out[11]

This looks much better. A few remarks:

I had to google for the two abbreviations "kia" and "pvt". The first is the abbreviation for "killed in action" while the second stands for the military rank "Private".
On second thought, "speak" may contain the information that the SCP object is able to speak and, thusly, might hint at it being sapient. As sapient SCPs are probably more likely to be of class euclid or keter, this could be valuable information for a Machine Learning model.
One could start building a custom list of stop words more suitable for parsing SCP articles. In the list above, the words "best" and "called" as well as "scp" could be ignored. I will postpone this to the next part of this series of posts. Because some models give some insight in their learning process, we can use them to see if their decisions are based on filler words.

Conclusion

In this blog post, we have learned how to use Jupyter Notebooks and the pandas library to extract basic statistics from SCP articles. Furthermore, we have used a basic TF-IDF transformation to extract keywords from SCP articles.

Warm-up: Extracting the paragraph lengths

Basic version of the Article class

Validation of the label through a @property

Adding a to_dict method

Using the Article class to process the txt files

Exercises

Git tag: ex-15 beginner

Git tag: ex-16 beginner

Git tag: ex-17 intermediate

Git tag: ex-18 intermediate

Quick analysis of the lengths

Computing statistics of the lengths

Mathematical background behind TF-IDF

Word count vectors

Normalizing word counts

Inverse document frequency

Applying the TF-IDF transformation (Transcript of jupyter notebook session)

Making a train-test-split

Fitting TfidfVectorizers

Extracting keywords

Fine-tuning the TfidfVectorizer

Conclusion

Basic version of the `Article` class

Validation of the label through a `@property`

Adding a `to_dict` method

Using the `Article` class to process the `txt` files

Git tag: `ex-15` beginner

Git tag: `ex-16` beginner

Git tag: `ex-17` intermediate

Git tag: `ex-18` intermediate

Fitting `TfidfVectorizers`

Fine-tuning the `TfidfVectorizer`