After we have obtained data through the means described in the first part of this blog post series, it is time to deal with data transformations and data preprocessing. While humans can comprehend textual information in the form of articles, it is hard to digest for a Machine Learning algorithm. In this blog post, we will transform the textual information into certain vectors assigning a number to each word in the vocabulary of the set of articles: This is what the TF-IDF (Term Frequency - Inverse Document Frequency) is all about.
In comparison to the web crawler post, this one is more mathematical in nature. Instead of evaluating technical approaches and executing them in a test-driven manner, we will have to understand the mathematical background behind the algorithm to put it to good use.
To make the transition as gentle as possible, let us do a warm-up that is closer to the technical spirit of the last blog post: We use the text files produced by the web crawler from the last blog post to extract certain lengths of paragraphs and research in what way these help us to determine the Object Class of the article.
After we have understood how TF-IDF works, we can use it to transform our articles into TF-IDF vectors. Consequently, we will already be able to extract keywords from each article.
Warm-up: Extracting the paragraph lengths
To structure the text from the articles, let us use a custom class.
Basic version of the Article
class
The logic that splits up the text from the text files into attributes of the
class will be a classmethod
that accepts a list of lines of text and returns a
readily constructed Article
instance.
Here's a basic test that shows how to use the classmethod
to obtain an
Article
instance.
Validation of the label through a @property
As mentioned in the last part of this series, we are only concentrating on the labels SAFE, EUCLID and KETER. To account for this, we need to validate that the incoming label is one of those. We are a little more lenient and also accept labels that only start with one of those three labels.
Let us write tests first to define the desired behavior.
In the tests above, we are using a fixture
that gives us an
initialized Article
instance. Then, we are defining the regular behavior of
the setter (we are expecting the label to accept the three main object classes
as well as labels that start with those) and what happens when the setter
encounters an unknown label (we are expecting a ValueError
, enforced via the
raises
helper).
Because we have not written any validation for the label
attribute yet, the
tests fail. To account for these kinds of validations, Python has @property
decorators that allow for custom getter and setter methods.
The Python interpreter calls the method decorated with @label.setter
as soon
as it encounters the line self.label = label
in the __init__
method.
As a result, code that uses this class has to deal with ValueErrors
when
constructing instances.
Adding a to_dict
method
While the Article
class is responsible for extracting information from the
articles, it is much easier to use a simple dictionary when persisting extracted
information. That is because the json
library can serialize Python
dictionaries directly; additionally, the pandas
Data Science library is able
to use dictionaries to construct their main object: a DataFrame
.
As a result, we need to write a to_dict
method that turns an Article
instance into a plain dictionary. Aside from the four attributes of the
Article
class, we also require the dictionary to contain the (character)
lengths of the Procedures and the Descriptions as well as the Ratio of these two lengths.
The implementation is straightforward and uses a dictionary comprehension.
Using the Article
class to process the txt
files
Finally, we want to use the Article
class to process the text files. More
precisely, we would like to aggregate the articles into a pandas DataFrame
.
This object has a to_json
method that allows us to persist it for later introspection.
First, let us write a test to pin down our expectations.
Here, we are using the dictionary TEST_DATA
to write two files with two mock
articles. The first is a regular article with a valid object class, the second
one is an article we do not wish to process. As a result, we expect that only
one article is present in the processed data. Note that we are using
pandas' read_json
method to obtain a DataFrame
and, in turn, we are using
DataFrame
methods to assure that only one article is present and that the
article data has been split up correctly.
To make this test pass we have to implement the following strategy:
- Start with an empty
DataFrame
. - Parse each text file in the
data/raw
folder and turn it into anArticle
. - Use
Article
'sto_dict
method to append the data to theDataFrame
. - Persist the
DataFrame
to ajson
file in thedata/processed
folder.
Here's a sketch implementation that uses the pathlib
library's glob function to iterate
through the text files.
From a software design perspective, this code leaves a lot to be desired. First,
there are no log messages that will come in handy when things are off. Second,
the paths are hard-coded and should be replaced by function parameters. Third
and last, the Article
's classmethod from_text
throws a ValueError
each
time it encounters an article with an unknown object class. We have to deal with
this kind of situation without letting the entire script fail.
Here's a revision of the sketch.
Note that we are emitting log warning messages whenever we encounter an unknown label but still continue with the processing.
Exercises
Just like in the last blog post, you can rapidly tackle an exercise by using git
tags. For instance, if you want to tackle the first exercise, issue the command
git tag ex-15
and start coding. If you want to compare your solution to mine,
issue git diff sol-ex-15
when you have finished.
Git tag: ex-15
Add a __repr__
method to the Article
class.
Git tag: ex-16
Add another test for the from_text
method. The input shall be the same as in the
test_from_text
method except that you will leave out the name ("Some name ").
Consequently, assert
that the name of the resulting Article
instance
will be an empty string.
Git tag: ex-17
Unfortunately, there are some SCP articles that slightly diverge from the usual naming convention for their parts. For instance, SCP-524 has a Special Containment Procedure (singular!), SCP-2944 has Secure Containment Procedures, and SCP-931 consists of haiku paragraphs. While we could certainly be a little more thorough when parsing them, I will ignore them for the rest of this blog post series (I have encountered 130 warnings when parsing the first 3000 SCP articles which is less than 0,5% of incorrectly parsed articles). However, if you want to, feel free to optimize the parsing procedure. For starters, allow for the "Description" part to start with either "Description:" or "Summary:". Do not forget to write tests!
Git tag: ex-18
Raise RuntimeErrors
whenever the Containment Procedures or the Description
cannot be extracted. Catch these RuntimeErrors
in make_dataset.py
, log the
error and continue with the for
loop without adding the article to the
DataFrame
. Finally, add another test article in test_make_dataset.py
with an
unexpected beginning of the description and tests to test_article_data.py
to
make sure these RuntimeErrors
are actually raised.
Quick analysis of the lengths
After we have extracted the (character) lengths of the two parts of the SCP
articles, let us analyze them. We will use pandas
to load the json
file and
compute some basic statistical measures.
Open jupyter lab
(either by opening a terminal and issuing the command
jupyter lab
or by opening Anaconda, switching to the environment for the SCP
project and open Jupyter Lab there), navigate to the notebooks
folder of the
SCP project and click the "+" icon above the folder breadcrumbs to fire up a new launcher.
In the opening Launcher tab, choose a Python 3 Notebook. Now you are all set up to experiment with data interactively. The following is a transcript of my Jupyter notebook session.
Computing statistics of the lengths
We want to check that all the transformations we have done so far are sane so that we can work with a cleaned up dataset.
Description | Description_Length | Label | Name | Procedures | Procedures_Description_Ratio | Procedures_Length | |
---|---|---|---|---|---|---|---|
0 | SCP-1256 is a 24-page pamphlet entitled 'Bees … | 1837 | SAFE | Item #: SCP-1256 | Mobile Task Force Zeta-4 ('Beekeepers') is cur… | 0.224279 | 412 |
1 | SCP-2987 is a modified MSI brand external hard… | 2187 | SAFE | Item #: SCP-2987 | SCP-2987 is to be kept on floor 17 of Site-88…. | 0.203475 | 445 |
2 | SCP-2039 collectively refers to two distinct f… | 5399 | EUCLID | Item #: SCP-2039 | Presently, Foundation efforts at Research Faci… | 0.368772 | 1991 |
3 | SCP-1530 is a two-story abandoned house locate… | 3893 | EUCLID | Item #: SCP-1530 | SCP-1530 is currently contained 120 meters fro… | 0.201387 | 784 |
4 | SCP-1524 is the sole remaining specimen of a s… | 3211 | EUCLID | Item #: SCP-1524 | Both of SCP-1524's individual components are t… | 0.530364 | 1703 |
Let's look at some statistics of the extracted text lengths and the ratio.
Description_Length | Procedures_Description_Ratio | Procedures_Length | |
---|---|---|---|
count | 2700.000000 | 2700.000000 | 2700.000000 |
mean | 3208.542222 | 0.286840 | 777.595556 |
std | 1658.345674 | 0.293568 | 519.808074 |
min | 61.000000 | 0.000000 | 0.000000 |
25% | 2104.750000 | 0.145726 | 414.750000 |
50% | 2887.000000 | 0.229935 | 656.500000 |
75% | 3957.000000 | 0.353646 | 994.250000 |
max | 31618.000000 | 7.377049 | 7922.000000 |
Whereas count, mean, min and max are self-explanatory, std stands for standard deviation. The rows with percentages are the 25%-, 50%-, and 75%-quantiles, respectively. They were defined in my Blog post on means and medians. Here's a short refresher: The 25%-quantile is a value such that 25% of the data is smaller than or equal to it and the other 75% of the data is greater than or equal to it. The 50%-quantile is also known as the median.
The minimum of 61 characters in Description_Length looks reasonable but a Containment Procedure with 0 characters? This has to be investigated. Before we do so, let us look at the same statistics but grouped by each label.
Description_Length | Procedures_Description_Ratio | Procedures_Length | ||
---|---|---|---|---|
Label | ||||
EUCLID | count | 1274.000000 | 1274.000000 | 1274.000000 |
mean | 3244.361852 | 0.308139 | 855.422292 | |
std | 1701.660229 | 0.273383 | 529.896660 | |
min | 428.000000 | 0.011165 | 148.000000 | |
25% | 2179.250000 | 0.169438 | 497.250000 | |
50% | 2935.500000 | 0.259065 | 727.000000 | |
75% | 3977.750000 | 0.371186 | 1075.750000 | |
max | 31618.000000 | 6.051948 | 7922.000000 | |
KETER | count | 314.000000 | 314.000000 | 314.000000 |
mean | 3380.487261 | 0.401208 | 1128.343949 | |
std | 1694.007237 | 0.328462 | 605.260134 | |
min | 233.000000 | 0.000000 | 0.000000 | |
25% | 2243.000000 | 0.218239 | 683.250000 | |
50% | 3197.500000 | 0.332694 | 1028.000000 | |
75% | 4192.250000 | 0.486212 | 1449.750000 | |
max | 10141.000000 | 3.781726 | 3449.000000 | |
SAFE | count | 1112.000000 | 1112.000000 | 1112.000000 |
mean | 3118.951439 | 0.230143 | 589.388489 | |
std | 1592.721215 | 0.293088 | 392.807626 | |
min | 61.000000 | 0.010626 | 64.000000 | |
25% | 2003.000000 | 0.118879 | 326.000000 | |
50% | 2791.500000 | 0.178353 | 488.500000 | |
75% | 3860.750000 | 0.277565 | 730.500000 | |
max | 12331.000000 | 7.377049 | 3680.000000 |
This is where it starts to get interesting! As safe SCPs are much easier to contain than euclid ones which in turn are easier to contain than keter SCPs, we expect that the Containment Procedures are easier to describe for safe ones and need more elaborate descriptions for keter ones. On average, this is reflected in the mean length of Containment Procedures (579 for safe, 833 for euclid and 1108 for keter).
Let us turn to the problematic cases of zero lengths.
Description | Description_Length | Label | Name | Procedures | Procedures_Description_Ratio | Procedures_Length | |
---|---|---|---|---|---|---|---|
1340 | SCP-1994 is the general designation for a set … | 1376 | KETER | Item #: SCP-1994 | 0.0 | 0 |
Thankfully, this is a single outlier. Investigating the article on the
SCP Foundation web page and inspecting the html yields that the label
"Special Containment Procedures" sits in its own p
element so that we
were not able to crawl this article correctly.
Let us ignore the outlier.
Finally, let us compute correlations between our features and the target. The correlation coefficient may be computed for number-valued random variables only. Fortunately, the nominal labels safe, euclid, and keter, carry ordinal information. That is to say, we can order them by their containment complexity. To make this even more explicit, let us assign numbers to the three labels. A safe label will be converted to -1, a euclid label to 0 and a keter label to 1 so that the order of the containment complexity is reflected by $\mathrm{safe} < \mathrm{euclid} < \mathrm{keter}$. However, the magnitude of this conversion is still open for discussion. We could also have choosen $10^{100}$ for keter and this would have influenced the correlation coefficients. But let's stick to our simple way of converting for now.
Description_Length | Procedures_Description_Ratio | Procedures_Length | Complexity | |
---|---|---|---|---|
Description_Length | 1.000000 | -0.293831 | 0.220675 | 0.052532 |
Procedures_Description_Ratio | -0.293831 | 1.000000 | 0.577548 | 0.188953 |
Procedures_Length | 0.220675 | 0.577548 | 1.000000 | 0.344329 |
Complexity | 0.052532 | 0.188953 | 0.344329 | 1.000000 |
As it turns out, Complexity and Procedures_Length are positively correlated which is precisely what we have observed through the statistics that we have grouped by label. We also see that Description_Length is only very weakly correlated with Complexity: That is to say that there is no reason why, say, a safe SCP should not have a long description or why a keter SCP could not be described in a short manner.
Mathematical background behind TF-IDF
Before we get to enjoy the ease of use of sklearn
's Transformer
API to apply
the TF-IDF transformation, let's try and get some understanding of it first.
The easiest way to turn words into numbers is to count them. As simple as this
sounds, this idea is the corner stone of the TF-IDF transformation.
Word count vectors
Let me make this more precise. Assume that we have articles \(A_1, \dotsc, A_n\). The vocabulary \(\mathcal{V} = \mathcal{V}_{A_1, \dotsc, A_n}\) is the set of unique words occurring in those articles. The set of all of our articles will also be called the document. For any article \(A\), the word count function is \[ \mathrm{wc}_{A}\colon \mathcal{V} \to \mathbb{N}, \, w \mapsto \text{Number of times \(w\) occurs in \(A\)}. \]
Once we have fixed the vocabulary, it is possible to turn the word count functions into word count vectors. First, we merely need to decide on an order of the vocabulary—the alphabetic ordering is a canonical choice! As soon as we have ordered the vocabulary, we may write it as \(\mathcal{V} = \{w_1, \dotsc, w_m\}\), where \(m\) is the total number of words in the vocabulary. Finally, we declare that the word count vector of the article \(A\) is \[ v_A = \bigl(\mathrm{wc}_A(w_1), \dotsc, \mathrm{wc}_A(w_m)\bigr). \]
Normalizing word counts
Instead of only talking about single words, we can more generally deal with terms. These can also be sequences of words that appear in our documents. Depending on how long those sequences are, they may carry information about the context that cannot be inferred from single words alone. Consequently, the word count vectors from before may more generally be called term count vectors (even though this does not seem to be standard language).
In general, vectors have the advantage of being comparable by using distance measures such as the Euclidian distance. Depending on the variety of articles and the precise application, however, there might be a problem. To make this concrete, let me illustrate this by a simple artificial example. Take the sentence "I am sure I am right about this" and transform it into a word count vector. Using alphabetic ordering, you should end up with the following.
word | about | am | I | right | sure | this |
---|---|---|---|---|---|---|
count | 1 | 2 | 2 | 1 | 1 | 1 |
Let us add a second "article" to our document. The text consists of that same sentence twice. Thus, we obtain
word | about | am | I | right | sure | this |
---|---|---|---|---|---|---|
count | 2 | 4 | 4 | 2 | 2 | 2 |
as its word count vector.
If you want to consider these two articles as similar or even the same, you will need to normalize the vectors. This means that before comparing two word count vectors you will divide them by their length. In this case, this approach will lead to the two articles being seen as the same. However, there are also reasons for wanting to tell these two articles apart: Even if they deal with the same point, the second one puts stronger emphasis on it by repetition.
To sum it up, depending on the application you might want to think about normalization of your word count vectors. In any case, the resulting word count vectors will be called term frequencies (even if you did not normalize) from now on. This concludes the first half of the TF-IDF transformation.
Inverse document frequency
Term frequency vectors suffer from another problem: Words that certainly occur in almost every English article such as "a", "the", and "is" but do not carry meaning will influence the similarity of articles. There are two ways to deal with this problem (in fact, they are often used in conjunction). The first is to simply ignore those words: Create a list of so-called stop words that will be ignored when building the vocabulary. The second way is to penalize words occurring in almost every article and boost rare words. This is precisely what inverse document frequency is doing.
Before we arrive at the precise definition, let us look at it from another angle. A word count is a measure that is local to a single article. This means that it does not depend on other articles in our document. If a single word count is high in that article, this might mean that this is an important word that potentially helps characterizing the article. However, if this word count is equally high in all the other articles then this word does not help us telling this article apart from the others (if everything is special then nothing is). Thus, there is the need for a trade-off between local measures (such as a single word count of a certain article) and global measures.
Inverse document frequency is such a global measure. To define it, let us concentrate on a single word \(w\) in our vocabulary. The document frequency \(\mathrm{df}(w)\) is the number of articles that \(w\) appears in. Consequently, the inverse document frequency is \(1/\mathrm{df}(w)\).
Now we are able to describe the TF-IDF transformation as a whole: For any article \(A\) in the document, multiply each word count in its word count vector with the inverse document frequency. In formulae: \[ \mathrm{TFIDF}(A) = \left(\frac{\mathrm{wc}_A(w_1)}{\mathrm{df}(w_1)}, \dotsc, \frac{\mathrm{wc}_A(w_m)}{\mathrm{df}(w_m)}\right). \]
Applying the TF-IDF transformation (Transcript of jupyter notebook session)
Before we apply the TF-IDF transformation, it is obligatory to put aside some test data for evaluating our model later. Otherwise, a future Machine Learning model would have access to statistics of the entire dataset and may deduce statistics of the test dataset afterwards. However, the entire purpose of the train-test-split is to evaluate the model on data it has not seen before.
Making a train-test-split
With sklearn
, splitting a DataFrame
reduces to calling the
train_test_split
function from the model_selection
module. The
test_size
argument determines the relative size of the test set.
Note that we split up our target column Label
from the rest so that it
will not be included in the following transformations.
Fitting TfidfVectorizers
Since we have two text columns (Procedures
and Description
), it is
best to fit two =TfidfVectorizer=s so that all information contained in
those two separately will be preserved. The rest of the features should
be scaled as certain models encounter numerical problems when two
features are on very different scales (that is to say one feature
usually is very large, e.g. $\gg 10^6$, while another only attains
values between 0 and 1). To do all of this in one go, sklearn
provides
us with a
ColumnTransformer
that takes a list of tuples consisting of a column name and a
transformer that should transform the corresponding column.
Additionally, the ColumnTransformer
's remainder
keyword argument may
be another transformer that will be applied to the remaining columns.
Here's how to use it:
First, the first item in the tuple is a name for the transformation for
later reference. Second, the
TfidfVectorizer
with standard arguments constructs the TF-IDF vectors in almost the same
way that I explained it in the Blog Post accompanying this part of the
project. The only difference is that the document frequency of each word
is increased by one to prevent zero divisions. Third and last, the
StandardScaler
scales the remaining features such that they have zero mean and unit
standard deviation.
Applying this ColumnTransformer
to our train set follows the usual
sklearn
API. Each Transformer
has fit
and transform
methods.
Here, the first is used solely on the train set to fit the
Transformer
. Afterwards, the second may be used to transform both the
train and test set.
Conveniently, most transformers have a fit_transform
method that
combines these two steps into one:
Extracting keywords
Let us use the fitted transformers to extract keywords from articles. First, we will extract the vocabulary as determined by the =TfidfVectorizer=s. To distinguish between the words from the Procedures and the Description, we will prepend each of them with a prefix.
Note that the names we have provided for the =TfidfVectorizer=s earlier now come into play.
Second, let's write a function accepting an article and returning a
DataFrame
containing the words with the highest frequencies.
Finally, let's extract keywords from one of the most iconic SCP articles: The one for SCP-682. This is one of the best examples of Keter class SCPs.
0 | 1 | |
---|---|---|
0 | proc__682 | 0.767357 |
1 | desc__kia | 0.738121 |
2 | desc__682 | 0.523255 |
3 | desc__agent | 0.171312 |
4 | desc__personnel | 0.156161 |
5 | proc__speak | 0.153737 |
6 | proc__acid | 0.144138 |
7 | proc__to | 0.133515 |
8 | desc__pvt | 0.110179 |
9 | proc__scp | 0.107281 |
This does not look too promising. First, maybe numbers should be ignored. Then, there are words "to", "of" appearing in almost every article in english. "speak" might also not be telling much. This will only get worse if we look at the top 30 keywords.
0 | 1 | |
---|---|---|
0 | proc__682 | 0.767357 |
1 | desc__kia | 0.738121 |
2 | desc__682 | 0.523255 |
3 | desc__agent | 0.171312 |
4 | desc__personnel | 0.156161 |
5 | proc__speak | 0.153737 |
6 | proc__acid | 0.144138 |
7 | proc__to | 0.133515 |
8 | desc__pvt | 0.110179 |
9 | proc__scp | 0.107281 |
10 | desc__handled | 0.106319 |
11 | proc__attempts | 0.098297 |
12 | proc__reacted | 0.095920 |
13 | desc__occurrence | 0.095232 |
14 | proc__incapacitation | 0.091120 |
15 | proc__of | 0.090828 |
16 | proc__fear | 0.087715 |
17 | proc__rage | 0.087715 |
18 | proc__hydrochloric | 0.085073 |
19 | proc__massive | 0.085073 |
20 | proc__frequent | 0.082915 |
21 | proc__provoking | 0.082915 |
22 | proc__breach | 0.082463 |
23 | desc__scp | 0.081648 |
24 | proc__should | 0.080923 |
25 | proc__lining | 0.079510 |
26 | proc__called | 0.078116 |
27 | proc__incapacitated | 0.078116 |
28 | proc__force | 0.078011 |
29 | proc__destroying | 0.076869 |
Fine-tuning the TfidfVectorizer
Fortunately, TfidfVectorizer
has a lot of options to fine-tune its
behavior. First and maybe most importantly, we can enforce that certain
words should be ignored via the stop_words
keyword argument. It either
expects the string "english" and then uses a list constructed by the
sklearn
developers (with its own set of disadvantages) or it expects a
list of strings containing the words that shall be ignored. Second, we
can specify a regex pattern via the token_pattern
keyword argument.
This pattern will be used when parsing the articles to build up the
vocabulary. The standard pattern includes single words containing
letters and numbers; we will modify it to only parse for words
containing letters.
0 | 1 | |
---|---|---|
0 | desc__kia | 0.890278 |
1 | proc__speak | 0.272335 |
2 | proc__acid | 0.255331 |
3 | desc__agent | 0.206627 |
4 | proc__scp | 0.190041 |
5 | desc__personnel | 0.188352 |
6 | proc__attempts | 0.174127 |
7 | proc__reacted | 0.169915 |
8 | proc__incapacitation | 0.161413 |
9 | proc__fear | 0.155381 |
10 | proc__rage | 0.155381 |
11 | proc__hydrochloric | 0.150702 |
12 | proc__massive | 0.150702 |
13 | proc__frequent | 0.146879 |
14 | proc__provoking | 0.146879 |
15 | proc__breach | 0.146078 |
16 | proc__lining | 0.140847 |
17 | proc__called | 0.138377 |
18 | proc__incapacitated | 0.138377 |
19 | proc__force | 0.138192 |
20 | proc__destroying | 0.136168 |
21 | proc__containment | 0.135959 |
22 | desc__pvt | 0.132891 |
23 | proc__difficulty | 0.132345 |
24 | proc__submerged | 0.132345 |
25 | proc__best | 0.130666 |
26 | desc__handled | 0.128236 |
27 | proc__chamber | 0.126861 |
28 | proc__plate | 0.125041 |
29 | proc__development | 0.123843 |
This looks much better. A few remarks:
- I had to google for the two abbreviations "kia" and "pvt". The first is the abbreviation for "killed in action" while the second stands for the military rank "Private".
- On second thought, "speak" may contain the information that the SCP object is able to speak and, thusly, might hint at it being sapient. As sapient SCPs are probably more likely to be of class euclid or keter, this could be valuable information for a Machine Learning model.
- One could start building a custom list of stop words more suitable for parsing SCP articles. In the list above, the words "best" and "called" as well as "scp" could be ignored. I will postpone this to the next part of this series of posts. Because some models give some insight in their learning process, we can use them to see if their decisions are based on filler words.
Conclusion
In this blog post, we have learned how to use Jupyter Notebooks and the pandas
library to extract basic statistics from SCP articles. Furthermore, we have used
a basic TF-IDF transformation to extract keywords from SCP articles.