Classifying SCPs, Part 2: Data transformation (TF-IDF) and preprocessing

After we have obtained data through the means described in the first part of this blog post series, it is time to deal with data transformation and preprocessing. While humans can comprehend textual information in the form of articles, such text is hard for a Machine Learning algorithm to digest. In this blog post, we …
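
As a rough illustration of the kind of transformation this involves, here is a minimal TF-IDF sketch. The use of scikit-learn's TfidfVectorizer and the placeholder texts are assumptions for illustration, not details taken from the post itself.

```python
# Minimal TF-IDF sketch (assumption: scikit-learn; `texts` stands in for the crawled articles).
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "SCP-173 is to be kept in a locked container at all times.",
    "Personnel are to maintain visual contact with SCP-173.",
]

vectorizer = TfidfVectorizer(stop_words="english")  # drop common English words
X = vectorizer.fit_transform(texts)                 # sparse document-term matrix

print(X.shape)                                # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out()[:5]) # a few of the learned vocabulary terms
```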

Quickfix: jupyter nbconvert with clear-output flag not working

Jupyter comes with a command line utility, jupyter nbconvert, that can transform a Jupyter notebook into different formats. Furthermore, it has options to tweak the output. For instance, the --execute flag executes the notebook before transforming it, while the --clear-output flag is supposed to remove all outputs from the notebook. Thus, if you want to execute …
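
For reference, typical invocations might look like the following; the notebook name is a placeholder, and whether --clear-output actually behaves as expected in this setting is exactly what the quickfix post examines.

```
# Execute the notebook and write the result back out as a notebook
jupyter nbconvert --to notebook --execute my_notebook.ipynb

# Strip all cell outputs from the notebook (the flag discussed in the post)
jupyter nbconvert --to notebook --clear-output my_notebook.ipynb
```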

Classifying SCPs, Part 1: Building a web crawler

This first blog post in the series on our Data Science project of classifying SCPs is concerned with the starting point of any data-related problem, really: how do we obtain data? Fortunately, the data we want to build a Machine Learning model upon is readily available in the form of articles with standardized URLs. Thus, …
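
A bare-bones crawler for such standardized URLs could look roughly like this; the URL pattern, the page-content id, and the use of requests and BeautifulSoup are assumptions for illustration, not the post's actual implementation.

```python
# Sketch of fetching one SCP article from an assumed standardized URL.
import requests
from bs4 import BeautifulSoup

def fetch_scp(number):
    # Hypothetical URL pattern; adjust to the actual site layout.
    url = f"https://scp-wiki.wikidot.com/scp-{number:03d}"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Assumption: the article body lives in a div with id "page-content".
    content = soup.find("div", id="page-content")
    return content.get_text(separator="\n") if content else ""

print(fetch_scp(173)[:200])  # first 200 characters of the article text
```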

A Machine Learning Project: Classifying SCPs, Part 0, Overview

Tutorials on Machine Learning tend to emphasize the process of training models, interpreting their scores and tuning hyper-parameters. While these are certainly important tasks, most tutorials and documentation are not concerned with how the data is obtained; usually, a toy data set provided by the Machine Learning library is used. In a series …

The Bayes classifier

The Bayes classifier is a Data Scientist’s unbeatable arch-enemy. When we face the challenge of finding an optimal solution to a classification problem, it is the ultimate (theoretical!) benchmark for evaluating the performance of a classifier. The problem with the Bayes classifier is that it uses information that generally is not available to us …
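
To make the "information that generally is not available" concrete, here is a toy sketch in which the true class-conditional distributions are known, so the Bayes classifier can actually be written down. The Gaussian setup is an illustrative assumption, not taken from the post.

```python
# Toy Bayes classifier: pick the class with the highest posterior probability.
# This requires the true priors and class-conditional densities, which we
# normally do not know in practice.
from scipy.stats import norm

priors = {0: 0.5, 1: 0.5}                                   # assumed class priors
densities = {0: norm(loc=-1.0, scale=1.0),                  # assumed true densities
             1: norm(loc=1.0, scale=1.0)}

def bayes_classifier(x):
    # argmax over classes of prior * likelihood (proportional to the posterior)
    posteriors = {k: priors[k] * densities[k].pdf(x) for k in priors}
    return max(posteriors, key=posteriors.get)

print(bayes_classifier(0.3))  # -> 1, since 0.3 is closer to the mean of class 1
```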