Why I created my own fork of the Data Science Cookiecutter template

The Data Science Cookiecutter template is a great way to quickly set up your Data Science project. For instance, I have used and recommended it for my Machine Learning project as well as for a Data Analysis project at work. In this blog post, I want to emphasize four reasons why I created my own fork and will stop using the Data Science Cookiecutter template for future projects.

The reasons

The project repository is moving slowly

As of this writing, there have been 5 accepted commits in the master branch in 2019. Certainly, one could argue that this is due to the project being stable and close to being finished. In contrast, however, there are 30 open issues and 11 pull requests with a lot of discussion. In particular, there is an approved pull request that encompasses multiple feature requests. Even so, it has been open since March 2019 and has still not been merged into master.

The Data Science Cookiecutter template does not provide you with a test setup

Second, there is no test setup at the moment. There is an open pull request that suggests adding a test folder alongside the project folder.
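To make this concrete, here is a minimal sketch of what a single file in such a test folder could look like. It is not taken from the template or from the pull request. The function `normalize` is a hypothetical stand-in for code that would normally live in the project folder and be imported from there; it is defined inline only so that the example is self-contained.

```python
import unittest


def normalize(values):
    """Hypothetical project function: scale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are equal, so there is no range to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class NormalizeTest(unittest.TestCase):
    def test_scales_to_unit_interval(self):
        self.assertEqual(normalize([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input_maps_to_zero(self):
        self.assertEqual(normalize([3, 3]), [0.0, 0.0])
```

Saved under a name that starts with `test`, a file like this would be picked up by `python -m unittest discover`, which in turn is a natural candidate for a make target.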

The Data Science Cookiecutter template does not provide you with a choice of requirements management

Third, even though there is a requirements.txt in the Cookiecutter template with sensible defaults, it might not work on your system. For instance, I cannot install scikit-learn via pip. Instead, I have to rely on using conda. Unfortunately, the template does not provide me with an option to choose the package manager. Again, there is a lot of discussion in an open issue.

There are no pre-defined `make` targets for recurring tasks

Finally, there are tasks that you will deal with time and time again, like splitting your dataset into a train and a test set, training a collection of models on the train set and, finally, evaluating them on the test set. Apart from the choice of which models to train and which metrics to use for evaluation, these tasks are the same every time. Consequently, they should be automated via make targets.
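The three recurring steps can be sketched as a small Python script that a make target could call. This is only an illustration under simplifying assumptions: all function and class names are hypothetical, the data is a toy list of (feature, target) pairs, and the two "models" are trivial baselines built on the standard library rather than real estimators.

```python
import random
from statistics import mean, median


def split_dataset(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


class MeanModel:
    """Baseline that always predicts the mean target of the train set."""

    def fit(self, rows):
        self.prediction = mean(y for _, y in rows)
        return self

    def predict(self, x):
        return self.prediction


class MedianModel:
    """Baseline that always predicts the median target of the train set."""

    def fit(self, rows):
        self.prediction = median(y for _, y in rows)
        return self

    def predict(self, x):
        return self.prediction


def train_models(models, train_rows):
    """Fit every model on the train set and return them by name."""
    return {name: model.fit(train_rows) for name, model in models.items()}


def evaluate_models(fitted, test_rows):
    """Compute the mean absolute error of every fitted model on the test set."""
    return {
        name: mean(abs(model.predict(x) - y) for x, y in test_rows)
        for name, model in fitted.items()
    }


if __name__ == "__main__":
    # Toy data standing in for a real dataset.
    data = [(float(i), 3.0 * i + 1.0) for i in range(20)]
    train_rows, test_rows = split_dataset(data)
    fitted = train_models(
        {"mean": MeanModel(), "median": MedianModel()}, train_rows
    )
    for name, score in evaluate_models(fitted, test_rows).items():
        print(f"{name}: mean absolute error = {score:.2f}")
```

Only the dictionary of models and the metric inside `evaluate_models` change from project to project, which is exactly why the surrounding steps lend themselves to automation.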

Alternatives to the Data Science Cookiecutter template

If you have read this far and have agreed with (some of) the reasons, you might wonder what alternatives to using the Data Science Cookiecutter template there are. In fact, there are a lot: As of this writing, there are 943 forks of the project on GitHub. I am particularly fond of the Cookiecutter EasyData template. It provides you with a rich set of additional make targets as well as support for conda's environment.yml. Furthermore, there is lots of example code for data transformations. As for the cons, I find the test setup too minimal. More precisely, the code supplied in the project folder is not tested. Instead, there is one single test file illustrating testing with Python's built-in unittest module. Plus, using the project template seems fairly involved, and it is not yet documented well enough. After the maintainers have finished the tutorial project, this might be a good choice. I'll definitely keep an eye on this project!

After evaluating a few more templates, each with their own strengths and weaknesses, I have finally decided to fork the Data Science Cookiecutter template to add the functionality I need myself. I suggest that you do too: Think of all the Data Science projects you have done so far and answer the following question: What kind of functionality did you need in all of them? Then, build that functionality into the Data Science Cookiecutter template yourself. As already mentioned, there are lots of examples to gain inspiration from. Additionally, the process of building the template yourself and thinking about it may expose weaknesses and bottlenecks of your current workflow: You may realize that in all of your projects you have spent time on a task that can be automated via a make target!

To sum it up, building your own Data Science template over time with the Data Science Cookiecutter template as a starting point will get rid of its weaknesses and empower your own Data Science workflow. If you need some inspiration, check out the forks of the Data Science Cookiecutter template. For reference, here is my own fork: GriP on Data Science.