The Data Science Cookiecutter template is a great way to quickly set up your Data Science project. For instance, I have used and recommended it for my Machine Learning project as well as for a Data Analysis project at work. In this blog post, I want to emphasize four reasons why I created my own fork and will stop using the Data Science Cookiecutter template for future projects.
The reasons
The project repository is moving slowly
As of this writing, there have been 5 accepted commits in the master branch in
2019. Certainly, one could argue that this is due to the project being stable
and close to being finished. However, there are 30 open issues and 11 pull
requests with a lot of discussion. In particular, there is an approved pull
request that encompasses multiple feature requests. Even so, it has been open
since March 2019 and has not been merged into master as of this writing.
The Data Science Cookiecutter template does not provide you with a test setup
Second, there is no test setup at the moment. There is an open pull request that suggests adding a test folder parallel to the project folder.
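To give an idea of what such a test folder could contain, here is a minimal sketch using Python's built-in unittest module. The helper train_test_sizes is a hypothetical example I made up for illustration; it is not part of the template or of the pull request. In a real project the helper would live in the project folder and only the test class would live in the test folder.

```python
import unittest


def train_test_sizes(n_samples, test_fraction=0.2):
    """Hypothetical project helper: return (n_train, n_test) for a dataset."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    n_test = int(round(n_samples * test_fraction))
    return n_samples - n_test, n_test


class TrainTestSizesTest(unittest.TestCase):
    def test_sizes_add_up(self):
        # 25% of 100 samples go to the test set, the rest to the train set.
        n_train, n_test = train_test_sizes(100, test_fraction=0.25)
        self.assertEqual((n_train, n_test), (75, 25))

    def test_invalid_fraction_is_rejected(self):
        # Fractions outside (0, 1) make no sense and should fail loudly.
        with self.assertRaises(ValueError):
            train_test_sizes(100, test_fraction=1.5)
```

Tests written this way can be collected with python -m unittest from the project root.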
The Data Science Cookiecutter template does not provide you with a choice of requirements management
Third, even though there is a requirements.txt in the Cookiecutter template
with sensible defaults, it might not work on your system. For instance, I cannot
install scikit-learn via pip. Instead, I have to rely on using conda.
Unfortunately, the template does not provide me with an option to choose the
package manager. Again, there is a lot of discussion in an open issue.
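One way a fork could offer this choice is sketched below. Cookiecutter supports choice variables in cookiecutter.json and post-generation hooks in hooks/post_gen_project.py, so a fork could ship both dependency files and delete the one that was not selected. The function name keep_only_selected and the mapping of managers to files are my own assumptions for illustration; this is not how the original template or any particular fork does it.

```python
from pathlib import Path

# Assumed mapping: which dependency file belongs to which package manager.
DEPENDENCY_FILES = {
    "pip": "requirements.txt",
    "conda": "environment.yml",
}


def keep_only_selected(project_dir, package_manager):
    """Delete the dependency files that do not belong to the chosen manager.

    Returns the names of the files that were removed.
    """
    if package_manager not in DEPENDENCY_FILES:
        raise ValueError(f"unknown package manager: {package_manager!r}")
    removed = []
    for manager, filename in DEPENDENCY_FILES.items():
        path = Path(project_dir) / filename
        if manager != package_manager and path.exists():
            path.unlink()
            removed.append(filename)
    return removed
```

In an actual hook, the value of package_manager would come from a template variable (hypothetically named package_manager in cookiecutter.json) and project_dir would be the freshly generated project.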
There are no pre-defined make targets for recurring tasks
Finally, there are tasks that you will deal with time and time again, like
splitting your dataset into a train and a test set, training a collection of
models on the train set and, finally, evaluating them on the test set. Apart
from the choice of which models to train and what kinds of metrics to use to
evaluate them, these tasks are the same every time. Consequently, they should be
automated via make targets.
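To illustrate how little of this workflow is project-specific, here is a minimal sketch in plain Python. All names (split_rows, run_experiment, the two toy models and the metric) are hypothetical and chosen for illustration only; a real project would most likely use a library such as scikit-learn instead. The point is that only the models and the metric change between projects, while the split, train and evaluate steps stay the same.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def run_experiment(rows, trainers, metric, test_fraction=0.2, seed=0):
    """Split, train every model on the train set, score each on the test set."""
    train, test = split_rows(rows, test_fraction, seed)
    return {name: metric(fit(train), test) for name, fit in trainers.items()}


# Project-specific part: which models to train and which metric to use.
def fit_mean(train):
    """Toy model that always predicts the mean target of the train set."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean


def fit_zero(train):
    """Toy baseline that always predicts zero."""
    return lambda x: 0.0


def mean_absolute_error(model, test):
    return sum(abs(model(x) - y) for x, y in test) / len(test)


rows = [(x, 2.0 * x) for x in range(10)]
scores = run_experiment(
    rows, {"mean": fit_mean, "zero": fit_zero}, mean_absolute_error
)
print(scores)
```

A make target would then only need to call a script like this one, so that the whole experiment can be rerun with a single command.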
Alternatives to the Data Science Cookiecutter template
If you have read this far and have agreed with (some of) the reasons, you might
wonder what alternatives to the Data Science Cookiecutter template there are. In
fact, there are a lot: as of this writing, there are 943 forks of the project on
GitHub. I am particularly fond of the Cookiecutter EasyData template. It
provides you with a rich set of additional make targets as well as support for
conda's environment.yml. Furthermore, there is lots of example code for data
transformations. As for the cons, I find the test setup too minimal. More
precisely, the code supplied in the project folder is not tested. Instead, there
is one single test file illustrating testing with Python's built-in unittest
module. Plus, using the project template seems to be rather involved, and it is
not documented well enough. Once the maintainers have finished the tutorial
project, this might be a good choice. I'll definitely keep an eye on this
project!
After evaluating a few more templates, each with their own strengths and
weaknesses, I have finally decided to fork the Data Science Cookiecutter
template to add the functionality I need myself. I suggest that you do too:
Think of all the Data Science projects you have done so far and answer the
following question: What kind of functionality did you need in all of them?
Then, build that functionality into the Data Science Cookiecutter template
yourself. As already mentioned, there are lots of examples to gain inspiration
from. Additionally, the process of building the template yourself and thinking
about it may expose weaknesses and bottlenecks of your current workflow: You may
realize that in all of your projects you have spent time on a task that can be
automated via a make target!
To sum it up, building your own Data Science template over time with the Data Science Cookiecutter template as a starting point will get rid of its weaknesses and empower your own Data Science workflow. If you need some inspiration, check out the forks of the Data Science Cookiecutter template. For reference, here is my own fork: GriP on Data Science.