A quick way to streamline your data science project structure

May 22, 2023

Dr. Rachael Tatman, in one of her presentations, emphasized the significance of code reproducibility with a poignant statement:

“Why should you care about reproducibility? Because the person most likely to need to reproduce your work… is you.”

This statement holds true on multiple levels. Have you ever struggled to understand your own codebase? Do you frequently end up with files named untitled1.py or untitled2.ipynb? Most of us, if not all, have experienced the consequences of poor coding practices, and the problem is especially common in data science. We often prioritize the analysis and the final results while neglecting the quality of the code that produces them. Here is a useful tool that can help you create structured, reproducible projects.

Automating project template creation with Cookiecutter Data Science

The machine learning community lacks a clear consensus on best practices for organizing projects, which leaves a multitude of choices and plenty of confusion. DrivenData offers one solution: Cookiecutter Data Science, a tool that provides a standardized yet flexible project structure for performing and sharing data science work. With a couple of terminal commands, you can set up a comprehensive directory structure that simplifies project initiation, organization, and collaboration. You can visit the project home page for further information about this tool. Let's see it in action.

Installation

pip install cookiecutter

or

conda config --add channels conda-forge
conda install cookiecutter

Starting a new project

Head over to your terminal and run the following command. It will automatically populate a directory with the required files.

cookiecutter https://github.com/drivendata/cookiecutter-data-science

Using Cookiecutter Data Science | Image by Author

A sentiment analysis project directory is created at the specified path, which in the above case is the Desktop.
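
If you would rather start the project from a script or a notebook, Cookiecutter can also be called from Python. The snippet below is a minimal sketch, not the only way to do it: it assumes the cookiecutter package from the Installation step is available, and generate_project is a hypothetical helper name introduced here for illustration.

```python
# Sketch: run the Cookiecutter Data Science template from Python.
# Assumes the `cookiecutter` package is installed (see Installation).

TEMPLATE_URL = "https://github.com/drivendata/cookiecutter-data-science"


def generate_project(output_dir=".", checkout=None):
    """Run the template and return the path of the generated project.

    `generate_project` is an illustrative helper, not part of Cookiecutter.
    """
    # Imported lazily so this file can be loaded even before
    # cookiecutter has been installed.
    from cookiecutter.main import cookiecutter

    # no_input is left at its default (False), so cookiecutter asks the
    # same interactive questions as the command-line tool.
    return cookiecutter(TEMPLATE_URL, checkout=checkout, output_dir=output_dir)
```

Calling generate_project() needs network access to fetch the template and will prompt for the project details, just like the terminal command.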

The directory structure of the newly created project | Image by Author
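
To inspect the generated layout yourself, a small helper can print the directory tree. This is a minimal sketch using only the standard library; render_tree and max_depth are illustrative names and are not part of Cookiecutter.

```python
from pathlib import Path


def render_tree(root, max_depth=2):
    """Return an indented listing of `root`, directories before files."""
    root = Path(root)
    lines = [f"{root.name}/"]

    def walk(directory, depth):
        # Stop descending once the requested depth has been listed.
        if depth > max_depth:
            return
        # Directories sort first (is_file() is False), then by name.
        entries = sorted(directory.iterdir(), key=lambda p: (p.is_file(), p.name))
        for entry in entries:
            indent = "    " * depth
            if entry.is_dir():
                lines.append(f"{indent}{entry.name}/")
                walk(entry, depth + 1)
            else:
                lines.append(f"{indent}{entry.name}")

    walk(root, 1)
    return "\n".join(lines)
```

For example, print(render_tree("my_project")) lists the first two levels of a project folder, where my_project is a placeholder for whatever name you gave at the prompt.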

Note: Cookiecutter Data Science will be moving to version 2 soon, so there will be a slight change in how the command is used. You will have to use ccds ... rather than cookiecutter ... in the command above. As per the GitHub repository, this version of the template will still be available, but you will have to explicitly pass -c v1 to select it. Keep an eye on the documentation for when the change happens.
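
If a script or notebook needs to keep using this version of the template after the switch, the pinned command could be assembled as in the sketch below. It relies on the -c v1 option mentioned in the note (-c is Cookiecutter's checkout option for selecting a branch or tag); v1_command is a hypothetical helper name, and the exact behavior should be checked against the documentation once version 2 is out.

```python
import shlex

TEMPLATE_URL = "https://github.com/drivendata/cookiecutter-data-science"


def v1_command(template=TEMPLATE_URL):
    """Build the cookiecutter call that selects the v1 template.

    `v1_command` is an illustrative helper, not part of Cookiecutter.
    """
    # -c / --checkout tells cookiecutter which branch, tag, or commit
    # of the template repository to use.
    return ["cookiecutter", template, "-c", "v1"]


# Show the command as it would be typed in a terminal.
print(shlex.join(v1_command()))
```

The returned list can be passed to subprocess.run(v1_command(), check=True) to launch the same interactive prompts from Python.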

Cookiecutter Data Science provides researchers and data scientists with a well-organized project directory using minimal code. This efficient approach fosters collaboration, enhances reproducibility, and maintains a consistent project structure among team members.

