Breaking the Jargons #2: June Edition
Automating your data science projects, reviewing the Hugging Face course, and more...
Hi there!
Welcome to the second edition of this newsletter. This edition brings you a mix of various articles ranging from course reviews, useful open-source libraries for machine learning to tips for automating your projects. I hope you enjoy the read.
📜 Articles
Here are some of my favorite articles published in June:
Reviewing the recently released Hugging Face 🤗 Course
I reviewed the recently released Hugging Face course. I look at the course content, its offerings, and whether or not it ticks the right boxes for us.
Five Open-Source Machine Learning Libraries Worth Checking Out
In this article, I take a quick tour of some libraries I recently encountered that could be a great supplement to your machine learning stack. These are not your basic EDA libraries but more advanced ones: libraries that compile trained traditional machine learning models into tensor computations, a topic modeling technique that leverages BERT embeddings, and libraries enabling interpretability for PyTorch models.
Beware of the Dummy Variable Trap in pandas
Handling categorical variables is an essential component of a machine learning pipeline. There are many ways to encode categorical variables, and pandas' dummy variable encoding is one of them. However, this encoding technique comes with its own limitations, and in this article, I present some workarounds to save ourselves from the trap.
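The trap itself is easy to demonstrate in a few lines. Here is a minimal sketch (the column and value names are illustrative, not from the article):

```python
import pandas as pd

# Toy dataset with a single categorical feature (illustrative names)
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Naive one-hot encoding: the dummy columns always sum to 1 across a row,
# so any one column is a perfect linear combination of the others.
# This perfect multicollinearity is the "dummy variable trap" for linear models.
dummies = pd.get_dummies(df["color"])
print((dummies.sum(axis=1) == 1).all())  # True: the columns are collinear

# One workaround: drop the first level so the remaining columns are
# linearly independent (the dropped level becomes the implicit baseline).
safe = pd.get_dummies(df["color"], drop_first=True)
print(list(safe.columns))  # one fewer column: ['green', 'red']
```

Dropping a level loses no information, since the omitted category is implied whenever all remaining dummy columns are zero.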
Automate your data science project structure in three easy steps
Have you ever found yourself in a situation where it became difficult to decipher your codebase? Do you often end up with multiple files like untitled1.py or untitled2.ipynb? The situation is even grimmer in data science. Often, we limit our focus to the analysis and the end product while ignoring the quality of the code responsible for that analysis. In this article, I share my three favorite tools to help organize and structure your projects in a reusable and reproducible format.
Building a compelling Data Science Portfolio with writing
Writing in data science can have a transformative effect not only on your journey but also on your career. I appeared on the FastBook Reading Sessions organised by Weights & Biases to discuss this topic, and I wrote this piece to summarize what I covered there. Primarily, it discusses why writing matters in data science and how it can be used as a tool to strengthen your portfolio.
🎙️ Interviews
What does it take to win a Kaggle competition? Let’s hear it from the winner himself.
This time I got to interview Dmitry Gordeev, also known as dott in the Kaggle world. He is a Kaggle Competitions Grandmaster and a Senior Data Scientist at H2O.ai. In this interview, Dmitry talks about his recent win in the Indoor Location & Navigation competition on Kaggle and his approach to data science in general.
🔬 Research Papers Recommendations
A research paper I found particularly interesting this month:
Tabular Data: Deep Learning is Not All You Need
This paper examines the effectiveness of recently proposed deep learning frameworks for tabular datasets. The authors evaluate TabNet, Neural Oblivious Decision Ensembles (NODE), DNF-Net, and 1D-CNN deep learning models and compare their performance against XGBoost on eleven datasets. Nine of the eleven datasets were drawn from the papers that introduced these deep learning models. The authors draw the following key conclusions from their study:
The XGBoost model generally outperformed the deep models.
In most cases, the deep learning models performed worse on datasets that did not appear in their original papers.
No deep model consistently outperformed the others.
However, an ensemble of the deep learning models and XGBoost outperformed the other models in most cases.
Finally, in the words of the authors:
while significant progress has been made using deep models for tabular data, they still do not outperform XGBoost, and further research is needed in this field. Our somewhat improved ensemble results provide another potential avenue for further research.
💡 Concept corner
I find it fascinating when people break down complex machine learning concepts in easy-to-understand bits. Edwin Chen has this wonderful piece on the intuition behind Random Forests 🌳🌳🌳. If you are new to machine learning, this’ll help you grasp the concept, and if you are a veteran, you’ll enjoy the analogy.
🎁 Resource of the Month
A new and free OpenCV course has been released by freeCodeCamp.org in association with the creators of OpenCV. The course teaches a wide range of exciting topics like Image & Video Manipulation, Image Enhancement, Filtering, Edge Detection, Object Detection, Tracking, Face Detection, and the OpenCV Deep Learning Module.
That is all for this edition. See you with another roundup next month. You can subscribe to receive the newsletter directly in your inbox every month, or share it with someone who might find it helpful.
Until next month,
Parul