As posited by Lev Tolstoy in his seminal work, Anna Karenina: “Happy families are all alike; every unhappy family is unhappy in its own way.” Likewise, all successful data science projects go through a very similar building process, while there are countless ways to fail one. So I’ve decided to prepare a detailed guide for data scientists who want to make sure that their project will be a 100% disaster.
Practicing data science with real-world data and business problems is rather different from, for instance, building data science projects in Python using toy datasets. As part of a data science team in an enterprise, one should expect many challenges: messy data, lack of data, unclear goals, difficult communication with business managers who want quick results, and model performance in production that is very different from performance in testing.
Therefore, to become a successful data scientist with a portfolio of outstanding projects, it is not enough to be good at coding and building machine learning models. One should also be able to approach a project strategically and consider many different factors, not only from the viewpoint of a data scientist but also from a business perspective.
However, what if you are actually not interested in succeeding in data science? In that case, read carefully through the tips provided below.
Start a project without an established goal. You can start collecting data and thinking about some models to build, even with vague goals set by business leaders like “Increase our sales!” or “Improve our recommender system.” After all, why not just play with data and see where it takes you?
Don’t explain to business managers what they can expect from the project. As you can guess, C-level executives often expect miracles from data science. Many seem to think that after hiring a very expensive data science team, their company will substantially increase its market share, and the ROI will skyrocket in the coming months. But that’s not your problem. Don’t ruin their dreams by explaining the real capabilities of machine learning.
Don’t involve domain experts in the project. You know everything you need! You’ve done data science projects in Python before! You don’t need any domain expertise to understand a business problem the company wants you to solve or relationships between different attributes or the nature of outliers. You can just calculate the correlations and remove any outliers that hinder the performance of your machine learning model.
Start a project without a clear picture of all the data you have access to. Just start. Companies collect tons of data. It shouldn’t be a problem to combine all the data from different systems and databases into one nice dataset that contains enough attributes to build a strong prediction model.
Rush through the data cleanup. Data cleaning is boring and takes too much time. Perform only the minimum number of tasks you need to run a model, such as filling in missing values with mean values and removing outliers. All the issues regarding data normalization, balancing highly imbalanced datasets, and feature engineering are overemphasized. You are a data scientist. Your job is to experiment with machine learning models, and you want to complete as many data science projects as possible. Why should you spend up to 80% of your time on data cleaning and preprocessing?
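Should you be tempted to ignore this tip, a minimal Python sketch shows what mean imputation can do when a single outlier sits in the column. The spend values and the impute helper are invented for illustration and are not taken from any real project.

```python
from statistics import mean, median

# Hypothetical monthly spend values; None marks a missing entry.
spend = [120.0, 135.0, None, 128.0, 9_500.0, None, 131.0]


def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]


# Compute both candidate fill values from the observed entries only.
observed = [v for v in spend if v is not None]
mean_fill = mean(observed)
median_fill = median(observed)

mean_imputed = impute(spend, mean_fill)
median_imputed = impute(spend, median_fill)

print(f"mean fill:   {mean_fill:.1f}")
print(f"median fill: {median_fill:.1f}")
```

In this toy column the one extreme value pulls the mean fill far above every typical entry, while the median fill stays among them. Which fill is appropriate still depends on the data and on the business question.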
Don’t spend time on exploratory data analysis. Just as it is with data cleaning, the importance of exploratory data analysis is also overhyped when it comes to data science projects. You don’t need an in-depth understanding of your data to build a good model. All the visualizations of your data, distributions of features and outcome variables, correlation matrices, and other boring graphs and tables have zero impact on the performance of your model.
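For the stubborn few who insist on looking at their data first, here is a minimal sketch of two quick checks: how many values are missing per column, and how balanced the outcome variable is. The customer records, column names, and helper functions are all hypothetical.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
rows = [
    {"age": 34, "plan": "basic", "churned": 0},
    {"age": None, "plan": "basic", "churned": 0},
    {"age": 51, "plan": "pro", "churned": 0},
    {"age": 29, "plan": None, "churned": 1},
    {"age": 42, "plan": "basic", "churned": 0},
    {"age": None, "plan": "pro", "churned": 0},
]


def missing_counts(records):
    """Count None values per column (columns with none missing are omitted)."""
    counts = Counter()
    for record in records:
        for column, value in record.items():
            if value is None:
                counts[column] += 1
    return dict(counts)


def class_shares(records, target):
    """Return the share of rows for each value of the target column."""
    counts = Counter(record[target] for record in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


missing = missing_counts(rows)
shares = class_shares(rows, "churned")
print("missing per column:", missing)
print("outcome shares:", shares)
```

Even on six rows, the checks surface two things worth knowing before modeling: a third of the ages are missing, and only one customer in six churned.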
Choose the most complex model you can think of. For example, you can start with a deep neural network; the more layers, the better! You see, simple models never give the great results that deep learning does. They are only for newbies. For everyone else, it’s way more fun to play with neural networks.
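If you would rather know whether that deep network is earning its keep, one common sanity check is a trivial baseline. The sketch below, with made-up labels and hypothetical helper names, predicts the most frequent training label for every test row and measures the resulting accuracy.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return the most frequent label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, labels):
    """Return the fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up binary labels for illustration only.
train_labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
test_labels = [0, 0, 1, 0, 0]

# Predict the majority training label for every test row.
baseline_label = majority_baseline(train_labels)
baseline_preds = [baseline_label] * len(test_labels)
baseline_acc = accuracy(baseline_preds, test_labels)

print(f"baseline accuracy: {baseline_acc:.2f}")
```

Any model, however many layers it has, is worth the added complexity only if it clearly beats a number like this one.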
Act shocked by your poorly performing model. Why is this fancy deep learning model performing so abysmally? After all, you did your best!
Come up with an excuse. Possible excuses include but are not limited to: “This problem was not possible to solve with machine learning,” “The company doesn’t have the required data,” or “This industry is not ready for machine learning.” Of course, you should parrot this excuse only after working on the project for at least six months. That way, you can be sure your explanation is valid!
So, now you know how to fail a data science project! However, it is far more likely that you would prefer to succeed as a data scientist. If this description sounds more suited to your needs, then, obviously, you need to ignore all the recommendations provided above or – better yet – do the opposite.
As a bonus, I would like to give you a final piece of advice about becoming a data scientist and building an impressive history of successful data science projects: keep learning!
Vertabelo Academy offers engaging courses in Python, R, and SQL with many interactive exercises. The courses are aimed at beginners without an IT background. There are also other data science courses available online, such as those from Coursera, edX, Udemy, and Udacity. However, these often lack interactivity and might be too challenging for non-IT learners.
Disclosure: I am a Data Science Writer at Vertabelo Academy.