Data Science downsides

After reading the introductory article What is Data Science, you might have the mistaken idea that data science is almost perfect and can solve any problem.

In this post you will discover the downsides and gaps of data science, along with the problems and mistakes you can run into if you don’t keep some considerations in mind.

Post content:

1. Data Science hype
2. Data Science examples failures
3. Data Science common mistakes
4. Data Science difficulties


1. Data Science hype

It is common for emerging technologies to experience excessive expectations or overvaluation in their early years. This phenomenon is described by the Gartner Hype Cycle, which provides a graphical and conceptual presentation of the maturity and adoption of emerging technologies through five phases:

  1. Technology Trigger: A potential technology breakthrough kicks things off. Early proof-of-concept stories and media interest trigger significant publicity. Often no usable products exist and commercial viability is unproven.
  2. Peak of Inflated Expectations: Early publicity produces a number of success stories — often accompanied by scores of failures. Some companies take action; many do not.
  3. Trough of Disillusionment: Interest wanes as experiments and implementations fail to deliver. Producers of the technology shake out or fail. Investments continue only if the surviving providers improve their products to the satisfaction of early adopters.
  4. Slope of Enlightenment: More instances of how the technology can benefit the enterprise start to crystallize and become more widely understood. Second- and third-generation products appear from technology providers. More enterprises fund pilots; conservative companies remain cautious.
  5. Plateau of Productivity: Mainstream adoption starts to take off. Criteria for assessing provider viability are more clearly defined. The technology’s broad market applicability and relevance are clearly paying off.

Gartner places Data Science at the beginning of phase 2, Peak of Inflated Expectations, while Big Data is placed at the end of the same phase:

3.1 - Data science hype

This hype implies inflated expectations about what can be solved with data science or, more specifically, with data. It is a misconception that we can throw more and more data at an algorithm and solve any problem. The title of the article The End of Theory: The Data Deluge Makes the Scientific Method Obsolete is an example of this attitude.

But if there is one term generating more hype than data science, it is “Big Data”. The list of articles and news stories proclaiming the “Big Data revolution” is long.

The content of these articles is far from wrong or inadequate, but it’s easy to see that expectations for big data are high, and that may disappoint many people, at least in the short term.


2. Data Science examples failures

Do you remember the examples shown in the first post (Data Science examples), specifically the Netflix Prize and the Google Flu project? We didn’t tell the whole story.

2.1. Netflix prize failure

Netflix held a competition that awarded a $1 million prize to the developer team whose algorithm improved the accuracy of the company’s recommendation engine by 10%.

Even with such a big reward, Netflix doesn’t use the algorithm, and has no plans to implement it in the future. The reason, in Netflix’s words:

We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.

Also, our focus on improving Netflix personalization had shifted to the next level by then.

Of course the competition was not a total failure (Netflix still uses two algorithms from the team that won the first Progress Prize), but it is a good example of the imperfection of data science.

You can find the details of the Netflix Prize and the recommendation system in The Netflix Tech Blog.

2.2. Google flu trends failure

Google Flu Trends was one of the early explorations of how real-world phenomena could be modeled using patterns in search queries. The Nature letter “Detecting influenza epidemics using search engine query data” explains how the project provided useful insights.

The idea of predicting the future based on people’s searches was (and is) an interesting one, but after some years the project started to fail. In the 2011-2012 season, Google Flu Trends overestimated doctors’ visits by more than 50 percent. In 2013, during the peak of the flu season, the error was as high as 140 percent.

Does this failure mean that it’s not possible to extract knowledge from data to make predictions? Not at all, but it shows how difficult it is. You can find more details in the article What We Can Learn From the Epic Failure of Google Flu Trends. For example, one of the reasons for the failure:

Google’s algorithm was quite vulnerable to overfitting to seasonal terms unrelated to the flu, like “high school basketball.” With millions of search terms being fit to the CDC’s data, there were bound to be searches that were strongly correlated by pure chance, and these terms were unlikely to be driven by actual flu cases or predictive of future trends.
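
The effect described in the quote can be reproduced with purely random numbers. The following is a minimal sketch, not Google’s actual method: every series is synthetic Gaussian noise, and the function names (pearson, best_chance_match) and sizes (3,000 candidate terms, 20 weeks used for selection, 100 later weeks used for checking) are assumptions chosen only for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_match(n_terms=3000, fit_weeks=20, check_weeks=100, seed=0):
    """Pick the random "search term" that best matches a random target.

    Returns the winner's correlation on the weeks used for selection and
    its correlation on the later weeks that were not used.
    """
    rng = random.Random(seed)
    total_weeks = fit_weeks + check_weeks
    # Hypothetical target series (a stand-in for reported flu activity).
    target = [rng.gauss(0, 1) for _ in range(total_weeks)]

    best_term, best_r = None, 0.0
    for _ in range(n_terms):
        # Each candidate term is pure noise, unrelated to the target.
        term = [rng.gauss(0, 1) for _ in range(total_weeks)]
        r = pearson(term[:fit_weeks], target[:fit_weeks])
        if abs(r) > abs(best_r):
            best_term, best_r = term, r

    later_r = pearson(best_term[fit_weeks:], target[fit_weeks:])
    return best_r, later_r


fit_r, later_r = best_chance_match()
print(f"correlation on the weeks used for selection: {fit_r:+.2f}")
print(f"correlation on the following weeks:          {later_r:+.2f}")
```

With enough candidates, at least one unrelated series looks strongly correlated on a short window, and the relationship typically fades on data that was not used to select it.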


3. Data Science common mistakes

Data scientists, like other scientists, can’t always avoid making mistakes, and some mistakes even lead to new discoveries. But we have to remember that data scientists are expensive for the companies that hire them, so it is important for every data scientist to avoid bad practices and repeated mistakes.

In this section we will list the most common mistakes in data science and how to avoid them.

3.1. Start analyzing without a question

“Would you tell me, please, which way I ought to go from here?”

“That depends a good deal on where you want to get to.”

“I don’t much care where…”

“Then it doesn’t matter which way you go.”

3.2 - Start analyzing without a question

Taking some data and starting to plot graphs and search for correlations can be fun (and it’s a good way to learn or try a new technology). But if you want to solve real problems with your data, you should know where you want to get to.

A good data science process has to be structured, starting with one or more well-defined questions or problems. If you skip this step, you will find yourself changing the question you are trying to answer before you really answer it.

During any analysis, it is possible to come up with new ideas or discoveries that were not part of the initial hypotheses, but they should help you solve your original problem, not force you to change it.

3.2. Using poor quality data

Once you have a question to answer or a problem to solve, you need the right data to do it. Using poor quality data carries the risk of not solving any problem or, even worse, of leading you to wrong answers or conclusions. That is a waste of time and money you want to avoid.

According to the CIO article Disconnect between CIOs and LOB managers weakens data quality, a survey of 200 senior IT and business leaders from large enterprises shows that most analytics projects suffer from bad data syndrome:

CIOs know that bad data is bad for business, but they aren’t putting enough time or resources behind their data quality efforts. That’s dangerous at a time when businesses are seeking competitive advantages with their data.

Based on the results of the study, the main causes for poor data quality are:

3.3 - Causes for poor data qualities

And the data attributes that need to be addressed are:

3.4 - Data attributes that need to be addressed

3.3. Focus only on technology

Technology is important: it’s the tool that allows data scientists to work with big data. But technology is not everything, and it is not even the most important part of a data science project.

You can do good data analysis with good data, knowledge and the right techniques even if you don’t use the best technology, but the reverse is not true. Even with the latest cutting-edge language or big data library, you can’t go far if you lack the right data and knowledge.

As a data scientist, try to do your best with what you have and focus more on solving problems than on tools.

3.4. Confuse correlation and causation

As we saw before, we can’t rely only on data to obtain good results. It’s possible to look at a dataset and see things that seem to be true but that can’t be generalized. Quoting economist Ronald Coase:

If you torture the data long enough, it will confess.

Confusing correlation with causation is a common mistake made while analyzing large amounts of data, and it can lead to wrong decisions and false predictions. Let’s see the difference:

  • Correlation: we say there is a correlation between X and Y when they tend to vary together, for example when it is common to observe both of them at the same time.
  • Causation: we say there is causation between X and Y, or X causes Y, when the occurrence of X brings about the occurrence of Y.

The mistake is to forget that a correlation between two variables does not imply that one causes the other. Usually a correlation without a known cause is not sufficient to draw conclusions from the data, let alone to make important decisions.

This famous joke by xkcd summarizes it well:

3.5 - Correlation vs causation

If you’re still not convinced of the difference between the two concepts, Tyler Vigen has made a great collection of examples of crazy correlations on his site Spurious Correlations. Some examples:

3.6 - Correlation example 1

3.7 - Correlation example 2

You may think it’s obvious that the correlation in these examples doesn’t imply causation, but in less absurd cases it’s easy to fall into the trap.
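
A less absurd case is a hidden common cause. The sketch below uses made-up, simulated data (temperature, ice cream sales and sunburns are hypothetical variables, and the function names are illustrative): two quantities that never influence each other end up clearly correlated because both depend on a third one, and the correlation vanishes once one of them is set independently.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_days=2000, seed=1):
    """Return (observed correlation, correlation after an intervention)."""
    rng = random.Random(seed)
    # Hidden common cause (hypothetical): standardized daily temperature.
    temperature = [rng.gauss(0, 1) for _ in range(n_days)]
    # Both observed variables depend on temperature, not on each other.
    ice_cream = [t + rng.gauss(0, 1) for t in temperature]
    sunburns = [t + rng.gauss(0, 1) for t in temperature]
    observed_r = pearson(ice_cream, sunburns)

    # Intervention: set ice cream sales at random, ignoring temperature.
    # Sunburns are left as they were because, in this model, they never
    # depended on ice cream sales in the first place.
    forced_ice_cream = [rng.gauss(0, 1) for _ in range(n_days)]
    intervened_r = pearson(forced_ice_cream, sunburns)
    return observed_r, intervened_r


observed_r, intervened_r = simulate()
print(f"observed correlation:           {observed_r:+.2f}")
print(f"correlation after intervention: {intervened_r:+.2f}")
```

In real projects the common cause is rarely known in advance, which is why a correlation alone is weak evidence for a decision.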

3.5. Failing to communicate the results

Communication is one of the skills that every data scientist needs to master. It is usually not difficult for a technical person to explain a problem or a solution to a colleague with the same technical background, but simplifying explanations (avoiding jargon or complex terms) for a non-technical audience is harder.

Business skills and visualization help to avoid this mistake. With business skills, data scientists can understand requirements coming from marketing groups, product development teams, and corporate executives. Visualization, on the other hand, allows data scientists to create visual representations of patterns and analyses, which helps to present evidence in a more persuasive manner.

3.6. Complicate the analysis too much

Most of the problems solved by data scientists are not easy and therefore they don’t have a simple solution. But in general, an over-complicated solution is not a good solution.

There is a concept in machine learning and statistics called overfitting, which is closely related to this mistake of over-complicating the analysis. Overfitting occurs when a model is so complex that it fits the noise or random error in the data rather than the underlying pattern, which reduces its predictive power. According to Wikipedia:

Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.

A model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.

3.8 - Overfitting example
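
To make the definition concrete, here is a minimal sketch on synthetic data. The true relationship (y = 2x + 1), the noise level and the function names are all assumptions made for illustration. It compares a straight line (2 parameters) with a polynomial that passes through every training point (15 parameters for 15 observations), measuring error against the noise-free relationship at points not used for fitting.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def interpolate(xs, ys, x):
    """Lagrange polynomial through every (xs[i], ys[i]), evaluated at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def true_f(x):
    # Assumed underlying relationship for this illustration.
    return 2 * x + 1


rng = random.Random(42)
n_points = 15
train_x = [i / (n_points - 1) for i in range(n_points)]
train_y = [true_f(x) + rng.gauss(0, 0.5) for x in train_x]  # noisy observations
# Test points lie halfway between training points; targets are noise-free.
test_x = [(i + 0.5) / (n_points - 1) for i in range(n_points - 1)]
test_y = [true_f(x) for x in test_x]

slope, intercept = fit_line(train_x, train_y)
line_train = mse([slope * x + intercept for x in train_x], train_y)
line_test = mse([slope * x + intercept for x in test_x], test_y)
poly_train = mse([interpolate(train_x, train_y, x) for x in train_x], train_y)
poly_test = mse([interpolate(train_x, train_y, x) for x in test_x], test_y)

print(f"line:       train MSE {line_train:.3f}, test MSE {line_test:.3f}")
print(f"polynomial: train MSE {poly_train:.3f}, test MSE {poly_test:.3f}")
```

The polynomial reproduces the training data perfectly because it has memorized the noise, and that is exactly why it does worse than the simple line on points it has not seen.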

3.7. Failing to always keep learning

It is really hard to learn everything you need to be considered a data scientist (we talk about this in section 4), but if you reach that level, don’t think that you are done learning. This is a characteristic shared with computer science, and it’s something that makes the field both challenging and interesting.

This never-ending learning process comes from the fact that technology changes at a fast pace. As a data scientist, your knowledge of mathematics and statistics may stay useful for years without learning anything new in the field, but things are different with technology and algorithms. If you don’t want to fall behind you need an always-learning mindset, or you may find yourself asking “what is this new technology/algorithm/technique everyone is talking about?”


4. Data Science difficulties

Data science is not easy. Somebody interested in becoming a data scientist sooner or later will see the great Curriculum via Metromap by Swami Chandrasekaran:

3.9 - Data science roadmap

And that’s scary, but it really summarizes the long and hard path to becoming a data scientist.

On the other hand, the same aspiring data scientist might read the article Why becoming a data scientist might be easier than you think by Derrick Harris and get the wrong idea that it only takes some online courses to become a data scientist. Online courses are great, but they are only one of the many resources available.

It’s not too hard to learn a programming language and run a machine learning algorithm from a library on a clean dataset, but that’s not quite the same as being a data scientist. The word “scientist” implies that you create and test hypotheses following the scientific method:

3.10 - The scientific method

It’s also important to note that real-life data science, the kind you find in companies and advanced research, is very different from (and harder than) the kind you find in online tutorials and courses. Brian Caffo, professor at Johns Hopkins University, shows the differences between the attributes of the perfect data science experiment and real life in his class The perfect Data Science Experience:

Attributes of the perfect data science experiment

  • Clearly defined hypotheses of interest, specified a priori.
  • Experimental design available:
    • Randomization used across a treatment of interest.
    • Stratification on nuisance variables.
  • Random sample from a population of interest.
  • Data directly able to interrogate hypotheses.
  • Dataset creation/merging goes smoothly.
  • No missing data or dropout.
  • Analysis is robust without need for advanced modeling.
  • Conclusions are clear:
    • Parsimonious knowledge gained via the experiment.
  • Decision is obvious given the data.

3.11 - The perfect data science experiment

Attributes of data science in real life

  • Data is needed to inform hypotheses and interrogate them.
  • Multiple comparisons are an issue.
  • Experimental design options are limited. Or, data is observational:
    • Randomization is not available.
    • Data is retrospective.
  • The population being studied isn’t the population of interest.
  • The data don’t have the exact measurements that you need to evaluate the hypotheses.
  • Dataset is problematic:
    • Merging is problematic with multiple matches, no matches.
    • Data entry errors.
  • Missing data.
  • Advanced modeling is required, similarly advanced computing needed to fit the model, issues with robustness and bugs.
  • Conclusions are indeterminant.
  • Decision is not substantially further informed by the data.

3.12 - Data science in real life

This is the second of a series of articles that will introduce you to the data science field. In following articles we will go into detail on the skills, mindset, tools and technologies needed for becoming a data scientist.

If you like what you just read, please share it and make sure you are subscribed to our email newsletter.
