Data science has reached every sector of industry and academia. From business, education and health care to science and government, there is increasing interest in extracting information and value from the available data.
In this post you will find an introduction and a detailed description of what Data Science is.
1. Data Science motivation
2. Data Science definition
3. Data Science role
4. Data Science history
5. Data Science examples
1. Data Science motivation
Before we get to the definition of data science, it is worth knowing what motivates people to become data scientists.
Being described as the “Sexiest Job of the 21st Century” (Harvard Business Review – October 1, 2012) and as a “hot new field [that] promises to revolutionize industries from business to government, health care to academia” (The New York Times – April 11, 2013) has fueled this interest, and since then the number of people who want to learn more about the field has not stopped growing.
The growth of available data has not been accompanied by a growth in people with the ability to analyze those data. According to Cisco (Big Data Universe Beginning to Explode), data production will be 44 times greater in 2020 than it was in 2009, while the number of people with the skills to work with those data is not increasing at the same pace.
When it comes to money, data scientist salaries are hard to beat. Let’s take a look.
In its Data Science Salary Study, Burtch Works reveals the following:
The median salary of a junior level data scientist is $91,000, but those managing a team of ten or more data scientists earn base salaries of well over $250,000.
When changing jobs, data scientists see a 16 percent increase in their median base salary.
Searching for “Data Scientist” on indeed.com, we can see that average data scientist salaries for job postings nationwide are 113% higher than average salaries for all job postings nationwide, with an average salary of $123,000:
- Lead Data Scientist: $108,000
- Big Data Scientist: $123,000
- Senior Data Scientist: $112,000
- Data Scientist: $112,000
- Principal Data Scientist: $102,000
- Data Scientist Machine Learning: $104,000
- Big Data: $127,000
Google Trends is also a great tool to see the growing interest in the field. Here we show the interest over time in the terms “data scientist” and “statistician”:
We can conclude that the world needs Data Scientists, and it’s going to pay well for them.
Still not motivated? In section 5 of this post you will find examples of what can be done with data science.
2. Data Science definition
Now that we know the opportunities offered by the field, we’re ready to describe what data science is.
According to Wikipedia:
Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.
Let’s look in more detail at the “interdisciplinary field” part of the definition. One common starting point is the Data Science Venn Diagram by Drew Conway:
You can find Data Science at the intersection of Hacking Skills, Math & Statistics Knowledge and Substantive Expertise. Those are the pillars of data science: computing, statistics, mathematics and quantitative disciplines combined to analyze data for better decision making.
Josh Wills described a data scientist with a quote that has become popular:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
— Josh Wills (@josh_wills) May 3, 2012
Let’s now look at each of these categories in detail.
2.1. Hacking Skills
Hacking or computing skills refer to the ability to build things and find clever solutions to problems. Data scientists need to access, collect, clean and store data, and this is done using several techniques and technologies.
Technical skills needed by data scientists include the following:
- Programming/Coding: Python and R are the go-to programming languages for data scientists. Python is more general-purpose and R is more statistics-oriented, but both share a great community and a huge number of libraries and packages. Examples of Python libraries to work with data are NumPy, SciPy and pandas; other languages in use include Java, SAS and MATLAB.
- Databases: Data is usually stored in databases, and a data scientist needs to know how to access them and extract data from them. Commonly used databases are MySQL, PostgreSQL, Cassandra, MongoDB and CouchDB.
- Visualization and reporting: A common final step of a data science project is presenting the results and conclusions of the analysis in a visual way that is easy to understand for both technical and non-technical people. Examples of common visualization and reporting tools are D3, Tableau, QlikView and R Markdown.
- Big Data technologies: Sooner or later, a data scientist will need to work with data too big to be stored on a single computer. At that point it becomes necessary to use new technologies and techniques, such as Hadoop, MapReduce and Spark.
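As a minimal sketch of the access, clean and store steps described above, the following Python snippet uses the standard library's sqlite3 module as a stand-in for the database servers listed (an assumption, chosen so the example runs without any server). The table name visits, the helper functions and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical raw records: (user, country, seconds_on_page), with messy values.
RAW_ROWS = [
    ("alice", " ES ", "120"),
    ("bob", "us", "n/a"),   # unusable duration, dropped during cleaning
    ("carol", "US", "45"),
    ("dave", "es", "75"),
]


def clean(rows):
    """Normalize country codes and keep only rows with a numeric duration."""
    cleaned = []
    for user, country, seconds in rows:
        if not seconds.strip().isdigit():
            continue
        cleaned.append((user, country.strip().upper(), int(seconds)))
    return cleaned


def store_and_summarize(rows):
    """Store cleaned rows in an in-memory SQLite database and aggregate them."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE visits (user TEXT, country TEXT, seconds INTEGER)"
        )
        conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT country, COUNT(*), AVG(seconds) FROM visits "
            "GROUP BY country ORDER BY country"
        )
        return cursor.fetchall()
    finally:
        conn.close()


summary = store_and_summarize(clean(RAW_ROWS))
for country, n_visits, avg_seconds in summary:
    print(country, n_visits, avg_seconds)
```

With a real server such as MySQL or PostgreSQL the overall shape would likely be similar, but the connection call and driver would differ.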
2.2. Math & Statistics Knowledge
Mathematics and statistics skills are a must for any data scientist. Statistics is the core of a data analysis project, while linear algebra and calculus are fundamental to understanding and applying machine learning algorithms. Being good at maths helps you understand the algorithms deeply, and therefore know when they are applicable and when they are not.
If you look at the intersection of hacking skills and substantive expertise (without math & statistics knowledge) in the Venn diagram above, you will see the label “Danger Zone!”, and that’s a perfect description. Without math and stats knowledge you will just pick algorithms from a library and use them without really understanding what you are doing.
Math and statistics skills needed by data scientists include the following:
- Linear algebra: matrix algebra appears in a lot of machine learning concepts.
- Calculus: multivariate calculus is used in some statistics concepts and machine learning algorithms.
- Descriptive statistics: the first approach to a new dataset involves applying descriptive analysis.
- Statistical inference: inference is needed in predictive techniques to generate new knowledge, not just describe the actual data.
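To make the difference between the last two items concrete, here is a small sketch using only Python's standard statistics module. The sample of daily orders is hypothetical, and the confidence interval uses a normal approximation, which is a simplifying assumption for such a small sample.

```python
import math
import statistics

# Hypothetical sample: daily orders observed over ten days.
orders = [12, 15, 11, 14, 13, 40, 12, 16, 14, 13]

# Descriptive statistics: summarize the data we actually have.
mean = statistics.mean(orders)
median = statistics.median(orders)
stdev = statistics.stdev(orders)  # sample standard deviation (n - 1)

# Statistical inference: an approximate 95% confidence interval for the
# population mean, using the normal approximation (a simplification; with
# only ten points a t-distribution would give a wider interval).
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * stdev / math.sqrt(len(orders))
ci_low, ci_high = mean - margin, mean + margin

print(f"mean={mean:.2f} median={median} stdev={stdev:.2f}")
print(f"approx. 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The single unusually high day pulls the mean above the median, which is the kind of detail a descriptive pass is meant to surface before any modelling.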
2.3. Substantive Expertise
Substantive expertise, or domain knowledge, allows data scientists to start data projects with a clear goal. The ability to ask good questions requires domain understanding, which is why a data scientist can’t create data-based solutions without good knowledge of the industry. According to Brandon Rohrer, there are only five questions data science can answer (What questions can data science answer? – KDnuggets):
- Is this A or B? (two-class classification) or Is this A or B or C or D? (multi-class classification).
- Is this weird? (anomaly detection).
- How much/how many? (regression).
- How is it organized? (clustering).
- What should I do next? (reinforcement learning).
This area also includes the ability to communicate clearly and effectively about the patterns found in data.
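Two of the question types listed above can be sketched in a few lines of Python. This is an illustration only: the z-score rule for “Is this weird?” and the least-squares line for “How much/how many?” are simple stand-ins chosen here, not methods taken from Rohrer's article, and all data and function names are hypothetical.

```python
import statistics


def is_weird(value, history, threshold=3.0):
    """Anomaly detection: flag values far from the historical mean (z-score)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) / stdev > threshold


def fit_line(xs, ys):
    """Regression: ordinary least squares for y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: past card transaction amounts, and ad spend vs. sales.
history = [20, 22, 19, 21, 23, 20, 22]
print(is_weird(21, history))   # a typical amount
print(is_weird(500, history))  # far outside the usual range

ad_spend = [1, 2, 3, 4, 5]
sales = [12, 14, 16, 18, 20]   # constructed so that sales = 2 * ad_spend + 10
slope, intercept = fit_line(ad_spend, sales)
print(slope, intercept)
```

Domain knowledge enters in the choices the code cannot make for you: what counts as a sensible threshold, and whether a straight line is a reasonable model for the quantity in question.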
A more formalized definition of data science comes from NIST’s Big Data Framework:
Data science is the empirical synthesis of actionable knowledge from raw data through the complete data lifecycle process.
We finish this section with a summary of the data science process created for the Harvard CS109 Data Science course:
3. Data Science role
In this section we will take a closer look at the data scientist role.
3.1. Typical background
In the “Become a data scientist in 8 steps” infographic by DataCamp we can see the typical background of a data scientist:
According to the graphic, 77% of data scientists have at least a college education, but note the low 9% for PhDs. Does this mean that you need a master’s degree or a PhD to become a data scientist? Not necessarily. The Quora question “Do I need a Masters/PhD to become a data scientist?” has some good answers to this. To sum up, candidates coming out of certain MS/PhD programs may have advantages, but you can also build the right background using the free resources available online for learning data science.
3.2. Work experience and roles
And what about work experience and roles? The infographic “The data behind today’s data scientists” by CrowdFlower shows the following data:
3.3. Data scientist groups
In the report Analyzing the Analyzers, authors Harlan Harris, Sean Murphy, and Marck Vaisman examine their survey of several hundred data science practitioners. Based on their survey, they were able to identify five groups of data scientists (descriptions from the post “5 groups of Data Scientists: Which group are you in?“):
- Data Businessperson: mainly found in project management roles. They are great at dealing with other professionals and have a comprehensive knowledge of the data science process.
- Data Creative: they are good at the day-to-day work of a data scientist (statistics, programming and big data technologies).
- Data Developer: their day-to-day work involves getting data from different sources and sorting the data in large databases, querying those databases, and analyzing the results to derive meaningful information from them.
- Data Researcher: they come from the academic world and have a strong background in statistics. They also tend to have PhDs. Business skills are not their strength, but they are excellent analysts.
- Data scientists Generic: similar to Data Businesspeople but without the immense experience or the intense business focus. They are more balanced than the other four types of data scientists, because they are as flexible as Data Creatives but have a better understanding of the business world.
4. Data Science history
We saw before that data science is a response to the growing amount of data created every second around the world in recent years. But the term “data science” is older than that: in November 1997, Chien-Fu Jeff Wu gave a lecture entitled “Statistics = Data Science?” at the University of Michigan. He popularized the term and argued that statistics should be renamed data science, and statisticians data scientists, since they spent most of their time manipulating and experimenting with data.
In 2001, William S. Cleveland introduced data science as an independent discipline, extending the field of statistics to incorporate “advances in computing with data” in his article “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics“. In his report, Cleveland establishes six technical areas which he believed to encompass the field of data science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation and theory.
In April 2002, the International Council for Science started the Data Science Journal, a publication focused on issues such as the description of data systems, their publication on the internet, applications and legal issues. In January 2003, Columbia University began publishing The Journal of Data Science, to provide a platform for all data workers to present their views and exchange ideas.
A great infographic about the history of data science can be found in the presentation Impact of Big Data on Analytics by Capgemini:
You can find more about the history of data science in the Forbes article A Very Short History Of Data Science.
5. Data Science examples
The results of the work done by data scientists can be found in products we all use every day, like Google, Amazon and Facebook. But first let’s see some of the most famous data science examples:
5.1. Netflix prize
The Netflix Prize was an open competition held by Netflix for the best algorithm to predict user ratings for films and discover how much someone is going to enjoy a movie based on their movie preferences.
The competition offered a US$1,000,000 prize. The winning team outperformed Netflix’s own algorithm for predicting ratings by 10.06%.
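The Netflix Prize ranked submissions by root mean squared error (RMSE), so an improvement figure like the one above is a relative reduction in RMSE. The sketch below shows that calculation on a handful of hypothetical ratings (not Netflix data); the function names are illustrative.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def improvement_pct(baseline_rmse, candidate_rmse):
    """Relative RMSE reduction of a candidate over a baseline, in percent."""
    return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse


# Hypothetical ratings on a 1-5 scale.
actual = [4, 3, 5, 2, 4]
baseline = [3, 3, 4, 3, 5]    # predictions from an existing system
candidate = [4, 3, 4, 2, 5]   # predictions from a new algorithm

baseline_rmse = rmse(baseline, actual)
candidate_rmse = rmse(candidate, actual)
improvement = improvement_pct(baseline_rmse, candidate_rmse)

print(f"baseline RMSE:  {baseline_rmse:.4f}")
print(f"candidate RMSE: {candidate_rmse:.4f}")
print(f"improvement:    {improvement:.2f}%")
```

Because RMSE squares each error before averaging, large misses are penalized more heavily than small ones, which makes every extra percentage point of improvement progressively harder to win.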
5.2. Google flu trends
Google Flu Trends is a famous example of a data science application. The idea behind this project was to detect the presence of flu-like illness in a population by aggregating and analyzing large numbers of Google search queries. Google Flu Trends compared these findings to a historic baseline level of influenza activity for the corresponding region and then reported the activity level as either minimal, low, moderate, high, or intense.
According to the initial Google paper (“Detecting influenza epidemics using search engine query data”), the Google Flu Trends estimates closely matched CDC (Centers for Disease Control and Prevention) data, with a reported mean correlation of 0.97, often summarized as “97% accurate”. The aggregated query data can be found on Google public data.
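One common way to compare a search-based estimate with official surveillance figures is the Pearson correlation coefficient. The following sketch computes it for two short, hypothetical weekly series (not real Google or CDC data); it illustrates the comparison only and is not the model used by Google Flu Trends.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly values over one flu season.
query_share = [0.8, 1.1, 1.9, 3.2, 2.5, 1.4]  # % of searches that are flu-related
ili_visits = [1.0, 1.3, 2.2, 3.5, 2.9, 1.5]   # % of doctor visits for flu-like illness

r = pearson(query_share, ili_visits)
print(f"correlation between the two series: {r:.3f}")
```

A high correlation means the two series rise and fall together; it does not by itself guarantee that the estimated levels are right, which is worth keeping in mind when reading accuracy claims of this kind.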
5.3. Obama campaign
The 2012 reelection campaign of Barack Obama was formally announced on April 4, 2011. During the last 18 months of the campaign, an entire team of data scientists helped the campaign improve the targeting of its fund-raising appeals and ads.
The team, led by Chief Scientist Rayid Ghani, collected and combined information on each registered U.S. voter to discover patterns.
5.4. Moneyball
Moneyball is a film based on Michael Lewis’s 2003 nonfiction book, an account of the Oakland Athletics baseball team’s 2002 season and their general manager Billy Beane’s attempts to assemble a competitive team using computer analysis to identify undervalued players.
The film reflected on the role of performance analysis in baseball and lessons that can be applied to data-driven organizations (The Lessons of Moneyball for Big Data Analysis).
5.5. Minard map
Charles Joseph Minard was a French civil engineer recognized for his significant contribution to the field of information graphics in civil engineering and statistics, and especially noted for his representation of numerical data on geographic maps.
The map shown above is a graphic created in 1869 that shows the losses of Napoleon’s army during his Russian campaign (1812-1813). According to Edward Tufte, it is one of the “best statistical drawings ever created”. You can find a detailed analysis of the visualization in the article Analyzing Minard’s Visualization Of Napoleon’s 1812 March.
5.6. Generic examples
We can find more practical data science examples in the following list created by Data Science Series (Ten Practical Big Data Benefits):
- Dialogue with consumers: Big Data allows you to profile consumers so that you can engage in an almost one-on-one, real-time conversation with them.
- Re-develop your products: Big Data can help you understand how others perceive your products so that you can adapt them, or your marketing.
- Perform risk analysis: Big Data allows you to scan and analyze newspaper reports or social media feeds so that you permanently keep up to speed on the latest developments in your industry and its environment.
- Keeping your data safe: you can map the entire data landscape across your company with Big Data tools, thus allowing you to analyze the threats that you face internally.
- Create new revenue streams: Big Data insights that you gain from analyzing your market and its consumers could be sold as non-personalized trend data to large industry players operating in the same segment as you.
- Customize your website in real time: Big Data analytics allows you to personalize the content or look and feel of your website in real time to suit each consumer entering your website.
- Reducing maintenance costs: Big Data can spot failing grid devices and predict when they will give out.
- Offering tailored healthcare: human genome mapping and Big Data tools bring medicine closer than ever to finding the genetic determinants that cause a disease and developing drugs expressly tailored to treat those causes.
- Offering enterprise-wide insights: with Big Data tools, the technical teams can develop systems and install interactive and dynamic visualization tools that allow business users to analyze, view and benefit from the data.
- Making our cities smarter: an increasing number of smart cities are leveraging Big Data tools for the benefit of their citizens and the environment.
This is the first of a series of articles that will introduce you to the data science field. In the following articles we will discuss the difficulties you’ll find on your way to becoming a data scientist, and we will go into detail on the tools and technologies needed for it.