Essential skills of a Data Scientist


R Problem Overview

What are the relevant skills in the arsenal of a Data Scientist? With new technologies coming in every day, how does one pick and choose the essentials?

A few ideas germane to this discussion:

  • Knowing SQL and a relational database such as MySQL or PostgreSQL was enough until the advent of NoSQL and non-relational databases. MongoDB, CouchDB, etc. are becoming popular for working with web-scale data.
  • Knowing a stats tool like R is enough for analysis, but to create applications one may need to add Java, Python, and others to the list.
  • Data now comes in the form of text, URLs, and multimedia, to name a few, and there are different paradigms associated with their manipulation.
  • What about cluster computing, parallel computing, the cloud, Amazon EC2, Hadoop?
  • OLS regression now has artificial neural networks, random forests, and other relatively exotic machine learning/data mining algorithms for company.


R Solutions

Solution 1 - R

To quote from the intro to Hadley Wickham's PhD thesis:

> First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future

Step 1 almost certainly involves data munging, and may involve database accessing or web scraping. Knowing people who create data is also useful. (I'm filing that under 'networking'.)

Step 2 means visualisation/plotting skills.

Step 3 means stats or modelling skills. Since that is a stupidly broad category, the ability to delegate to a modeller is also a useful skill.

The final step is mostly about soft skills like introspection and management-type skills.

Software skills were also mentioned in the question, and I agree that they come in very handy. Software Carpentry has a good list of all the basic software skills you should have.

Solution 2 - R

Just to throw in some ideas for others to expound upon:

At some ridiculously high level of abstraction all data work involves the following steps:

  • Data Collection
  • Data Storage/Retrieval
  • Data Manipulation/Synthesis/Modeling
  • Result Reporting
  • Story Telling

At a minimum, a data scientist should have some skill in each of these areas. But depending on specialty, one might spend a lot more time in a limited range.

Solution 3 - R

JD's ideas are great, and for a bit more depth on them read Michael Driscoll's excellent post "The Three Sexy Skills of Data Geeks":

  1. Skill #1: Statistics (Studying)
  2. Skill #2: Data Munging (Suffering)
  3. Skill #3: Visualization (Storytelling)

Solution 4 - R

At dataist the question is addressed in a general way with a nice Venn diagram:

[Image: Venn diagram]

Solution 5 - R

JD hit it on the head: Storytelling. Although he did forget the OTHER important story: the story of why you used <insert fancy technique here>. Being able to answer that question is far and away the most important skill you can develop.

The rest is just hammers. Don't get me wrong, stuff like R is great. R is a whole bag of hammers, but the important bit is knowing how to use your hammers and whatnot to make something useful.

Solution 6 - R

I think it's important to have command of a commercial database or two. In the finance world that I consult in, I often see DB2 and Oracle on large iron and SQL Server on the distributed servers. This basically means being able to read and write SQL code. You need to be able to get data out of storage and into your analytic tool.
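As a rough illustration of "getting data out of storage and into your analytic tool", here is a minimal sketch in Python. It is not from the original answer: the in-memory SQLite database is only a stand-in for a commercial database such as Oracle or SQL Server, and the `trades` table, its columns, and its values are invented for the example.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the commercial
# database. The table name, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (desk TEXT, notional REAL)")
conn.executemany(
    "INSERT INTO trades (desk, notional) VALUES (?, ?)",
    [("rates", 100.0), ("rates", 250.0), ("fx", 75.0)],
)
conn.commit()

# Let the database do the aggregation, then pull only the summary rows
# into the analysis environment.
rows = conn.execute(
    "SELECT desk, COUNT(*), SUM(notional) "
    "FROM trades GROUP BY desk ORDER BY desk"
).fetchall()

# Reshape the result into a structure that is convenient for analysis.
summary = {desk: {"n": n, "total": total} for desk, n, total in rows}
print(summary)
conn.close()
```

The same pattern (write SQL, fetch rows, reshape) carries over to other databases through their own drivers, although connection details differ.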

In terms of analytical tools, I believe R is increasingly important. I also think it's very advantageous to know how to use at least one other stat package as well. That could be SAS or SPSS... it really depends on the company or client that you are working for and what they expect.

Finally, you can have an incredible grasp of all these packages and still not be very valuable. It's extremely important to have a fair amount of subject matter expertise in a specific field and be able to communicate to relevant users and managers what the issues are surrounding your analysis as well as your findings.

Solution 7 - R

Matrix algebra is my top pick

Solution 8 - R

  • The ability to collaborate.

Great science, in almost any discipline, is rarely done by individuals these days.

Solution 9 - R

There are several computer science topics that are useful for data scientists, many of which have already been mentioned: distributed computing, operating systems, and databases.

Analysis of algorithms, that is, understanding the time and space requirements of a computation, is the single most important computer science topic for data scientists. It's useful for implementing efficient code, from statistical learning methods to data collection, and for determining your computational needs, such as how much RAM or how many Hadoop nodes you require.
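A back-of-the-envelope space estimate of the kind described above might look like the following sketch. The function names, the 50-column dataset, and the choice of 8-byte floating-point values are assumptions made for illustration. The point is only that a full pairwise distance matrix grows quadratically with the number of rows, while the data itself grows linearly.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def dense_matrix_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory for a dense numeric matrix (assumes 8-byte values by default)."""
    return n_rows * n_cols * bytes_per_value


def pairwise_distance_bytes(n_rows, bytes_per_value=8):
    """Memory for a full n x n distance matrix: quadratic in n_rows."""
    return n_rows * n_rows * bytes_per_value


# Hypothetical dataset sizes with 50 numeric columns each.
for n in (10_000, 100_000, 1_000_000):
    data_gib = dense_matrix_bytes(n, 50) / GIB
    dist_gib = pairwise_distance_bytes(n) / GIB
    print(f"{n:>9,} rows: data {data_gib:8.3f} GiB, distances {dist_gib:10.1f} GiB")
```

Running an estimate like this before choosing a method shows quickly whether a computation fits in RAM on one machine or needs a different algorithm or a cluster.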

Solution 10 - R

Patience - both for getting results out in a reasonable fashion and then for being able to go back and change them to what was 'actually' required.

Solution 11 - R

Study Linear Algebra on MIT OpenCourseWare (18.06) and supplement your study with the book "Introduction to Linear Algebra". Linear algebra is one of the essential skill sets in data analytics, in addition to the skills mentioned above.
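One example of why linear algebra matters, offered as a sketch rather than as part of the original answer: the OLS regression mentioned in the question reduces to solving the normal equations (X'X) b = X'y. For a single predictor plus an intercept, X'X is a 2x2 matrix that can be inverted by hand. The function name and the data below are made up for illustration.

```python
def ols_simple(xs, ys):
    """Fit y = b0 + b1*x by solving the 2x2 normal equations (X'X) b = X'y."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X'X = [[n, sx], [sx, sxx]] and X'y = [sy, sxy].
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("x values are all identical; X'X is singular")
    # Multiply the inverse of X'X by X'y, written out element by element.
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b0, b1 = ols_simple(xs, ys)
print(b0, b1)
```

With more predictors the same equations hold, but the matrix is larger and is normally solved with a numerical library rather than by hand.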


All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | harshsinghal | View Question on Stackoverflow
Solution 1 - R | Richie Cotton | View Answer on Stackoverflow
Solution 2 - R | JD Long | View Answer on Stackoverflow
Solution 3 - R | DrewConway | View Answer on Stackoverflow
Solution 4 - R | mropa | View Answer on Stackoverflow
Solution 5 - R | Byron Ellis | View Answer on Stackoverflow
Solution 6 - R | Phil Rack | View Answer on Stackoverflow
Solution 7 - R | Neil McGuigan | View Answer on Stackoverflow
Solution 8 - R | wkmor1 | View Answer on Stackoverflow
Solution 9 - R | mattrepl | View Answer on Stackoverflow
Solution 10 - R | Paddy | View Answer on Stackoverflow
Solution 11 - R | rohit | View Answer on Stackoverflow