What’s a data scientist?


I recently attended the Open Data Science Conference (ODSC) in London. The very first presentation was given by Gael Varoquaux. One of the first questions he asked was an obvious one at such an event: what’s data science? In a nutshell, Gael described data science as a combination of statistics and coding. Whilst the term data science has become more prominent in recent years, the concept of using coding to do statistics isn’t, of course, new. I’ve been doing it for many years in financial markets (and other folks have been doing it a lot longer than me!), although admittedly this week was the first time someone asked me “are you a data scientist?” Whilst it took me a few moments to think of a reply, in the end “yes” seemed like my best answer.


What is different is that in recent years we have seen an explosion in the data being generated by individuals and businesses. Many businesses collect large amounts of data. Take, for example, a supermarket. Whilst it might seem obvious that it collects data on what it sells both in store and online, it also collects data from all manner of other areas and sensors. A lot of this can be so-called exhaust data, which is collected as a by-product of its usual business. At the same time as the amount of data being collected has increased, computing power has multiplied and become more accessible, notably through the cloud. Open source tools like Python and R, and the various libraries built on top of them like pandas, have also made it easier to analyse this data. The combination of all of these things has given rise to data science, and to data scientists who work in a multitude of different areas and can assess the value of this data. Indeed, at ODSC, whilst everyone attending was a data scientist, and there were commonalities in what we did, the tools we used and the techniques we might apply, the range of industries represented was incredibly diverse, from trading to healthcare to travel, to name just a few examples.


It is only in recent years that “data science” has become something taught in university curricula under that specific name. As a result, most data scientists today have come to data science from other fields, whether statistics, computing or engineering. In some cases, they might have come to data science from multiple areas, and I include myself in this bucket. When I graduated from university (let’s just say a few years ago…!) with a degree in maths and computer science, I had covered many areas associated with data science, in both statistics and programming. (I would love to pretend I had some sort of foresight about the growth of data science, but I ended up choosing the course mainly because I love maths and enjoy coding!)


So far I have stressed the commonalities between what data scientists do and the types of tools they might use. However, it is also key to have some element of domain-specific knowledge, which is sometimes lost in all the buzz about data science. One talk I particularly enjoyed was given by Norbert Kraft from Nokia Bell Labs, the legendary research institution which saw the invention of UNIX and C. His talk was on detecting anomalies in multivariate time series, using the particular example of telecoms, where the aim is to detect degradation in mobile phone signals (e.g. too many dropped calls). More broadly, one important point he made is that finance is different: it doesn’t follow physical laws in the same way as other data science problems. The trading environment keeps on changing, and you are also in a race with other participants. By contrast, if you are doing something like handwriting recognition, whilst the problem is not easy to solve, it doesn’t keep changing.


Data science is an exciting area, building upon many crucial fields including statistics and computing. Furthermore, it is going to become even more important in the years ahead! If you have the data, data science can help you unlock its value, in particular when combined with domain knowledge.