Data science in a database?

20181125 Richmond

If you have a burger, a pretty important part is the patty. If the patty is awful, the rest of the burger is going to be pretty bad. However, just having good beef is not sufficient. You can overcook it, or put too little fat in the patty, and ruin the beef. In a sense, data science is like making a burger. The data is the beef, and how you cook it is basically all the analytic work you do on top of it. Ok, maybe the analogy isn’t perfect, but I was mainly looking for an excuse to put burgers into the article.

 

Within finance, you are likely to be dealing with time series. Typically these are prices, but they can also be derived from many other datasets which have been structured into this form. Often the data is stored in a database (or flat files), loaded up, and then analysed (say using Python and libraries like pandas). However, we often face a choice: how much analytical work should we do within the database first, before outputting the results to our main analytics platform, such as Python?
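To make that concrete, here’s a minimal sketch of the “load it all and analyse it in Python” approach, using pandas and SQLAlchemy. The connection string, table and column names (fx_spot, ticker, price) are made up purely for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string, table and column names, purely for illustration
engine = create_engine("postgresql://user:password@localhost:5432/marketdata")

query = """
    SELECT timestamp, price
    FROM fx_spot
    WHERE ticker = 'EURUSD'
      AND timestamp >= '2018-01-01'
    ORDER BY timestamp
"""

# Pull the raw series into a pandas DataFrame, indexed by timestamp
df = pd.read_sql(query, engine, parse_dates=["timestamp"], index_col="timestamp")

# From here the analysis happens in Python, eg. daily sampling and returns
daily = df["price"].resample("B").last()
returns = daily.pct_change().dropna()
```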

 

I haven’t got a right or a wrong answer for how much analytics you should do inside your database first. However, it is worth thinking about. Some databases like KDB are heavily optimised for dealing with time series data (of a very high frequency). I’ve only recently started looking at KDB, and whilst its query language isn’t always that easy (and I certainly have a long way to go to mastering it), it does seem to do a lot of slicing and dicing of the data very quickly. If you’re interested in KDB, I’d recommend the new book by Jan Novotny, Paul Bilokon et al, which is coming out next year (there’s also a KDB course happening this week based on the book). Of course, KDB isn’t the only database which specialises in time series. There are also open source alternatives, such as Arctic, a high performance time series wrapper for MongoDB, which I also use (although typically more for fetching data and then doing the analytics elsewhere). There are also some interesting GPU based databases, such as Kinetica, which can handle very large datasets and do realtime analytics on them with relatively user friendly queries; I’m hoping to investigate this in the future. We can also use tools such as Apache Spark if we have massive datasets and need to do computations on clusters.
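As an example of the Arctic route, below is a rough sketch of writing and reading a pandas DataFrame, based on how I’ve used the library. It assumes a MongoDB instance running locally, and the library and symbol names (fx_spot, EURUSD) are just placeholders.

```python
import pandas as pd
from arctic import Arctic

# Connect to a locally running MongoDB instance via Arctic
store = Arctic("localhost")

# Create a versioned library to hold our time series
store.initialize_library("fx_spot")
library = store["fx_spot"]

# Write a small pandas DataFrame of prices under a symbol name
prices = pd.DataFrame(
    {"price": [1.1350, 1.1362, 1.1371]},
    index=pd.to_datetime(["2018-11-23 09:00", "2018-11-23 09:01", "2018-11-23 09:02"]),
)
library.write("EURUSD", prices, metadata={"source": "example"})

# Read it back: Arctic returns a versioned item, whose .data attribute is the DataFrame
item = library.read("EURUSD")
df = item.data
```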

 

You could opt to do absolutely all your data analysis inside a database. How much you decide to do will depend on whether the database offers all the features you want. There’s also a question of how easy it is to code things up in a database query language versus using something like pandas in Python. If we do a lot within the database, we also want to make sure the code is reusable and maintainable. This might involve writing higher level code to wrap around the database queries you write, to ensure better code reuse, kind of similar to what SQLAlchemy does in wrapping SQL in Python objects (ok, I admit, some people will prefer just to write queries directly, but I generally prefer wrappers for this type of thing!)
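By way of illustration, here’s a rough sketch of the kind of thin wrapper I have in mind, hiding the SQL behind a Python class. The class name, connection string, table and column names are all hypothetical, and a real version would need more care around errors and connection handling.

```python
import pandas as pd
from sqlalchemy import create_engine, text


class MarketDataStore:
    """Thin wrapper around a SQL database, so analytics code never sees raw SQL."""

    def __init__(self, connection_string):
        self._engine = create_engine(connection_string)

    def fetch_prices(self, ticker, start, end):
        # The query lives in one place, so it can be tweaked or optimised
        # without touching any of the analytics code that calls this method
        query = text("""
            SELECT timestamp, price
            FROM fx_spot
            WHERE ticker = :ticker
              AND timestamp BETWEEN :start AND :end
            ORDER BY timestamp
        """)
        params = {"ticker": ticker, "start": start, "end": end}
        return pd.read_sql(query, self._engine, params=params,
                           parse_dates=["timestamp"], index_col="timestamp")


# Analytics code then deals in DataFrames rather than SQL strings
store = MarketDataStore("postgresql://user:password@localhost:5432/marketdata")
eurusd = store.fetch_prices("EURUSD", "2018-01-01", "2018-11-23")
```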

 

In practice, there’s probably a boundary, whereby some pre-processing is worth doing inside the database (if it’s much faster there), with the rest done outside, particularly if you are reliant on a lot of Python specific procedures. I haven’t got the right answer for where exactly that boundary lies; it depends on your specific case. If your dataset is relatively small, then maybe just doing everything in pandas is quick anyway, and you don’t need to bother thinking about any other solutions. However, with a massive dataset, we might have to utilise other options.
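As a sketch of what that boundary might look like in practice, the example below lets the database aggregate raw ticks down to one-minute bars (here using PostgreSQL’s date_trunc, purely as an illustration, with made-up table names), and then does the more Python specific work in pandas.

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost:5432/marketdata")

# Step 1: let the database do the heavy lifting, downsampling raw ticks into
# one-minute bars so that only a small result set crosses the wire
# (date_trunc is PostgreSQL specific; other databases have their own equivalents)
query = text("""
    SELECT date_trunc('minute', timestamp) AS bar_time,
           avg(price)                      AS avg_price,
           count(*)                        AS tick_count
    FROM fx_spot_ticks
    WHERE ticker = :ticker
    GROUP BY bar_time
    ORDER BY bar_time
""")

bars = pd.read_sql(query, engine, params={"ticker": "EURUSD"},
                   parse_dates=["bar_time"], index_col="bar_time")

# Step 2: do the Python specific work in pandas, eg. rolling volatility of
# minute-to-minute returns over the last hour
returns = bars["avg_price"].pct_change()
rolling_vol = returns.rolling(window=60).std()
```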