As regular readers of my posts may have realised, I kind of like burgers. It’s a simple meal, but somehow very satisfying. There’s obviously vast differences between the burgers on sale, in terms of quality. It isn’t necessarily the case that the most expensive burger will be best. In practice, excessively expensive burgers, end up having too many ingredients, which result in a burger which is too rich to eat. In practice, what’s most important is to get the basics right.
What’s the most important basic ingredients in a burger? Is it the type of cheese? Is it the beef that goes into the burger patty? Or perhaps it’s the sauce? It’s a tough one. However, just as important as the ingredients themselves is something else, which is somehow easy to overlook: a fridge. If you don’t have a fridge, it’s probably not going to be particularly easy to store cheese or meat safely. Basically no fridge, no burger!
When it comes to analysing financial markets, the ingredients are key, essentially the data. However, with data, comes the need to storage. We need to have a good “fridge” to use our burger metaphor. Typically most financial data is in some sort of time series type format. You have some timestamp, and then some observation, like a price, economic data, news article etc. So the question basically amounts to what is the best way to store time series? In this article I’ll go through some of the solutions for storing time series data. It is of course not an exhaustive list but hopefully it’s still a start.
The simplest way to store a time series is in CSV format. It’s a very simple format and is pretty much universal, so is often used by data vendors for distributing data. However, it can be slow to parse, so it isn’t something you want to be continually reading and writing from. We could also use other flat file formats such as Parquet or HDF5, which can be quicker. If we using Python’s pandas module it’s very easy to read/write in any of these flat file formats. There’s an interesting library PyStore by Ran Aroussi, which basically uses Parquet (together with heavy compression) to store time series on disk, but partitions the files by date, to make it easy to access specific data ranges. However, using flat files might be tricky if we’re in a team and multiple users want to write to the files at the same time. A proper database would be better for that to handle this interaction. Otherwise, we’re basically just going to end up adding features, that basically result in us developing a full database.
When it comes to databases, SQL databases are probably the most well known, like SQL Server, PostgreSQL and MySQL (and there are many other similar SQL databases). They store data in tables, in a predefined format. For low frequency time series such as daily data, these are fine. If we want to store trade data for a trading desk doing a reasonable volume of trades, it should be fine. However, if you want to use very high frequency time series (eg. quote data, trade data), these solutions might be too slow. One interesting new time series database, which extends PostgreSQL is Timescale which creates the concept of a hypertable to store time series data. It basically ends up storing time series partitioned in various tables (which in practice is a technique used by many time series style databases). Whilst I haven’t used it, the benchmarks I’ve seen, seem impressive.
An alternative is NoSQL database, where data is stored in documents, sometimes in a key-value style store (or could be graph based etc.), which don’t have specific predefined format. MongoDB is probably one of the most well known of these NoSQL databases. Redis is also popular, and is typically used as temporary key-value store (although it is possible to persist data to disk). I’ve used Redis extensively to cache time series data temporarily.
Arctic, written by UK quant hedge fund Man-AHL is a library that I use extensively (PyStore mentioned above is based on Arctic, albeit with a different back end store). It takes pandas dataframes and then compresses them into smaller chunks and pushes them into MongoDB for storage. It’s very fast to query specific date ranges and also to write data. However, to do any sort of analysis, you need to take out the data from Arctic and use Python (rather than doing in-database computation). However, it is open source, which is definitely an advantage for me. Another quant fund who has open sourced a database storage solution is Two Sigma. They’ve released Flint which stores and analyses time series using Apache Spark.
By its nature time series data is columnar. kdb+ by kx systems specifically takes this structure into account to optimise storage (as do many time series database). It is a very fast database and enables you to work in memory or on disk. It uses q as it’s query language, and there’s a steep learning to curve to understand it. There is also a subset of it called Q-SQL, which looks similar to SQL, but isn’t exactly the same. I’ve used kdb+ a small amount (I’m certainly no q expert!), writing an adaptor to allow users of my FX TCA Python library to query market data stored on kdb+. I have to admit, I was very impressed by how fast it was, but did also find the language quite tricky. The language is very powerful and you can do a lot in-database analysis. My goal for 2019, is definitely to learn q properly! If you’re interested in learning q, I strongly recommend getting the new book on q by Jan Novotny, Paul Bilokon, Aris Galiotos and Frederic Deleze. kdb+ however is not open source, so might be as accessible as something like Arctic/MongoDB, but there is a free version available for non-commercial use. There are also other dedicated time series databases like InfluxDB. In my experience, kdb+ was quite a bit faster however. InfluxDB is however, open source and I found it’s query language a lot easier to understand when I’ve used it briefly. Again, I’ve added a wrapper in my FX TCA Python library to query market data stored in InfluxDB.
GPUs have become very popular for speeding up computation especially for machine learning and are supported by libraries like TensorFlow. However, there are also several GPU based databases which could be used for storing time series. These include Kinetica, OmniSci (formally MapD) and a few more academically focused projects. Also very recently Uber open sourced their own GPU database AresDB, which I’m very keen to investigate. If you’ve used GPU databases to store time series I’d love to hear about your experiences.
Many of the above databases are available on the cloud in fully managed instances (like MongoDB Atlas). In a fully managed solution, the cloud provider takes care of backing up your database, allocating resources etc. You can of course, create your own database instances locally or on a cloud based Linux box (eg. Amazon EC2), but then you need to manage that too. There are also some databases like Amazon’s DocumentDB (which is compatible with MongoDB), Google’s BigQuery and Amazon Timestream (available soon) which are only available on the cloud. I’ve used BigQuery to store FX tick data and also do some basic computations and it seems fairly easy to use and it scales to massive datasets, but I’m hoping to do more analysis in it in the future. In general, cloud based databases make sense if you are doing all your computation in the cloud (eg. using Amazon EC2 or Google Cloud Compute Engine) and want the data nearby (and don’t want the hassle of managing local servers).
Which database is best for time series is best? I guess it depends on your specific needs. However, hopefully, this article will have given you some ideas about this topic. I’d also love to hear your thoughts on solutions for storing time series. Just make sure you invest in a good fridge!