In the dim and distant past (let’s call this 2007!), if you wanted to work with time series, you might have had to resort to writing your own time series library. When I say a “time series library”, I’m not really talking about sophisticated techniques for manipulating time series. Instead, I’m talking about ways of combining time series, joining them, aligning them, resampling them and so on. Many of these tasks are not “complicated”, but they are often a necessary part of working with time series.
I remember in those dim and distant days, spending many hours writing a Java-based time series library at Lehman Brothers to do the aforementioned tasks. I’m not sure whether it was the most exciting coding task I’ve ever done, but it was a necessary prerequisite for doing the fun stuff, such as creating a library for running an intraday trading strategy on G10 FX live and also for backtesting it.
These days, if you’re working with time series, you have a lot more help. If we focus on Python, there are a plethora of different libraries for dealing with time series (outside of Python you’ll also find many well featured time series libraries for languages such as R and Julia). So what are the go-to libraries for working with time series in Python? I’ve listed a few below, and during my Python workshops, I teach folks how to use some of these.
- Pandas – This is by far the most well known library for manipulating and storing time series. If your dataset is small (i.e. sits in memory), it’s great; otherwise you need to batch your computation
- NumPy – NumPy isn’t specifically designed for time series, but in some cases you might wish to do your computation in NumPy before wrapping the result in a Pandas DataFrame later, mainly for reasons of performance. Numba is also compatible with NumPy arrays (though it doesn’t work with Pandas), and you can use Numba to speed up your algorithms.
- Dask – Dask is a flexible library for parallel computing in Python, and it includes the Dask DataFrame object, which lets you do Pandas-like time series operations on massive datasets, even if they don’t fit in memory. Dask does all the batch computation for you. There’s also a new service, Coiled, which lets you spin up Dask clusters easily on the cloud.
- Vaex – Like Dask, Vaex allows you to do out-of-core computations on large datasets that don’t fit in memory. It does a lot of cool stuff like memory mapping the data (like kdb+/q) and utilises your memory very efficiently. In my experience it has been super fast, and quite a bit quicker than Dask. However, the API is a bit different from both Dask and Pandas, so it can take a bit more time to get used to (see this Jupyter notebook I wrote on Vaex).
- Datatable – A high performance Python DataFrame library from H2O
- cuDF – A GPU-accelerated DataFrame library, part of NVIDIA’s RAPIDS suite
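To make the bread-and-butter tasks mentioned earlier concrete (combining, aligning and resampling time series), here’s a minimal Pandas sketch. The two daily series and their values are made up purely for illustration:

```python
import pandas as pd

# Two toy daily series with deliberately mismatched dates.
idx_a = pd.date_range("2021-01-01", periods=4, freq="D")
idx_b = pd.date_range("2021-01-02", periods=4, freq="D")
a = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx_a, name="a")
b = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx_b, name="b")

# Joining via concat aligns both series on the union of their
# DatetimeIndexes; dates missing from either side become NaN.
joined = pd.concat([a, b], axis=1)

# Resample to a 2-day frequency, averaging within each bucket
# (NaNs are skipped by default).
resampled = joined.resample("2D").mean()

print(joined.shape)     # (5, 2): union of the two date ranges
print(resampled.shape)  # (3, 2): three 2-day buckets
```

The nice thing is that the index does all the alignment work for you; there’s no manual bookkeeping of which dates exist in which series.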
Alternatively, we can do the number crunching for time series within a database, before exporting our aggregated results to Python (often in the form of a Pandas DataFrame). The advantage of this approach is that we don’t have the added step of pulling raw data out of the database before processing it. Instead, we process the data close to where it is stored, in the database itself. This can be particularly useful for tasks such as cleaning the data, resampling it, and getting it into a more manageable size for Python. However, it is often possible to do a lot more within a database, even running machine learning models, if you’d like (and I’d strongly recommend reading Machine Learning and Big Data with kdb+/q for more information). Below are a few databases which are particularly well suited to time series data (note this isn’t an exhaustive list). I’d also recommend looking at this webpage from Mark Litwintschik, where he benchmarks a bunch of these databases (and more) doing queries on the well known NYC taxi rides open dataset.
- kdb+/q – Kx’s database, which uses the q language and is used on many sell-side eTrading desks
- Shakti – Arthur Whitney, who founded Kx, has recently started another database, which uses k
- InfluxDB – an easy-to-use time series database
- ClickHouse – an open source column-oriented database that works well for time series
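As a toy illustration of the “process close to where the data is stored” idea, here’s a sketch using Python’s built-in SQLite as a stand-in for a proper time series database like those above. The table name and tick values are invented; only the small aggregated result set crosses into Python:

```python
import sqlite3

# An in-memory SQLite database stands in for kdb+/q, InfluxDB, etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (ts TEXT, price REAL)")
conn.executemany(
    "INSERT INTO ticks VALUES (?, ?)",
    [
        ("2021-01-01 09:00", 100.0),
        ("2021-01-01 15:00", 102.0),
        ("2021-01-02 09:00", 101.0),
        ("2021-01-02 15:00", 99.0),
    ],
)

# Aggregate to daily averages inside the database, so Python only
# receives two rows rather than every tick.
rows = conn.execute(
    "SELECT substr(ts, 1, 10) AS day, AVG(price) "
    "FROM ticks GROUP BY day ORDER BY day"
).fetchall()
print(rows)  # [('2021-01-01', 101.0), ('2021-01-02', 100.0)]
```

In practice you’d do the same thing with q, InfluxQL or SQL against one of the databases above, and then read the small result into a Pandas DataFrame.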
A lot of the above can also be run in the cloud. However, there are also a number of cloud-specific services that can be used for processing time series.
- Google BigQuery – Google’s cloud based database which can be used with massive datasets
- AWS Athena – This service allows you to do database queries on data stored in S3 buckets (eg. in Parquet files)
- AWS Timestream – AWS’s new database designed specifically for time series (see this Jupyter notebook I wrote on Timestream).
There are lots of solutions for dealing with time series data if you’re using Python, and the above isn’t even an exhaustive list! We can also use various databases or cloud based solutions to do the heavy-duty number crunching for us, and then export smaller datasets into Python for further processing. Precisely which library or technology you use will depend on your dataset and how large it is.
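As a final footnote to the batching point made for Pandas earlier, here’s a minimal sketch of processing a dataset in chunks. An in-memory CSV stands in for a file too big to load at once, and the column names are made up:

```python
import io
import pandas as pd

# A toy CSV stream standing in for a dataset too big for memory.
csv = io.StringIO(
    "ts,value\n" + "\n".join(f"2021-01-0{d},{d}" for d in range(1, 10))
)

# Read in small batches, keeping only a running aggregate, so memory
# usage stays bounded regardless of how big the file is.
total, count = 0.0, 0
for chunk in pd.read_csv(csv, chunksize=3):
    total += chunk["value"].sum()
    count += len(chunk)

print(count, total / count)  # 9 rows, mean 5.0
```

Dask and Vaex essentially automate this pattern for you (and parallelise it), which is why they’re worth reaching for once your data outgrows a single DataFrame.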