Using vaex to do Python calcs with 30x speed up

Within finance, time series data is usually the bedrock of most analysis. If we’re using Python, what are the best ways to analyse this data? In this article, I discuss the various time series libraries available in Python. In particular, I focus on the vaex library for dealing with large time series datasets, and compare its speed with Dask.

 

Python libraries for working with time series

Below, I’ve listed a few libraries we can try in Python if we’re dealing with time series. Note it isn’t an exhaustive list, and there are lots of other time series style libraries which I haven’t included, such as Modin.

 

  • Pandas – This is the most popular time series library and I use it a lot! However, when your datasets are very large, you need to batch your calculations yourself, because Pandas holds everything in memory
  • Dask – This is a library for parallel computing with task scheduling. Its Dask DataFrames look like Pandas DataFrames to the user, but they can be much bigger than memory; underneath, Dask handles all the batching and constructs a computation graph for us
  • NumPy – This is the main library for working with arrays in Python. Whilst it isn’t designed purely for time series, we can use NumPy arrays to represent time series, and computations can often be quicker in pure NumPy than in Pandas
  • TensorFlow – Whilst TensorFlow is primarily a library for machine learning, the newest version has a NumPy-like interface, making it easy to use in place of NumPy. It can also target the GPU
  • Vaex – we’ll talk about that shortly…!

 

There are all sorts of tips and tricks we can use to speed up Python and the tools above, without having to resort to rewriting all our Python in another, faster language like C. We can, for example, use Numba (which can target the CPU or GPU) or Cython to speed up lots of numerical calculations. I tend to like using Numba and it’s a great tool, although it does often require some rewriting of your code; here’s a recent article I wrote about Numba. The trick is to identify the particular bottlenecks in your code, and spend your optimisation effort there.
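To give a flavour, below is a minimal Numba sketch (the function and array here are purely illustrative): decorating a plain Python loop with @njit compiles it to machine code the first time it’s called.

```python
import numpy as np
from numba import njit

@njit
def rolling_mean(prices, window):
    # A plain Python loop, compiled to machine code by Numba on first call
    out = np.empty(len(prices) - window + 1)
    for i in range(len(out)):
        out[i] = prices[i:i + window].mean()
    return out

prices = np.random.rand(1_000_000)
means = rolling_mean(prices, 20)  # the first call includes compilation time
```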

 

Run across more cores or on the cloud

We can also try to run our code on more cores. I’ve used Celery in tcapy, my open source transaction cost analysis library, to distribute computations on tick data across more cores (or it can be set up to compute across multiple machines). Dask also allows you to set up clusters for computation. However, ultimately, if we are doing this locally, we are likely to face limits in the amount of CPU and memory we have.
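As a quick illustration, here’s a minimal sketch of spinning up a local Dask cluster (the worker counts are arbitrary):

```python
from dask.distributed import Client, LocalCluster

# Start a local cluster with a few worker processes; Dask collections
# created afterwards will use it automatically
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)
print(client.dashboard_link)  # live dashboard for monitoring the tasks
```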

 

With the cloud, this type of scaling becomes easier. You can use serverless compute tools, such as AWS Lambda, which abstract away some of the complexity of managing multiple machines. Coiled is a service which allows you to easily run Dask clusters in the cloud.
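As a rough sketch (the worker count is arbitrary, and this assumes a Coiled account has already been configured), spinning up a cloud-hosted Dask cluster looks something like this:

```python
import coiled
from dask.distributed import Client

# Request a managed Dask cluster in the cloud; Coiled provisions the machines
cluster = coiled.Cluster(n_workers=10)
client = Client(cluster)  # subsequent Dask computations run on the cluster
```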

 

Alternatively, rather than relying on Python to do a lot of our numerical computations, we can use time series databases. These can do the heavy lifting for us, running the computations inside the database, and then return the results to Python. One example is kdb+/q, which is specifically designed for tick data; we’ll hopefully talk about it at another time!

 

So what’s vaex?

Like Dask, vaex is a Python based library that allows us to do computations on datasets that are too big to fit in memory. You can use vaex to query data in a Pythonic way, similar to how you would use Pandas or Dask. Hence, we don’t need to learn a separate database query language (e.g. q or k). It should be noted that the vaex API is slightly different from the Pandas API.
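As a quick taste of the API, here’s a minimal sketch (the file name and column names are purely illustrative):

```python
import vaex

# Opening an Arrow (or HDF5) file memory maps it, rather than reading it into RAM
df = vaex.open("ticks.arrow")

# Expressions create lazy virtual columns; nothing is materialised yet
df["mid"] = (df.bid + df.ask) / 2

# Aggregations stream through the data in chunks
print(df.mean(df.mid))
print(df[df.bid > 1.20].count())  # filtering is also lazy under the hood
```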

 

Vaex can load up data in Arrow format. Arrow is an open source format for storing columnar data in an efficient way. Just like kdb+/q, vaex won’t try to load your whole dataset from disk into memory. Instead, it is smart about which sections to read/write, and like kdb+/q, it memory maps the dataset from disk. However, unlike kdb+/q, vaex is fully open source, and it’s also easier to query if you are used to Python. If we want to query kdb+/q, we’ll have to learn q.

 

An article from AquaQ compares kdb+/q with vaex in more detail and also has some benchmarks. There’s also an article which introduces vaex with a focus on how it can speed up string operations.

 

How does vaex compare with Dask in a real life tick data example?

To try out vaex, I tested it with a simple example, which I describe below:

 

  • download top of book tick data (bid/ask) for EURUSD between 2005-2021 from the retail broker Dukascopy, into Parquet files split by month
  • convert the Parquet files into Arrow format (see the sketch after this list)
  • calculate the bid/ask spread from the tick data, and then take the average spread for each day in the history to plot
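To give a rough idea of the conversion step, here’s a minimal sketch using pyarrow (the file names are purely illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Read one month of tick data from Parquet...
table = pq.read_table("EURUSD_2020-01.parquet")

# ...and rewrite it in the Arrow IPC file format, which vaex can memory map
with pa.OSFile("EURUSD_2020-01.arrow", "wb") as f:
    with pa.RecordBatchFileWriter(f, table.schema) as writer:
        writer.write_table(table)
```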

 

The downloading of the data took a long time, at least a few hours, but luckily it only needed to be done once. I converted the Parquet files to Arrow files, because vaex tends to work better with Arrow format files, whilst Dask works fine with Parquet files.

 

The last calculation step was implemented in both vaex and Dask. The difference in computation time was huge: vaex was over 30 times quicker!

 

  • vaex, 10.9 seconds
  • Dask, 5 minutes 14 seconds
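To give a flavour of the two implementations, here’s a hedged sketch of the spread calculation in each library (this isn’t the exact notebook code; the file patterns and column names are assumptions, and it assumes the timestamps sit in a datetime column called Date):

```python
import vaex
import dask.dataframe as dd

# vaex: memory map the Arrow files; the spread is a lazy virtual column
df_v = vaex.open("EURUSD_*.arrow")
df_v["spread_bp"] = (df_v.ask - df_v.bid) / df_v.bid * 1e4  # spread in basis points
daily_v = df_v.groupby(vaex.BinnerTime(df_v.Date, resolution="D", df=df_v),
                       agg={"spread_bp": "mean"})

# Dask: the same logic against the Parquet files, triggered by compute()
df_d = dd.read_parquet("EURUSD_*.parquet")
df_d["spread_bp"] = (df_d["ask"] - df_d["bid"]) / df_d["bid"] * 1e4
daily_d = df_d.groupby(df_d["Date"].dt.floor("D"))["spread_bp"].mean().compute()
```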

 

Of course, there are ways to speed up Dask. We could have used a cluster, for one. But why spin up lots of cores, when vaex seems to do a much quicker job on one machine? It’s much cheaper in terms of computation cost, for one thing!

 

Note that all the Python code is in this Jupyter notebook here, including how to download the tick data, and the two versions of the numerical computation using vaex and Dask.

 

Conclusion

We discussed the various libraries which tend to be useful for time series computation. Whilst Pandas tends to be the first library you think of for time series in Python, libraries like Dask and vaex allow you to do computations in a transparent manner on datasets which don’t fully fit into memory. An alternative would be to do the computation in a time series database, like kdb+/q, and then output a summary to Python; that’s something we’ve discussed a lot in the past on this blog.

 

Our main focus was comparing the performance of vaex with Dask, using a simple example of calculating the bid/ask spread in basis points for EURUSD tick data between 2005-2021. The differential in speed was huge, with vaex taking just over 10 seconds, whilst the Dask example took over 5 minutes. Whilst this is just one example, it does suggest that vaex has a lot of promise if you want to do computations on large time series datasets fast! Dask probably has more out-of-the-box functionality compared with vaex (at this time), so it might take more time to write the equivalent code in vaex.