Speeding up tick data calculations in Python

Time, time, time. At the moment, many of us probably have a bit more time than usual. Despite this, it’s unlikely any of us actively want to wait longer for code to execute! Last week, I wrote about libraries for working with large datasets, like Dask and Vaex, and about using databases like kdb+/q. This week, I’ll continue the theme, given all the feedback and suggestions I received about that article. This time, the focus is on Python tools which can be useful for speeding up calculations, plus other tips and tricks for working with tick data, which aren’t necessarily Python specific (thanks to @ewankirk for suggesting several of the tick data tricks in reply to my original tweet).


Cython https://cython.org/
Python is an interpreted language, so it doesn’t need to be compiled before it runs. The flip side is that it tends to be slower than compiled languages. Cython allows you to “compile” some of your code. Essentially, you write special Python-like code that is converted into C and statically compiled down to machine code. You can also “release the GIL” with Cython, allowing true parallelization of your code. Many Python libraries, such as pandas, use Cython to speed up computation.


If you take the time to annotate your code with type declarations for Cython and rewrite it accordingly, that can speed it up further. How fast it will be depends upon how much of the Python code you’ve written can be converted by Cython. If you want to play around with Cython, it’s pretty easy to do so in Jupyter notebooks. All you need to do is add %%cython to the top of your code cell, and the code in that cell will be compiled.
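
As a flavour of what this looks like, here’s a minimal sketch (a toy function, not code from any particular library) of a typed Cython cell in a notebook, after running %load_ext cython in an earlier cell:

```python
%%cython
# Toy example: sum of squares. The cdef declarations give the loop
# C-typed variables, so Cython can generate a tight C loop instead of
# going through Python objects on every iteration.
def sum_squares(int n):
    cdef long long total = 0
    cdef int i
    for i in range(n):
        total += i * i
    return total
```

The same function without the type declarations would still compile, but keeping the hot loop fully typed is where most of the speed up tends to come from.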


Numba http://numba.pydata.org/
Numba is similar to Cython, in that it can convert Python code to machine code. However, it does the compilation at runtime, using LLVM. If you want Numba to compile your code in this way, you simply add a decorator to the function in question. As with Cython, however, it isn’t “magic”, and it may require some code rewriting to get the maximum benefit. You can also target your GPU for execution and write CUDA-like code with it, provided you’ve got an understanding of how to write code for GPUs, which can be somewhat confusing (GPU coding is still on my list of things to learn!).
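
As a quick sketch of the decorator approach (the function and data here are made up purely for illustration):

```python
import numpy as np
from numba import njit

# @njit asks Numba to compile this function to machine code in
# "nopython" mode the first time it is called.
@njit
def mid_price(bid, ask):
    out = np.empty_like(bid)
    for i in range(bid.shape[0]):
        out[i] = 0.5 * (bid[i] + ask[i])
    return out

bid = np.random.rand(1_000_000)
ask = bid + 0.0001
mid = mid_price(bid, ask)  # the first call includes compilation time
```

Note that the first call pays the compilation cost; subsequent calls run the cached machine code.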


General tips and tricks with tick data
Datasets consisting of tick data tend to be big! Storing them takes up a lot of space, loading them takes a lot of time, etc. In my tcapy library, an open source library for transaction cost analysis, the main bottleneck, and the one I spent the most time on, was loading market data, before doing much in the way of calculations. One tip is simply to run the data through a compression algorithm when storing it. That’s basically what I do when I cache market data in memory with tcapy. There is an overhead from compressing and decompressing the data, but often this will be less than the additional I/O from reading or writing large uncompressed datasets to disk or memory.
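
This isn’t tcapy’s exact implementation, just a minimal sketch of the idea using pandas and the standard library (zlib for compression, pickle for serialization):

```python
import pickle
import zlib

import numpy as np
import pandas as pd

# Dummy "market data" standing in for real tick data
df = pd.DataFrame({"mid": 1.1 + 0.0001 * np.random.randn(1_000_000)})

# Compress on the way into the cache...
compressed = zlib.compress(pickle.dumps(df))

# ...and decompress on the way out
df_roundtrip = pickle.loads(zlib.decompress(compressed))
```

In practice you might prefer a faster codec (e.g. lz4 or blosc) over zlib, since for in-memory caching the speed of the codec usually matters more than squeezing out the last few percent of compression ratio.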


We can also think about specific tricks for tick data. If we think about tick data on an exchange or similar, it is usually to a specific number of decimal places, so it can be recorded exactly. However, we might be storing this as a 32 bit or 64 bit float. In practice, most of the time, ticks will be repeated or will go up and down by 1 tick. Yes, there are times when it could have jump, but these are comparatively rare. We could therefore store the differences (ie. the deltas) between the prices using a byte for the vast majority of cases. For jumps or specific times (eg. first tick of a day), we can have a field of larger types which will only be populated on a sparse basis as reference prices. When reading the data, we can reassemble from the deltas and reference prices. We can also use the “delta” approach can also be used for timestamps, which are likely to be very closer together with tick data. I’m hoping to add some of these tips and tricks to my own tcapy open source library over time.
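
Here’s a sketch of the idea with NumPy (the tick size, prices and sentinel scheme are all made up for illustration, not how any particular library does it):

```python
import numpy as np

# Hypothetical prices quoted on a fixed grid of ticks (e.g. 0.0001)
TICK_SIZE = 0.0001
prices = np.array([1.1000, 1.1001, 1.1001, 1.1000, 1.1200, 1.1201])

# Work in integer ticks so the representation is exact
ticks = np.round(prices / TICK_SIZE).astype(np.int64)
deltas = np.diff(ticks)

# Small moves fit in a signed byte; rare jumps are flagged with a
# sentinel and stored exactly in a sparse table of reference prices
SENTINEL = np.int8(-128)
small = np.abs(deltas) < 128
byte_stream = np.where(small, deltas, SENTINEL).astype(np.int8)
ref_idx = np.nonzero(~small)[0] + 1   # positions of the jumps
ref_ticks = ticks[ref_idx]            # exact prices at those positions

# Decode: cumulative-sum the byte deltas from the first price, then
# shift everything after each jump so it matches the reference price
decoded = np.concatenate(([ticks[0]], byte_stream.astype(np.int64))).cumsum()
for j, ref in zip(ref_idx, ref_ticks):
    decoded[j:] += ref - decoded[j]

assert np.array_equal(decoded, ticks)
```

The same idea works for timestamps stored as integer nanoseconds, where consecutive deltas are typically tiny compared with the absolute values.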


Note that we’d need to be careful about such a “delta” approach when storing calculated data, which, unlike quoted prices, is often not exact.


Conclusions
We discussed some of the ways we can speed up code in Python, such as Numba and Cython. However, it is important to note that it often requires time to rewrite the code to maximise the speed up. How much time it’s worth spending depends upon the problem at hand. We also talked about some tips and tricks we can use on tick data more broadly to reduce its storage size.