How to speed up Python and TCA

Somewhat abusing a quotation by Dickens, coding is the best of times and also the worst of times. The worst of times are those hours spent debugging what appears to be some innocuous code that throws an exception for some totally inexplicable reason. When you find the problem, it is usually something incredibly trivial. Of course, something is always obvious after you’ve discovered it, but never beforehand.


In between the bugs, the frustration and the tears (admittedly, the last part of that tricolon has been used purely for exaggerated literary effect), there is the question of optimization. As Donald Knuth has noted, “premature optimization is the root of all evil”. The priority is making your code work. However, once it works, if it is very slow and is likely to be executed repeatedly, it might be opportune to ask how you can speed it up. At the same time, we also want our code to be readable, to make maintenance easier. Open sourcing a project forces you to think about making your code as elegant as possible, given you know lots of folks will be looking at it!


We recently open sourced tcapy, Cuemacro’s transaction cost analysis library (download it from GitHub). It is one of the first open source libraries for TCA. Most solutions tend to be closed source, and if you want to build your own internal TCA library it is likely to cost many hundreds of thousands of dollars, if we also count the maintenance costs. Use tcapy and save hundreds of thousands of dollars! Essentially, tcapy takes in large amounts of market tick data and your own trade/order data. It then calculates various statistics using a combination of this data, to tell you how much you are paying for your trading activity. It allows you to compare between different liquidity providers, trading styles, algos etc. We are faced with several time consuming steps:


  • IO intensive: Loading large amounts of data from disk is slow
  • Compute intensive: Making calculations on large amounts of data is slow
  • Compute intensive: Generating graphical output on large amounts of data is slow


We also face constraints. We don’t want to rely exclusively on in-database computation, as we want our solution to be flexible and to allow the use of open source tools. We do not want tcapy to be tied to one specific database. We can find databases which might be very fast, but they might also be expensive and closed source. If some of the dependencies are very expensive, it defeats the point of making tcapy open source and flexible, without vendor lock-in. Plus, we want to make sure that tcapy can fit in with whatever database infrastructure users already have. We also want to try to stick to Python, rather than having to rewrite tcapy in a different language (yes, I know C would be quicker!)


We can run a code profiling tool, such as the one in PyCharm, to identify which parts of the code are slowest and fastest. This is important, because we do not want to waste time optimising code in a way that makes us feel smart but has no real impact on execution time.
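If you don’t use PyCharm, Python’s built-in cProfile module gives similar information. The snippet below is a minimal sketch of the idea, not code from tcapy: slow_sum, fast_sum and workload are made-up functions, used purely so that the profiler has something to measure.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Deliberately slow: sum of squares with a pure-Python loop
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum(n):
    # Same result from the closed-form formula for 0^2 + ... + (n-1)^2
    return (n - 1) * n * (2 * n - 1) // 6


def workload():
    slow_sum(200_000)
    fast_sum(200_000)


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time and keep the ten most expensive entries
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)
report = buffer.getvalue()
print(report)
```

In the printed report, slow_sum should dominate the cumulative time column, which tells us where optimisation effort is likely to pay off.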


Loading large amounts of data

There are many reasons why loading a large amount of data from a database can be slow. Disk tends to be slow compared to memory, so we get an IO bottleneck. There might also be latency issues with a database. Fetching a massive time series from a database can lock it up for a long time. We can speed this up by cutting the request into multiple calls done in parallel and then stitching the results back together. We can also try to avoid hitting our database repeatedly, by using in-memory caching. For that we have used Redis extensively.
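To make the “cut up and stitch” idea concrete, here is a rough sketch using only the standard library. It is an illustration under assumptions rather than tcapy’s actual code: fetch_ticks is a hypothetical stand-in for a real database query (it just fabricates one tick per hour), and the one day chunk size is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta


def split_range(start, finish, chunk):
    """Split [start, finish) into consecutive sub-ranges of at most `chunk`."""
    ranges = []
    lower = start
    while lower < finish:
        upper = min(lower + chunk, finish)
        ranges.append((lower, upper))
        lower = upper
    return ranges


def fetch_ticks(lower, upper):
    """Hypothetical stand-in for a database query: one fake tick per hour."""
    ticks = []
    t = lower
    while t < upper:
        ticks.append((t, 1.10 + 0.0001 * t.hour))
        t += timedelta(hours=1)
    return ticks


def fetch_parallel(start, finish, chunk=timedelta(days=1), max_workers=4):
    ranges = split_range(start, finish, chunk)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so the chunks can simply be
        # concatenated without sorting them again
        parts = pool.map(lambda r: fetch_ticks(*r), ranges)
        return [tick for part in parts for tick in part]


ticks = fetch_parallel(datetime(2020, 1, 1), datetime(2020, 1, 4))
print(len(ticks), "ticks fetched")
```

Threads are a reasonable fit here because the work is mostly waiting on IO. Whether the database itself copes well with several simultaneous queries is something you would need to test.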


We used the Celery library to distribute our computations, which can run multiple workers on a single machine or a cluster of machines. However, this results in another problem. We need to minimise the amount of communication between the various workers and the main process. There’s a lot of overhead, because we need to serialise and deserialise objects between processes.


For market tick data, this overhead is especially problematic given the size of the datasets. We want to have a lot of control over the transfer of large datasets back and forth (and to minimise it). If we just rely on Celery, we may (actually, we will!) run into trouble when pickling large datasets. We can solve the problem by passing around “handles” between the various processes. These “handles” are basically like pointers to large datasets. Each of them points to the Redis key where the actual data is cached. Hence, these “handles” have tiny sizes in memory. We only extract the underlying dataset from the “handle” when absolutely necessary, so we minimise the amount of serialisation/deserialisation we do. If one of our workers is tasked with fetching the dataset from the cache, for computation later, we can just pass around the “handle” till the computation requires it. This is much better than repeatedly serialising/deserialising it.
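The “handle” pattern can be sketched in a few lines. This is a simplified illustration and not the tcapy implementation: a plain dict called CACHE stands in for Redis so the example runs anywhere, and the function names are hypothetical.

```python
import pickle
import uuid

# Stand-in for Redis: a plain dict keyed by string. In a real deployment
# this would be a Redis client (an assumption made for this sketch).
CACHE = {}


def put_in_cache(dataset):
    """Store the dataset once and return a small handle (the cache key)."""
    handle = "dataset:" + uuid.uuid4().hex
    CACHE[handle] = pickle.dumps(dataset, protocol=pickle.HIGHEST_PROTOCOL)
    return handle


def get_from_cache(handle):
    """Resolve a handle back into the dataset, only when it is needed."""
    return pickle.loads(CACHE[handle])


def load_market_data():
    # Worker 1: builds a large dataset but hands back only the handle
    ticks = [(i, 1.10 + i * 1e-6) for i in range(100_000)]
    return put_in_cache(ticks)


def pass_through(handle):
    # Intermediate step: never touches the data, so nothing large moves
    return handle


def compute_mean_price(handle):
    # Worker 2: the first place where the data is actually dereferenced
    ticks = get_from_cache(handle)
    return sum(price for _, price in ticks) / len(ticks)


handle = pass_through(load_market_data())
mean_price = compute_mean_price(handle)
print(f"handle is {len(handle)} characters, payload is {len(CACHE[handle])} bytes")
```

The handle is a short string, so passing it between workers costs almost nothing, whereas the payload it refers to is well over a megabyte in this toy case.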


We can also compress the market data significantly during serialisation, and decompress it during deserialisation. This reduces the amount of RAM we need for Redis and cuts down on network traffic when accessing Redis. Because the data sizes are vastly reduced, it also significantly speeds up the process of storing and retrieving the data in Redis. With a few extra tricks that we use when caching both market and trade data, we can also cut down the time it takes to fetch them, given we are effectively bypassing a (usually) slower database.
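As a rough illustration of compressing during serialisation, the sketch below pickles some synthetic ticks and compresses the bytes with zlib. The choice of pickle plus zlib and the data layout are assumptions made for this example only; tcapy may well use a different serialisation format and codec.

```python
import pickle
import zlib


def compress(obj, level=6):
    """Pickle an object, then compress the resulting bytes."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return zlib.compress(raw, level)


def decompress(blob):
    """Inverse of compress(): decompress, then unpickle."""
    return pickle.loads(zlib.decompress(blob))


# Synthetic ticks: (seconds since start, bid, ask)
ticks = [
    (i, 1.1000 + (i % 50) * 0.0001, 1.1002 + (i % 50) * 0.0001)
    for i in range(50_000)
]

raw = pickle.dumps(ticks, protocol=pickle.HIGHEST_PROTOCOL)
packed = compress(ticks)
restored = decompress(packed)

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

Compression costs CPU time, so the net gain depends on how compressible the data is and how slow the network or cache is. It is worth measuring both paths on your own data.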


We’ve also started to look at Apache Arrow and Plasma, which are supposed to help with sharing objects between multiple processes, reducing the overhead of serialising and deserialising those objects.


Making calculations on large amounts of data

Everyone who has coded in Python knows that using for loops is pretty slow for doing computation. You should vectorise your computation, using Pandas and NumPy. One simple trick to speed up Pandas-based calculations is to rewrite the code at a lower level using NumPy. We ended up doing this quite a few times, for example when calculating benchmarks such as TWAP (note that whether it is quicker or not depends on your specific use case and the size of the dataset).
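Here is a small sketch comparing a for loop with a vectorised NumPy version of a time-weighted average price. The definition used is an assumption made for illustration (each price is held until the next tick and weighted by that holding time), and it is not necessarily how tcapy computes its TWAP benchmark.

```python
import numpy as np


def twap_loop(times, prices):
    """Time-weighted average price using a plain Python loop."""
    weighted = 0.0
    total = 0.0
    for i in range(len(prices) - 1):
        dt = times[i + 1] - times[i]  # how long price i was in force
        weighted += prices[i] * dt
        total += dt
    return weighted / total


def twap_numpy(times, prices):
    """Same calculation, vectorised: no Python-level loop."""
    dt = np.diff(times)
    return float(np.sum(prices[:-1] * dt) / np.sum(dt))


times = np.array([0.0, 1.0, 3.0, 6.0])       # seconds
prices = np.array([10.0, 20.0, 30.0, 40.0])

print(twap_loop(times, prices), twap_numpy(times, prices))
```

On four points the difference is invisible, but on millions of ticks the vectorised version typically runs far quicker, because the loop happens inside NumPy’s compiled code rather than in the interpreter.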


We also want to explore other techniques to speed up our computation code, such as using Numba or Cython. In both cases, we’ll likely need to rewrite those very specific parts of the code where the number crunching is done, to get the maximum speedup (can’t we get a free lunch at least once?). Both Numba and Cython allow us to release the Python GIL (global interpreter lock), which makes it easier to parallelise code within a process. In experimentation, we’ve managed to get some speedup with Numba. However, the main bottleneck appears once we interact with a Pandas DataFrame (e.g. adding new columns). We’re also looking at NumExpr (or using pd.eval) to speed up arithmetic calculations, and have begun to use that. There’s also potential to use the Dask library, which is popular for working with large datasets.
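To give a flavour of what such a rewrite might look like with Numba, here is a minimal sketch. The function cumulative_notional is a made-up example and not part of tcapy. The try/except fallback is only there so that the snippet still runs (slowly) on a machine without Numba installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without Numba, just uncompiled
    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@njit(nogil=True)
def cumulative_notional(prices, sizes):
    """Running sum of price * size, written as an explicit loop.

    With Numba the loop is compiled to machine code, and nogil=True
    releases the GIL while it runs.
    """
    n = prices.shape[0]
    out = np.empty(n)
    running = 0.0
    for i in range(n):
        running += prices[i] * sizes[i]
        out[i] = running
    return out


prices = np.array([1.0, 2.0, 3.0])
sizes = np.array([10.0, 10.0, 10.0])
result = cumulative_notional(prices, sizes)
print(result)
```

Note that the function only takes and returns NumPy arrays. If the data lives in a DataFrame, one option is to extract the columns with to_numpy() before the call and to add the result back as a column afterwards, keeping the DataFrame interaction outside the compiled code.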


Generating graphical output on large amounts of data

One of the main points of tcapy is that it can generate cool graphics to illustrate our trades alongside market data, and all sorts of charts (e.g. showing the distribution of slippage). We use Plotly as our graphics engine, given it works very well with Dash, which is a great tool for creating interactive web dashboards. The difficulty is that generating the JSON which Plotly uses to render charts is slow when we have large datasets. We can reduce this overhead by resampling the market data, so we don’t plot every single point. We also ended up rewriting parts of Plotly, so that it generates the JSON needed for candlestick charts much more quickly. We also generated this JSON during the calculation in Celery, so it could be fetched later.
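The resampling step can be illustrated without any plotting library. The sketch below collapses raw ticks into one open/high/low/close row per time bucket, which is the shape of data a candlestick chart needs. It is a simplified, pure-Python illustration (to_candles is a hypothetical helper), whereas in practice you would more likely resample with Pandas.

```python
def to_candles(ticks, bucket_seconds):
    """Collapse (timestamp_seconds, price) ticks into OHLC rows per bucket.

    Assumes the ticks are already sorted by timestamp.
    """
    candles = []
    current = None
    for ts, price in ticks:
        bucket = ts - (ts % bucket_seconds)  # start time of this bucket
        if current is None or current["time"] != bucket:
            # First tick of a new bucket opens a new candle
            current = {"time": bucket, "open": price, "high": price,
                       "low": price, "close": price}
            candles.append(current)
        else:
            current["high"] = max(current["high"], price)
            current["low"] = min(current["low"], price)
            current["close"] = price
    return candles


# One synthetic tick per second for an hour, collapsed to one minute candles
ticks = [(s, 1.10 + (s % 7) * 0.0001) for s in range(3600)]
candles = to_candles(ticks, 60)

print(f"{len(ticks)} ticks reduced to {len(candles)} candles")
```

Here 3,600 points become 60 rows, so the chart JSON that has to be generated and sent to the browser shrinks accordingly.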



We discussed how we have optimised the tcapy library. In particular, we noted how we tried to do this under the constraint of avoiding in-database computation, so that our solution is database independent. This constraint was important, so that users are not forced to use any closed source databases. We then went through the ways in which we managed to speed up various parts of the TCA process.


We are still working on optimising the code, to try to figure out how to further alleviate the various bottlenecks, such as the serialisation/deserialisation process and the heavy duty calculation steps involved. If you have any ideas on how to speed up these various parts of tcapy, please do get in touch and help to contribute to our open source project!