Speed up Python data access by 30x & more

20170211 McLaren

Let’s say you send a letter from London to Tokyo. How long would it take to get a reply? At the bare minimum, it takes 12 hours for the letter to fly there and another 12 hours for a reply to fly back, so at least a day (and that’s ignoring the time it takes for your letter to be read, for a reply to be written, for it to be posted and so on). We could of course use a faster means of communication like the phone or e-mail. Whilst the delay would be much lower, it would still be at least a few hundred milliseconds.

Whenever you are analysing market data in Python, or indeed any other language, a lot of time is spent loading data, even before you do any computations or statistical analysis. Just as with our letter example, often the data you are trying to access is across a network, so it takes time to fetch it before you can put it into your computer’s RAM. The difficulty is that every time you change your Python code to alter your analysis, whatever you loaded into memory is lost once the script finishes running. So the next time you run it, you have to go through the process of loading up the data again, even though it’s precisely the same dataset. In my Python market data library findatapy, I’ve written a wrapper for arctic (my code here), which has been open sourced by Man-AHL. It basically takes in pandas DataFrames, which can hold market data, compresses them heavily and sends them to MongoDB for storage. Compressing the data reduces the amount of disk space MongoDB needs to store it. Also, because the compression and decompression happen locally on the client, the data travels over the network in compressed form, which takes a load off the network when it is sent to your computer.
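
To make the storage step concrete, here is a minimal sketch of writing and reading a DataFrame with arctic directly (findatapy wraps this kind of call behind the scenes); the MongoDB host, library name and symbol are purely illustrative:

```python
import pandas as pd
from arctic import Arctic

# Connect to MongoDB (assumed here to be running locally) and create a library
store = Arctic('localhost')
store.initialize_library('fx.minute')      # creates the library if it doesn't exist
library = store['fx.minute']

# Toy DataFrame standing in for 1 minute FX market data
df = pd.DataFrame({'mid': [1.0601, 1.0603, 1.0599]},
                  index=pd.date_range('2017-01-02 00:00', periods=3, freq='T'))

# arctic compresses the DataFrame (LZ4 by default) before writing it to MongoDB
library.write('EURUSD', df)

# Reading it back returns a versioned item; .data is the original DataFrame
df_read = library.read('EURUSD').data
```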

As a bit of an experiment I used my library findatapy (via arctic) to access 1 minute data from 2007 to the present day for 12 G10 FX crosses, which is stored on my MongoDB server. The output of this query amounts to around 40 million observations. The Python code also joins together all the time series and aligns them, which takes a bit of time. In total it took around 58 seconds to load all this FX data across my network and align it into a single dataset ready to be number crunched. My MongoDB setup is far from optimal, and the database I was accessing was across a wifi network rather than a wired gigabit network. If every time I rerun my Python script I have to go through this 58 second process to get a dataset, it’s going to seriously slow down the process of market analysis, which is often an iterative process.

Luckily, there are lots of tricks you can use to make this process faster. One solution is to cache the data in our local RAM in such a way that it will still be available even if we have to restart the process. We can use Redis to do this, which is a simple in-memory database (basically a key/value store). Once we’ve loaded up the data, we simply push it to Redis for temporary storage. Whenever we need it again, we just pull it from Redis (a minimal sketch of this push/pull follows the list of reasons below). When we fetch this large dataset via Redis, it takes under 2 seconds, nearly 30 times quicker! Why is it so much quicker? We list some reasons below.

  • We aren’t going across the network, which reduces latency
  • Given we are using our local computer’s RAM, it is going to be a lot quicker than loading from a hard disk (even an incredibly fast SSD is still much slower than RAM!)
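
To make the push/pull idea concrete, here is a minimal sketch using the redis-py client; the Redis host, key name and expiry time are just illustrative choices:

```python
import pickle

import pandas as pd
import redis

# Connect to a Redis server (assumed here to be running locally on the default port)
r = redis.StrictRedis(host='localhost', port=6379, db=0)

# Imagine this is the large, aligned FX dataset we just spent ~58 seconds building
df = pd.DataFrame({'EURUSD': [1.0601, 1.0603, 1.0599]},
                  index=pd.date_range('2017-01-02 00:00', periods=3, freq='T'))

# Serialise the DataFrame and push it into Redis, expiring it after an hour
r.set('fx_minute_aligned', pickle.dumps(df), ex=3600)

# Later (even from a freshly restarted Python process): pull it straight back from RAM
cached = r.get('fx_minute_aligned')
if cached is not None:
    df = pickle.loads(cached)
```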

We need to employ a number of tricks to get this to work in a practical way. RAM is a scarce resource, so we need to apply heavy compression to any DataFrame we send to Redis, just as arctic does when sending data to and receiving it from MongoDB. Compression does have some CPU overhead, but we can use a compression algorithm that runs on multiple cores. I’ve opted to use blosc for this compression, which is very fast. Redis can also be configured with a maximum memory size, and you can create rules on how to flush the cache over time. You can also opt to run Redis on a separate server. Yes, this will add some network latency, but if you have multiple users who make similar data calls, it could make sense to share the cache. Also, a big server is likely to hold more RAM than an individual user’s machine in any case.
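
A sketch of that compression step might look like the following; the compression level, codec and Redis settings shown are just reasonable starting points rather than tuned values:

```python
import pickle

import blosc
import numpy as np
import pandas as pd
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# A large numeric DataFrame standing in for the 40 million observation FX dataset
df = pd.DataFrame(np.random.randn(1000000, 12))

blosc.set_nthreads(4)                       # let blosc compress on several cores

raw = pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL)
packed = blosc.compress(raw, typesize=8, cname='lz4', clevel=5)

# Compare raw vs compressed sizes (real market data typically compresses far
# better than this random example)
print(len(raw), len(packed))

# Store the compressed bytes in Redis; on the server side, redis.conf settings
# such as "maxmemory 4gb" and "maxmemory-policy allkeys-lru" bound the cache
# and evict old entries
r.set('fx_minute_aligned_blosc', packed, ex=3600)

# Pull it back, decompress and deserialise
df = pickle.loads(blosc.decompress(r.get('fx_minute_aligned_blosc')))
```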

I’ve included a transparent way to use this Redis cache in findatapy, with minimal effort from the programmer: you just make the same call twice, and magically the second time it is called, it is a lot quicker! I’ve given a quick Python demo of how to use it in findatapy (see cache_example.py), when downloading financial data from Yahoo. In that example the speed up is over 1000x: the call to Yahoo over the internet takes over a second to download the data, versus around a millisecond to fetch it from Redis. It’s also worth noting that in findatapy I’ve added a large amount of parallelisation when making calls to external data sources like Yahoo or Bloomberg (you can tweak the number of threads used and also which Python library is used for threading), so it’s already much quicker than a simple call you’d usually make.
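
The pattern in cache_example.py is roughly as follows; I am sketching it from memory, so treat the ticker mappings and in particular the cache_algo values as assumptions and check the example script for the definitive call:

```python
from findatapy.market import Market, MarketDataRequest, MarketDataGenerator

market = Market(market_data_generator=MarketDataGenerator())

# First call goes out to Yahoo over the internet (roughly a second) and pushes
# the result into the Redis cache behind the scenes
md_request = MarketDataRequest(start_date='01 Jan 2016', finish_date='01 Jun 2016',
                               data_source='yahoo', tickers=['Apple'],
                               vendor_tickers=['AAPL'], fields=['close'],
                               vendor_fields=['Close'],
                               cache_algo='internet_load_return')   # assumed value
df = market.fetch_market(md_request)

# An identical second call is answered from Redis in around a millisecond
md_request.cache_algo = 'cache_algo_return'                         # assumed value
df = market.fetch_market(md_request)
```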

Of course, this is not the only thing you can do to speed up fetching data. Another approach is to use a database like kdb, which is a very fast in-memory database designed for time series data, and to do as much computation as possible in kdb (such as joining) before spitting the result out to your application to process; this also reduces the amount of data transferred over the network. I very much recommend the forthcoming Wiley book on kdb by the kdb gurus Paul Bilokon and Jan Novotny if you’re interested in finding out more about this, and I’ll definitely be reading it! There are also lots of tricks you can use to increase the speed of MongoDB, such as creating in-memory replicas of MongoDB (I’m far from an expert on this though!)
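
As a rough illustration of that idea, a sketch using the qPython client might look like this; the connection details, table and column names, and the q query itself are all assumptions for illustration:

```python
from qpython import qconnection

# Connect to a kdb+/q process (host and port are assumptions) and ask for
# pandas DataFrames back rather than qPython's own types
q = qconnection.QConnection(host='localhost', port=5000, pandas=True)
q.open()

# Let kdb do the heavy lifting server side (bucketing quotes into 1 minute bars),
# so only the much smaller result crosses the network
df = q.sendSync('select avg mid by sym, 0D00:01 xbar time from quote '
                'where date=.z.d, sym in `EURUSD`USDJPY')

q.close()
```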

It should be noted that the approach I’ve discussed has its limits, in that you can’t use it for absolutely massive datasets, which would overwhelm any reasonable amount of RAM, but for just about everything else it’s worth a look. I’ve found adopting this caching approach has improved my workflow considerably when developing trading strategies. Let me know if you have any questions about backtesting trading strategies or indeed about Python data analysis in general!