Making Python parallel: a few points

20180701 Barbican

If you want code to run super quick, you could write it in C++. If you want to code something super quick, you could write it in Python. Is there a middle ground where you can get Python code to execute quickly, without having to go through the process of writing everything in C++? One way to speed up code is to make it run in parallel. With Python, the way to make code truly parallel is to literally kick off new processes, to get around GIL (global interpreter lock) limitations. You can also write specific code bottlenecks in Cython, which is a subset of Python, which is then converted into C code and hence it can be compiled to run quickly. There’s also Numba, a LLVM, which can be used to speed up certain Python code. My colleague,¬†Shih-Hau Tan has done work to benchmark various computations using Numba and Dask (which can be seen as a distributed version of pandas). Below I’ve written a few points to consider when trying to make Python code parallel.


Which libraries can I use to make Python truly parallel?

The simplest library is Python’s threading library. Whilst this can speed up tasks which involve a lot of IO blocking, however, for those where we want to do truly parallel processing, as we mentioned, we need to use libraries which work across different processes. In Python these libraries include multiprocessing and several variants (which differ in the way they “pickle” objects inside each process – I have found multiprocess a bit more forgiving about the types of objects it will allow to be pickled). We can also use specific distributed task libraries, like celery or for pandas-like operations something like dask.


Sharing memory between Python processes

The difficulty with kicking off new processes is that you need to manage the transfer of information between different processes. The standard way to manage this, is through pickling. This involves serialisation of objects in the initial process, and this is then deserialised in the destination process. There is a computational cost in this serialisation/deserialistion process. Furthermore, you have to be careful what objects you try to pickle (particularly complicated objects can could real problems). If you are trying to do this with very large objects, you are likely to run into various limits and the computation overhead could cause problems. Alternatively, you can use other ways of sharing objects between process such as Redis, which is a very fast key/value store. Redis also has a lot of cool features! In short though, the best way to reduce the burden is to simply to reduce the amount of interprocess communication.


Reducing overhead: Apache Arrow and Plasma

We have noted that serialisation and deserialisation can take time. Wouldn’t it be easier if we could share information without this costly process. The Apache Arrow allows the sharing of certain objects without costly serialisation/deserialisation process done repeatedly (it specifically support zero copy reads) across different process. It also comes with Plasma, which is an in-memory store which can be shared between different processes. I haven’t personally used Apache Arrow or Plasma yet, but it is something that I am looking to investigate.


Before going parallel

Before making code parallel, it’s also worth asking can I use some tips and tricks to speed up my code. My main focus tends to be on time series calculations, and indeed most analysis of market data involves time series at some level, so I shall focus on that. For example, for loops tend to be very slow, when dealing with large time series. If we are dealing with time series, maybe we can get pandas to do an operation in a vectorised manner, which tends to be much quicker. We can also gain a further speed up if we use NumPy directly to manipulate matrices and vectors directly. Ok, the code will look a bit more messy, but in some instances, it can be worthwhile to do stuff at a NumPy level.