Making Python work with large datasets

There are only so many things you can keep in your head at any one time. As humans, our bandwidth is fairly limited, and we simply can’t get everything done on our own. The only way to get large-scale work done is to work with others, and management is a special skill in itself. There’s a comedy sketch with Morecambe & Wise, where Eric Morecambe plays the piano (not very well). When Andre Previn complains, Eric replies “I’m playing all the right notes. But not necessarily in the right order”. Just because you have the ability to play notes doesn’t mean you’ll make good music. In the same way, a highly skilled team will only produce something if management can coordinate it. The bigger the team, the more difficult that coordination becomes; indeed, being small enough to coordinate easily is one of the advantages a startup has.

Computers have a lot more bandwidth when it comes to solving data problems. If you work in finance, you often deal with time series, and in Python a library like Pandas is great for this. However, once your dataset becomes sufficiently big, Pandas starts to struggle and will eventually run out of memory. One answer is to split up your problem and batch process it.
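
As a rough sketch of that splitting approach (the file and column names here are placeholders, not from any real dataset), Pandas can read a large CSV in fixed-size chunks, reduce each chunk to a small partial result, and combine those partials at the end:

```python
import pandas as pd

partials = []

# Read the file one million rows at a time, instead of all at once
for chunk in pd.read_csv("trades.csv", chunksize=1_000_000):
    # Reduce each chunk to a small per-symbol partial sum
    partials.append(chunk.groupby("symbol")["volume"].sum())

# Combine the per-chunk partial sums into a single answer
total_volume = pd.concat(partials).groupby(level=0).sum()
print(total_volume)
```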

You can try parallelisation tools, like threading and multiprocessing, to help speed up the processing. You can also do your computations at a lower level, using NumPy. If you’re a bit more adventurous, then maybe try Cython, which lets you write Python-like code that compiles down to C, or Numba, a just-in-time compiler built on LLVM.
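
To give a flavour of Numba (a minimal sketch; the rolling-mean function and the random data are invented for illustration), you decorate an ordinary Python loop and Numba compiles it to machine code via LLVM the first time it is called:

```python
import numpy as np
from numba import njit

@njit  # compile this function to machine code via LLVM
def rolling_mean(values, window):
    out = np.empty(values.shape[0] - window + 1)
    for i in range(out.shape[0]):
        # An explicit loop like this is slow in pure Python,
        # but fast once Numba has compiled it
        out[i] = values[i:i + window].mean()
    return out

prices = np.random.rand(1_000_000)
print(rolling_mean(prices, 20)[:5])
```

The first call pays the compilation cost; subsequent calls run at compiled speed.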

If you want to do batch processing, but without worrying about all this splitting yourself, there are Python tools which can do it automatically. Two such tools worth a look are Dask and Vaex. Dask has data structures similar to those in Pandas and NumPy (Dask DataFrames and Arrays), but internally it splits up the computation, so you can deal with large datasets, perhaps several hundred GB in size. Dask cleverly does all the batching for you, as sketched below. Vaex is a similar tool, and also has some fun visualisation tools built in.
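
Here is a minimal sketch of the same kind of computation in Dask (again, the file and column names are placeholders). The API mirrors Pandas, but operations are lazy and only run, batched and in parallel, when you call .compute():

```python
import dask.dataframe as dd

# Lazily point at CSVs that may be far larger than RAM;
# Dask splits them into partitions behind the scenes
df = dd.read_csv("trades_*.csv")

# This only builds a task graph; nothing has been computed yet
volume_by_symbol = df.groupby("symbol")["volume"].sum()

# .compute() triggers the batched, parallel execution
print(volume_by_symbol.compute())
```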

If you’re interested in hearing more about this subject of making Python faster and parallel, come and hear me speak at The Thalesians in London on 25 Sep. I’ll be giving code demos of some of the Python libraries I’ve talked about above, and much more. Tickets are available here.