Libraries for large datasets in Python

People say the best way to learn something is to teach it. It’s obviously a bit of a cliché, but I have to admit it’s one of those things which really does ring true. When I’m researching the markets or coding something, I’m generally pretty focused on the task at hand. I’ll learn what I need to get the task done. However, when I’m teaching (or when I’ve been coauthoring The Book of Alternative Data), the objective is to go over a subject in a wide-ranging way, rather than in snippets here and there. Inevitably, the whole process of preparing a course or writing a book means I learn lots of things I haven’t yet picked up on the job.


Over the past few weeks I’ve been preparing a course on Python, alt data/NLP and large datasets, which I’ll be teaching at Queen Mary University of London (and if you’d like me to teach it at your firm, let me know!). I’m also going to be teaching a more general Python course, Python for finance, for QDC (sign up here). I recently tweeted about some of the Python libraries I’ve found useful for working with large datasets (in particular time series), and the thread got a lot of interest and plenty of great suggestions, so I’ve elaborated on it below. Next week, inspired by many of the replies, I’ll write another column on tips and tricks you can use to reduce the file size of large datasets and speed up their computation with tools like Numba or Cython, so stay tuned for that, including a few ideas which were tweeted by @ewankirk.


Pandas
For time series that fit in memory, Pandas is usually the conventional choice. If your dataset doesn’t fit in memory, you can try to rewrite your code to batch the computation, and this is often the first thing I’d try (see the sketch below).
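
As a rough sketch (the file and column names are made up for illustration), here’s how you might batch a daily average over a CSV that’s too big to load in one go:

import pandas as pd

# accumulate partial sums and counts per chunk, so days that span
# chunk boundaries are still averaged correctly at the end
sums, counts = [], []

for chunk in pd.read_csv("ticks.csv", parse_dates=["timestamp"],
                         chunksize=1_000_000):
    grouped = chunk.set_index("timestamp")["price"].resample("D")
    sums.append(grouped.sum())
    counts.append(grouped.count())

# combine the per-chunk results into one small Series
daily_mean = (pd.concat(sums).groupby(level=0).sum()
              / pd.concat(counts).groupby(level=0).sum())
print(daily_mean.head())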


Dask
Dask has its own version of DataFrames, which look like Pandas DataFrames. Very often your code will require very little modification to use Dask instead of Pandas. It allows you to process massive datasets that don’t fit in memory, and it does all the batching for you, creating an optimized computation graph. For example, you can give it a directory full of Parquet files and it treats them as one dataset. You can also run it on a cluster. Often I end up using Dask with Pandas.
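
For instance, something along these lines (paths and column names are again made up) lazily builds a task graph over a directory of Parquet files and only pulls the small result back into Pandas:

import dask.dataframe as dd

# point Dask at a whole directory of Parquet files and treat it as one DataFrame
df = dd.read_parquet("ticks/*.parquet")

# operations build a lazy task graph; nothing is loaded into memory yet
daily_mean = df.groupby(df["timestamp"].dt.date)["price"].mean()

# .compute() runs the graph in batches and returns an ordinary Pandas object
print(daily_mean.compute().head())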


Vaex 
This is a relatively new library, which also does out-of-core computation like Dask. It can be super quick at visualizing large datasets and is pretty smart about how it loads data from disk. It’s not as feature rich as Dask, and the syntax is slightly different to Pandas, but it can be good for quickly exploring datasets before you get stuck in. I was impressed at how fast it could plot the coordinates of the popular NYC Taxi dataset on a map.
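
As a rough illustration (assuming an HDF5 file and column names along the lines of the NYC Taxi dataset), you can bin millions of pickups onto a grid in a single out-of-core pass and then plot the result:

import vaex
import matplotlib.pyplot as plt

# memory-map a large HDF5 file rather than reading it all into RAM
df = vaex.open("yellow_taxi_2015.hdf5")

# count points on a 512x512 grid over roughly the Manhattan bounding box
counts = df.count(binby=[df.pickup_longitude, df.pickup_latitude],
                  limits=[[-74.05, -73.75], [40.6, 40.9]], shape=512)

plt.imshow(counts.T, origin="lower", cmap="viridis")
plt.show()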


PySpark
For massive datasets, I often hear about folks using Spark, which is an analytics engine that runs on the JVM. PySpark lets you access it via Python. You can do computation using SQL or using Spark DataFrames. There’s also Koalas, which has very Pandas-like DataFrames and is easier to use (although it doesn’t have quite as many features as Pandas yet). As with Dask, you can run it on a cluster. Later versions of PySpark come bundled with Spark, which makes it easier to set up.
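
Here’s a sketch of the same sort of daily aggregation in PySpark, either through the DataFrame API or plain SQL (paths and column names are made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ticks").getOrCreate()

# read a whole directory of Parquet files as one distributed DataFrame
df = spark.read.parquet("ticks/")

# DataFrame API
daily = df.groupBy(F.to_date("timestamp").alias("date")).agg(F.mean("price"))

# or register a temporary view and use SQL instead
df.createOrReplaceTempView("ticks")
daily_sql = spark.sql(
    "SELECT to_date(timestamp) AS date, AVG(price) FROM ticks GROUP BY to_date(timestamp)")

daily.show(5)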


Datatable
I haven’t used it, but it was suggested as a reply to my Twitter thread, and it does look promising. It’s modelled on R’s data.table library. It can also work on large datasets that don’t fit in memory, and it uses multithreading to speed up reads from disk. Underneath it has a native C implementation (including when dealing with strings) and takes advantage of LLVM. It will work on Windows from version 0.11 onwards.
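
Going by the documentation (so treat this as an untested sketch with made-up file and column names), reading and aggregating looks something like this:

import datatable as dt

# fread uses multiple threads to parse the file quickly
DT = dt.fread("ticks.csv")

# mean price per ticker, using datatable's DT[i, j, by] syntax
means = DT[:, dt.mean(dt.f.price), dt.by(dt.f.ticker)]
print(means)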


Databases
OK, this isn’t strictly speaking a Python library! However, using a database can also be an option for working with large datasets that don’t fit in memory. We can do computations inside the database and output smaller, processed datasets to use in Pandas (see the sketch below). SQLite is probably the easiest place to start if you’re looking for a very simple SQL database, and SQLAlchemy gives you a more Pythonic way to do SQL queries. For tick data, we might consider kdb+/q, which is ideally suited to columnar data, and there’s a wrapper, qPython, to access it from Python (as well as PyQ). If you just want to fetch tick data quickly (but not process it inside a database), Arctic/MongoDB is something I use often, and it’s open source too. PyStore is similar to Arctic, but stores the data in Parquet files.
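
As a simple sketch (the database, table and column names are made up), you can push the aggregation into SQLite and only bring the much smaller result back into Pandas:

import sqlite3
import pandas as pd

con = sqlite3.connect("ticks.db")

# the heavy lifting (grouping millions of rows) happens inside the database
query = """
    SELECT date(timestamp) AS date, AVG(price) AS mean_price
    FROM ticks
    GROUP BY date(timestamp)
"""

daily_mean = pd.read_sql_query(query, con, parse_dates=["date"])
con.close()
print(daily_mean.head())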


Conclusion
Dealing with large datasets is becoming increasingly commonplace, whether in financial markets or outside of them, and Python has many tools for dealing with them. I’ve focused mostly on libraries which have specific tools for dealing with DataFrames, but there are of course numerous other libraries you can use to distribute or parallelize more general computations, including Celery, which I’ve used extensively (that would probably need another blog write-up). Which of the tools I’ve talked about is best for you will largely depend on your use case. However, of all the tools I’ve talked about, I do think Dask in combination with Pandas is probably a good place to start.