Setting up a Python data science environment

It’s impossible to learn something like coding quickly from scratch. Even after a long time (30 years!) I keep on learning whenever I code, and I can imagine that in the decades ahead I’ll keep on learning how to improve. Whilst technologies keep on evolving, and every so often a new language comes into vogue, ultimately the principles of coding remain similar (e.g. what’s a variable, how do you design a framework, etc.).

 

When I teach Python, what’s the main thing people always ask about first? It’s nearly always how to set up a Python environment. I have to admit that setting up a Python environment can be fiddly at times. In the past, I have written in more general terms about how to set up a Python environment. Here I go into more detail about some tips I’ve picked up for setting up your own Python environment for working with financial data, and for data science more broadly, which I’ve learnt whilst putting together my own Python environments. In particular, I focus on how to troubleshoot problems (which will happen!).

 

I’ve open sourced the Python conda environments that I use for teaching my various Python for finance/alt data courses, and today I upgraded them to Python 3.8. They include various Cuemacro financial libraries, as well as popular data science libraries. You can download full instructions on how to install my data science Python environments for Windows, Linux and Mac OS X from my teaching GitHub site.

 

Try using Anaconda Python, conda (and mamba!)

The Anaconda distribution of Python comes with lots of packages as standard, which have been tested to work together. It also has the conda package manager, which allows you to install Python packages (just like the standard pip installer). However, some packages include more than Python code, and with pip they require additional installation steps. One example is blpapi, Bloomberg’s Python API, which I use a lot. Underneath it also has a C++ library and requires changes to the Windows path. If we install the conda package for blpapi, all these extra steps are done automatically.
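As a sketch, the difference looks something like this (the channel name is an assumption — check Bloomberg’s current installation instructions for the channel they publish to):

```shell
# With pip, you'd typically need to install Bloomberg's C++ SDK first and
# adjust the Windows PATH yourself before this works:
#   pip install blpapi

# With conda, one command pulls in the native C++ dependencies as well
# (channel name assumed here; Bloomberg's docs give the authoritative one):
conda install -c conda-forge blpapi
```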

 

There is also a newer project, mamba, which is a very fast drop-in replacement for conda, and which I tend to use. For Linux and Windows I’ve tended to use version 0.7.3 of mamba, and for Mac OS X version 0.4.2. I have tried other versions of mamba (and this might be something specific to my computer!), but conda tended to hang when trying to install them, so it’s best to stick with versions which work!
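Pinning mamba to a known-good version might look like this (versions taken from my experience above; adjust for your platform):

```shell
# Install mamba into the base environment from conda-forge, pinned to a
# version known to work (0.7.3 on Linux/Windows, 0.4.2 on Mac OS X)
conda install -n base -c conda-forge mamba=0.7.3

# Thereafter, "mamba" can generally be used wherever you'd use "conda":
mamba install pandas
```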

 

Resolving Python library version conflicts

One of the great things about Python is that there are lots of third-party packages to do all sorts of cool things with data. Equally, the bad thing is that there are so many third-party packages! You can often get version mismatches, where one library may not work with a particular version of another. This isn’t a problem unique to Python; it can happen in any language where you are using external libraries.

 

conda helps to resolve these versioning conflicts, and pip will also inform you about any conflicts. However, if you have many packages (and especially if you’ve pinned specific versions of many of the libraries you need), you can find that conda basically stalls: it can’t do anything because it can’t find any combination of packages that work with one another.

 

In this scenario, you need to find the problematic libraries and sometimes play around with the versions of the libraries you want installed. If you find that conda is freezing because of these version conflicts, it’ll sometimes give you output on which libraries are causing the issue; other times it’ll hang indefinitely.
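One low-tech way to hunt down the culprit is to bisect: ask the solver to resolve only a subset of your pinned packages, and keep halving the list until it succeeds or fails. A sketch (package names and versions here are purely illustrative):

```shell
# --dry-run asks conda to solve the environment without installing anything,
# so each bisection step is relatively quick
conda create -n test_env python=3.8 pandas=1.0.5 numpy --dry-run

# If that solves, add back more of your pinned libraries and retry;
# if it hangs or errors, remove or loosen pins until it solves
conda create -n test_env python=3.8 pandas numpy scipy --dry-run
```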

 

Here are some things to try when troubleshooting. You:

 

  • can consider installing more libraries using pip rather than conda, in particular simpler libraries that don’t have a complicated setup process
  • can also try downloading libraries from different conda channels (e.g. anaconda or conda-forge) to see if that helps
  • can selectively exclude libraries which could be causing issues from your conda installation
  • need to avoid the situation where the same library (usually Pandas or NumPy) ends up being installed by both pip and conda with different versions, which will cause you pain, pain (and even more pain)
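For the last point, it’s worth checking whether pip and conda disagree about what they think is installed. Something like the following (using Pandas as the example):

```shell
# Ask each package manager which pandas it sees
conda list pandas
pip show pandas

# If both report pandas but at different versions, you likely have a
# duplicate install; uninstall one copy and let a single manager own it, e.g.
#   pip uninstall pandas
```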

 

If you’ve got any tips or tricks for troubleshooting this, please do let me know too! There are situations where you might prefer to install everything using pip, particularly if you want a lightweight Python installation for production purposes (e.g. if you are using AWS Lambda, Docker containers, etc.). The downside is that you might need to spend a bit more time installing more complicated Python libraries which have external non-Python dependencies.

 

Conda environments, virtualenv and Docker

One thing I’d really recommend is using conda environments if you are using Anaconda, or virtualenvs if you are using another variant of Python. These enable you to create multiple environments with different libraries, and even different versions of Python, on the same machine. If you mess up any of the environments, you can just delete it and start again. Generally speaking, your base environment should be kept relatively clean, rather than installing everything into it: if you break your base environment, you’ll have to reinstall Python again. With conda environments, you can also save them down in YAML format and reproduce them elsewhere. With pip, it’s possible to “freeze” the environment and save it down to a requirements.txt file.
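The save-and-reproduce workflow is only a few commands (the environment name “myenv” is illustrative):

```shell
# Save a conda environment to a YAML spec
conda env export -n myenv > environment.yml

# Recreate it on another machine from that spec
conda env create -f environment.yml

# The pip equivalent, from inside an activated virtualenv
pip freeze > requirements.txt
pip install -r requirements.txt
```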

 

If you want to go one stage further, you can Dockerize your conda or virtualenv environment, combined with any other external dependencies (e.g. databases etc.), to make it easy to deploy your setup anywhere. For my teaching, I’ve provided conda environment.yml files as well as batch scripts for installing the libraries.
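A minimal Dockerfile along these lines might look like the following sketch (the base image tag, environment name and file name are assumptions, not taken from my teaching repo):

```dockerfile
FROM continuumio/miniconda3

# Copy the exported environment spec and build the environment inside the image
COPY environment.yml .
RUN conda env create -f environment.yml

# Run subsequent commands inside the environment (named "myenv" here)
SHELL ["conda", "run", "-n", "myenv", "/bin/bash", "-c"]
```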

 

Conclusion

Getting the right Python environment installed can be confusing for beginners, and even if you are more experienced, it can be problematic when you want to use lots of libraries. However, with a bit of persistence, you can create a robust environment for your data science work.