Data scientists and coding

Data scientists code. There’s nothing controversial about this statement. Without coding, you’d be limited to tools such as Excel for analyzing data. With coding, however, we have access to libraries such as pandas for working with time series and scikit-learn for machine learning, which we can use to build models. A data science team will have other key coders too, in particular data engineers, who manage the data pipeline, data storage and so on.

When it comes to moving the models developed by data scientists into production, there will be a lot of interaction with data engineers to help productionize the code. The key question is what we should expect of the code written by data scientists. Ultimately, a data scientist’s main role is to analyze data, drawing on skills across many different areas (coding, stats, domain knowledge etc.), rather than to be an expert developer who understands topics that are a key part of a data engineer’s skillset, such as distributed systems. We can’t be experts at everything!

However, it’s still beneficial for data scientists to adopt certain elements of the software engineering toolkit from the very start of the research process. I’m sure many data scientists already follow these suggestions. Good coding practices should not be the realm of data engineers alone in a data science team.

Using version control seems like a good start, so that changes can be tracked easily. Another is refactoring the code throughout the research process. The start of the research process might involve doing some work in a Jupyter notebook to analyze data.

As we go through the research stage, refactoring the code can be beneficial, taking code out of the Jupyter notebook and into separate Python modules. Well-written code makes it quicker for data engineers to productionize it and create a data pipeline. If a data scientist’s code “works” but is poorly written, it can be more difficult to know what’s going on and for others to make changes in the production process. For example, the code:

  • is not written in a modular fashion, and has functions that do multiple tasks
  • has very few comments
  • has poorly chosen variable names
  • has copied and pasted code everywhere
  • has no unit tests
  • has no abstractions
  • has all the code in Jupyter notebooks and no common code factored out into a common library
  • etc. 
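
To make a few of these points concrete, here is a minimal sketch of what code factored out of a notebook into a Python module might look like. Everything in it is an illustrative assumption on my part rather than a prescribed approach: the module name returns.py, the function names, and the default of 252 periods per year are all hypothetical. It uses only the standard library so that it runs anywhere.

```python
"""returns.py: a hypothetical module factored out of a research notebook."""
import math
import statistics
from typing import List, Sequence


def simple_returns(prices: Sequence[float]) -> List[float]:
    """Return period-on-period simple returns for a price series."""
    if len(prices) < 2:
        raise ValueError("need at least two prices to compute a return")
    # Pair each price with the one that follows it.
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


def annualized_volatility(
    returns: Sequence[float], periods_per_year: int = 252
) -> float:
    """Scale the sample standard deviation of returns to an annual figure.

    The default of 252 periods per year is an assumption for illustration.
    """
    if len(returns) < 2:
        raise ValueError("need at least two returns to estimate volatility")
    return statistics.stdev(returns) * math.sqrt(periods_per_year)


def test_simple_returns() -> None:
    """A small unit test, runnable with pytest or by calling it directly."""
    returns = simple_returns([100.0, 110.0, 99.0])
    assert math.isclose(returns[0], 0.10)
    assert math.isclose(returns[1], -0.10)
```

Each function does one task, has a descriptive name and a docstring, and can be imported by several notebooks instead of being copied between them. The test gives anyone who later changes the code a quick way to check that it still behaves as intended.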

 

Data scientists aren’t expected to be experts at everything. However, if they adhere to good coding practices, the move to production becomes easier when their code is handed over to data engineers. This may require training and education, but it will be worth it.