Machine learning for data cleaning

Every data scientist wants to do the fun stuff, i.e. building models! You might be forecasting a variable. Perhaps you are classifying an observation into a specific group. Machine learning is usually part of this, whether it’s a simpler linear model or something more complicated that can take non-linear relationships into account, from an SVM all the way to a deep learning model.

However, before you get to the stage of being able to fit a machine learning model, you need your data to be properly prepared and cleaned. If you haven’t gone through the laborious process of cleaning and preparing it, then your model is going to have problems (no matter how fancy it is). If we want high-quality outputs from a model, we need to provide it with high-quality data.

In practice, machine learning can be used heavily when preprocessing and cleaning a dataset. At Turnleaf Analytics, which Alexander Denev and I cofounded earlier this year to forecast inflation using machine learning and alternative data, a lot of our time has been spent sourcing, cleaning and preparing the dataset. Of course, some of this involves laborious manual work. Whilst a lot of the focus on machine learning is on the modelling stage and amazing models such as DALL-E for generating images from a text input, what would also be useful is for machine learning to help us prepare and clean a dataset (see tweet below).

DALL-E this and Stable Diffusion that, but we still can't copy and paste text directly from PDFs and keep the formatting.
— Vicki (@vboykis) August 28, 2022

As we wrote in The Book of Alternative Data, we can use machine learning techniques to identify outliers in data. They can also be used to help impute missing data. It is possible to parse PDFs and get the text, although, as the tweet above implies, it can be difficult to keep the formatting. Even here, though, there are lots of libraries that can, for example, grab tables from PDFs using machine learning models. Indeed, in computer vision more broadly, machine learning models are often preferred these days to rule-based models. Of course, it does take a bit of trial and error to get these libraries to grab tables, and how complex the process is really depends upon the formatting of your original dataset. For us, being able to read PDFs in an automated way has been very important, because at Turnleaf Analytics many of our datasets were originally in PDF format, from sources like national statistical agencies and central banks.
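To make the imputation idea a bit more concrete, here is a minimal sketch of k-nearest-neighbours imputation, written with only the Python standard library. It is purely illustrative and not the approach we use at Turnleaf Analytics: the function name `knn_impute`, the toy `rows` data and the choice of `k=2` are all assumptions made for this example. In practice you would more likely reach for a library implementation, such as `KNNImputer` in scikit-learn.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Hypothetical helper for illustration only. Distance between two rows is
    the root mean squared difference over the columns both rows have observed.
    """
    filled = [list(r) for r in rows]  # copy, so the input is left untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbour must be a different row with column j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            if candidates:
                candidates.sort(key=lambda c: c[0])
                nearest = candidates[:k]
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


# Toy data (made up): the second column is missing in the third row.
rows = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, None],
    [4.0, 40.0],
    [10.0, 100.0],
]
filled = knn_impute(rows, k=2)
print(filled[2])  # the gap is filled from the two closest rows
```

The point of using neighbours rather than a simple column mean is that the filled value is driven by the most similar observations, so an extreme row (like the last one in the toy data) does not distort the estimate.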

So, yes, machine learning can be extremely useful when we are trying to fit a model for forecasting, whether that’s inflation or any other variable (or if you want to do something DALL-E-esque!). However, machine learning can be just as useful earlier on in the process, when it comes to cleaning and preprocessing our data, even if perhaps this step isn’t the “fun” part of data science.

by Saeed Amen • August 29, 2022