It’s all about the data

I used to be an avid Formula 1 fan. I do still watch occasionally, but admittedly not as much as I once did. There’s only one driver in the car, so on a superficial level it doesn’t appear to be a team sport, and at every race weekend the spotlight is on the driver. Each team has two drivers, and in many instances they will race each other, although team orders can still apply. However, Formula 1 is perhaps more of a team sport than it might appear. Building the car is a massive team effort, involving engineers, data scientists and many others. At every race, mechanics take care of the cars behind the scenes, and in front of the cameras they carry out the pit stops.


Several months ago, Alexander Denev and I co-founded Turnleaf Analytics, which focuses on inflation forecasting using machine learning and alternative data. As you might expect, the problem of inflation forecasting is complex and there are many parts to it. Whilst the specifics do vary, there are common steps to solving many financial data science problems. The “glamorous” bit is creating a forecasting model. This can include, for example, deciding which technique you’ll use, whether that is linear regression, ridge regression or something else. It is also important to understand whichever model you are using and whether it is suited to the problem at hand. We also obviously need to have a handle on many ideas from statistics, such as the difference between a training set and a test set.
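To make the modelling choices above a little more concrete, here is a minimal sketch of a chronological train/test split and a one-variable regression, where `alpha=0` gives ordinary least squares and `alpha>0` gives ridge regression. The function names and the numbers are invented purely for illustration, and this is not a description of any production model.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows into (train, test) without shuffling,
    so the test set always comes after the training set in time."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def fit_line(xs, ys, alpha=0.0):
    """Fit y ~ intercept + slope * x.

    alpha = 0 is ordinary least squares; alpha > 0 adds a ridge penalty
    that shrinks the slope towards zero (the intercept is not penalised).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(xs, ys, intercept, slope):
    """Average squared forecast error on a set of points."""
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Invented, time-ordered (feature, target) pairs, purely for illustration.
data = [
    (1.0, 3.1), (2.0, 4.8), (3.0, 7.2), (4.0, 9.1), (5.0, 10.8),
    (6.0, 13.3), (7.0, 14.9), (8.0, 17.2), (9.0, 18.8), (10.0, 21.1),
]

train, test = chronological_split(data)
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

for alpha in (0.0, 10.0):
    intercept, slope = fit_line(train_x, train_y, alpha)
    mse = mean_squared_error(test_x, test_y, intercept, slope)
    print(f"alpha={alpha}: slope={slope:.3f}, test MSE={mse:.3f}")
```

The split is deliberately not shuffled. With time series, a random split would let the model train on observations that come after the ones it is tested on.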


However, the question of which model we might use often comes later in our pipeline. Models need data to work! Before we get on to the model, we need to curate our data and clean it. We might ask questions like: which datasets are we going to use, and are there particular variables that are going to be of interest? How are we going to clean the data? Are there missing data points or outliers? How will we deal with them? The data stage is probably the “unglamorous” bit, and it is also extremely time consuming. However, it’s just as important as the “cool” stage, when we are actually building the models, selecting which algorithm to use and so on. With poor quality data, you’re going to make life much more difficult for your model in the later stages of the pipeline.
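As a rough illustration of the cleaning questions above, the sketch below flags outliers with a modified z-score (based on the median absolute deviation) and then forward-fills the gaps. The series is invented, the helper names are hypothetical, and the choices made here (a threshold of 3.5, forward-filling rather than interpolating or dropping points) are assumptions that would need revisiting for any real dataset.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return one boolean per value: True where the modified z-score,
    0.6745 * |x - median| / MAD, exceeds the threshold.
    Missing points (None) are never flagged."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    mad = median(abs(v - med) for v in observed)
    if mad == 0:
        # No spread in the data, so nothing can be called an outlier.
        return [False] * len(values)
    return [
        v is not None and 0.6745 * abs(v - med) / mad > threshold
        for v in values
    ]


def forward_fill(values):
    """Replace each None with the most recent earlier observation.
    Leading Nones stay as None because nothing precedes them."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Invented series with one gap and one suspicious spike.
series = [2.1, 2.2, None, 2.3, 25.0, 2.4, 2.5]

flags = flag_outliers(series)
# Treat flagged points as missing, then fill every gap in one pass.
without_outliers = [None if f else v for v, f in zip(series, flags)]
cleaned = forward_fill(without_outliers)
print(flags)
print(cleaned)
```

Flagging outliers before filling matters here: filling first would risk copying a bad value forward into the gaps that follow it.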


It’s also at the data stage, when we are selecting and cleaning the data, that we need a specific understanding of the problem at hand and some familiarity with which datasets are available and their quirks. With economic data, for example, we need to deal with multiple timestamps, such as the release date and the date for which the data was collected. We need to understand which types of variables are likely to be important as inputs to our financial forecast. Different markets will often have different drivers, and the same is true of economic data. The data we use for high-frequency trading (HFT) is not going to be the same data we use in other scenarios. If we have no domain expertise, it is going to be difficult to know which financial data to start with, no matter how many machine learning techniques we understand. If we need our model to backtest a trading strategy, we need to understand total returns; otherwise, our results could be totally wrong.
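The point about multiple timestamps can be sketched as a “point-in-time” lookup: for a given forecast date, we only use figures whose release date has already passed, regardless of which period they describe. The `Release` class, the `latest_known` helper and all the dates and values below are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Release:
    reference_period: date  # the period the figure describes
    release_date: date      # when the figure became publicly available
    value: float


def latest_known(releases, as_of):
    """Return the most recent figure that was already published on
    `as_of`, or None if nothing had been released by then."""
    known = [r for r in releases if r.release_date <= as_of]
    if not known:
        return None
    # Prefer the latest reference period, then the latest release of it.
    return max(known, key=lambda r: (r.reference_period, r.release_date))


# Invented releases: each figure is published two weeks after month end.
RELEASES = [
    Release(date(2020, 1, 31), date(2020, 2, 14), 1.0),
    Release(date(2020, 2, 29), date(2020, 3, 13), 1.2),
]

# On 1 March the February figure exists in hindsight, but it had not
# been published yet, so a forecast made that day can only see January.
print(latest_known(RELEASES, date(2020, 3, 1)))
```

Indexing the data by reference period alone would quietly give a backtest information that was not available at the time.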


Forecasting inflation, or indeed many other economic or market variables, isn’t easy. Of course, choosing the right model and understanding it is a key part of the process. However, having a good quality dataset that has been curated and cleaned properly is just as important, because it is what feeds into the model itself. Otherwise, we are basically the Formula 1 driver turning up at a race without a car to drive!