How to judge a forecast

Let’s say I go to a burger joint and want to forecast how good the burger will be. First, I need a bunch of input data. This can include observations derived from questions such as: how many people are in the restaurant? How much is the burger? What are the ingredients? Do the burgers coming out of the kitchen seem good? I might even ask some of the diners whether the burgers are good (although this might get me thrown out of the restaurant). I could look at TripAdvisor too. If I know burger experts, let’s call them G. Ramsey and B. King, I could ask them as well. Ultimately, all this would go into my mental model, and I’d decide whether or not to buy a burger based on all this input data. Obviously, the true test would be trying the burger myself, but I can’t possibly try several hundred burgers in a city (although I’m going to try…)


It’s not just when we’re choosing a burger joint that we want to be able to forecast. If we are trading financial markets, we are continuously making forecasts (whether explicitly or not) about the economy, companies, assets etc. One of the big questions that involves forecasts is trying to understand how good a forecast is. It’s a constant and very important question at Turnleaf Analytics, the firm Alexander Denev and I cofounded to do economic forecasting using machine learning and alternative data.


How can we judge our own model forecasts? The first thing to note is that we need a lot of forecasts to make any judgement. It is not representative to pick out a single forecast and assess it in isolation. It’s like trying to judge a trader based on one single day in the year, ignoring every other day! Ideally, you’d like to have many years of forecasts. We started publishing to clients in May 2022. Whilst we do not have, say, 10 years of live forecasts, we do have a large cross section of live forecasts over many countries and forecast horizons. We have done over 2,000 live forecasts, which is a decent enough sample size to make statistical observations. We can of course backfill our history with a backtest of our latest model, although in general live/out-of-sample data is preferred for making evaluations. One thing we can also do is split up our analysis into backtest vs. live.
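Splitting the history into backtest and live samples can be as simple as partitioning on the date live publication began. A minimal sketch, using entirely hypothetical dates and errors (the May 2022 start date is from the text; everything else is made up):

```python
from datetime import date

# Split a forecast history into backtest vs live samples so the two can be
# evaluated separately. Dates and error values below are hypothetical.
LIVE_START = date(2022, 5, 1)  # when live publication began

history = [
    (date(2021, 6, 1), 0.3),   # (forecast date, absolute error)
    (date(2022, 1, 1), 0.2),
    (date(2022, 8, 1), 0.4),
    (date(2023, 2, 1), 0.1),
]

backtest = [err for d, err in history if d < LIVE_START]
live = [err for d, err in history if d >= LIVE_START]
print(sum(backtest) / len(backtest), sum(live) / len(live))
```

Keeping the two samples separate makes it easy to check whether the live performance is consistent with what the backtest suggested.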


Once we have a large enough sample of historical forecasts to analyse our models, we need to decide what sort of metrics to compute and, furthermore, what sort of benchmark to compare our own model forecasts against. One of the simplest metrics is mean absolute error (MAE), although we can use others such as root mean squared error. We can compute this for all our model-based forecasts. For comparison, we can compute the MAE for benchmark forecasts derived from other sources, such as central banks, surveys etc.
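The computation itself is straightforward. A minimal sketch, using made-up CPI prints and forecasts purely for illustration:

```python
# Compare mean absolute error (MAE) of model forecasts against a benchmark.
# All numbers are hypothetical, for illustration only.
actual = [3.1, 2.9, 3.4, 3.0, 2.8]       # CPI prints, % YoY
model = [3.0, 3.1, 3.3, 2.9, 2.9]        # model forecasts
benchmark = [3.4, 2.5, 3.0, 3.3, 2.5]    # consensus/benchmark forecasts

def mae(forecasts, actuals):
    """Mean absolute error between forecasts and realised values."""
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(actuals)

print(mae(model, actual))      # model MAE
print(mae(benchmark, actual))  # benchmark MAE
```

Swapping `abs(f - a)` for `(f - a) ** 2` (and taking a square root at the end) gives root mean squared error instead, which penalises large misses more heavily.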


We need to compare like with like. If we are looking at our model forecast for, say, 12M ahead, it would not be a fair comparison to compare that figure to a short term consensus reading published a few days before the release of a CPI print. Clearly, only a few days before a CPI print, there is a vast amount of data available to input into a forecast which would not have been available a year earlier.
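In practice this means being careful to score each forecast only against the print it was actually targeting. A sketch of that alignment, with hypothetical dates and values:

```python
from datetime import date

# Align 12M-ahead forecasts with the prints they target, so a forecast made
# in month t is only ever scored against the release 12 months later.
# Dates and values below are hypothetical.
forecasts_12m = {date(2023, 1, 1): 3.5, date(2023, 2, 1): 3.3}  # made-on date -> forecast
prints = {date(2024, 1, 1): 3.1, date(2024, 2, 1): 3.0}         # target month -> actual print

def target_month(made, horizon_months=12):
    """Month the forecast refers to, horizon_months after it was made."""
    m = made.month + horizon_months
    return date(made.year + (m - 1) // 12, (m - 1) % 12 + 1, 1)

errors = {
    made: abs(fc - prints[target_month(made)])
    for made, fc in forecasts_12m.items()
    if target_month(made) in prints
}
print(errors)
```

The same alignment logic lets you compute separate error statistics per horizon (1M, 3M, 12M…), rather than mixing horizons together.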


By contrast, if we took our own model nowcast, which we publish for major countries like the US, it would be fine to compare it to a short term consensus forecast, given that both would be published at comparable times. Perhaps unsurprisingly, as you attempt to forecast over longer horizons, the error increases. In the very short term, errors tend to be much smaller.


So what happens when we crunch the numbers on MAE for our models and the benchmarks? In live publishing, the MAE of our models is around 20-30% lower than that of our benchmarks. Around 63-64% of the time, our models have been closer to the actual prints than our benchmarks.
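That second statistic, the share of releases where the model beat the benchmark, is a simple hit rate. A sketch of the calculation, again with made-up numbers:

```python
# Fraction of releases where the model forecast was closer to the actual
# print than the benchmark. All numbers are hypothetical.
actual = [3.1, 2.9, 3.4, 3.0, 2.8]
model = [3.0, 3.1, 3.3, 2.9, 2.9]
benchmark = [3.3, 3.0, 3.6, 3.2, 2.75]

def hit_rate(model_fc, bench_fc, actuals):
    """Share of observations where the model beats the benchmark."""
    wins = sum(
        abs(m - a) < abs(b - a)
        for m, b, a in zip(model_fc, bench_fc, actuals)
    )
    return wins / len(actuals)

print(hit_rate(model, benchmark, actual))
```

A hit rate is worth reporting alongside MAE because the two can disagree: a model can win most of the time but lose badly on its misses, or vice versa.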


Whilst it is fairly easy to calculate metrics such as mean absolute error, you might ask whether there are other ways to judge the effectiveness of our forecasts. If you are a trader, effectiveness basically means being able to monetise a forecast through trading. Some forecasts can be super accurate but can’t be monetised! I am pretty good at forecasting whether a burger will be good, but that will never help me trade USD/JPY!


If you are trading inflation swaps, the payoff is directly tied to the actual CPI print. However, many market participants do not trade inflation instruments, even though their markets can be heavily impacted by inflation. Hence, we would conjecture that an inflation forecast is still something they could monetise. There are several ways to do this.


One way is to add our inflation forecast data into a trader’s existing asset-forecasting models. Does our data help to improve their own forecasts? However, this might not necessarily capture the nuances of the data being added. Another, somewhat more difficult, approach is to craft a trading strategy built specifically for the dataset in question, using a bit of economic intuition. We’ve tried this approach for a number of assets. We’ve written research on lower frequency systematic trading rules for FX and bond futures using our inflation forecast data, built specifically on the relationship between inflation and monetary policy.


More recently, we wrote a paper looking at shorter term trading rules for macro assets around US CPI releases, using our US CPI nowcast as an input. Perhaps unsurprisingly, we found a strong relationship between US CPI and a multitude of macro assets, not just US inflation swaps and breakevens. EUR/USD, USD/JPY and US 10Y Treasury futures moved significantly on any US CPI surprise, in the direction we might expect from intuition (obviously, we don’t know the surprise in advance…).


In other words, when inflation comes in higher than expected, the USD appreciates and US 10Y futures sell off; when inflation is softer, the USD sells off and US 10Y Treasury futures are bid. We created a trading rule for these various macro assets using our US CPI nowcast, which had risk adjusted returns of around 0.9 (and if we were able to use hindsight to ascertain the surprise exactly, this figure would rise to nearly 1.4, but alas we can’t!).
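The basic shape of such an event rule can be sketched in a few lines: take a position in the direction of the expected surprise (nowcast minus consensus) and hold over the print. This is a toy illustration with invented numbers, not the rule or the results from the paper:

```python
import statistics as stats

# Toy event rule: long USD if our nowcast is above consensus, short if below,
# flat if they match. Nowcasts, consensus and returns are all hypothetical.
nowcast   = [3.2, 3.0, 2.9, 3.3, 3.1, 2.8]     # our CPI nowcast, % YoY
consensus = [3.0, 3.1, 2.9, 3.1, 3.2, 2.9]     # published consensus
usd_ret   = [0.4, -0.3, 0.1, 0.5, -0.2, -0.3]  # USD return over the print, %

def sign(x):
    return (x > 0) - (x < 0)

# +1 long USD on a positive expected surprise, -1 short on negative, 0 flat.
positions = [sign(n - c) for n, c in zip(nowcast, consensus)]
pnl = [p * r for p, r in zip(positions, usd_ret)]

# Risk-adjusted return per event: mean over standard deviation of event P&L.
risk_adj = stats.mean(pnl) / stats.stdev(pnl)
print(positions, risk_adj)
```

For bonds the sign would flip (a positive surprise means selling 10Y futures), and a hindsight version would simply replace the nowcast with the realised print, giving the upper bound on what the rule could capture.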


There is not only one way to judge a forecast. In all cases, we need a large sample – we cannot judge a single forecast in isolation. Some metrics, like mean absolute error, are fairly straightforward. If we are using a benchmark, we need to make sure that we are comparing like with like. Ultimately, for a trader, the true value of a forecast is whether they can monetise it through trading. Creating a trading strategy based upon a forecast can help answer this question. Admittedly, coming up with a trading rule is not always easy, and my experience is that it does require a bit of lateral thinking. However, if you can do it, it’ll help to ascertain the value of the forecast for your trading.