Selecting datasets for financial models

Any building is made of many materials. Some, such as glass, are common to pretty much every building. Other materials are more specific. Take bricks, for example: they are fairly common in residential buildings, but these days it is fairly unusual to see the outer walls of a skyscraper made of bricks. Instead, the facades will typically be a mixture of glass and steel, whilst the core will typically be concrete and steel. Then we have reams of copper for electrical cabling and so on. How do civil engineers and architects decide upon the materials that will be used? I really don’t have much of an idea, although I suspect it comes down to a mixture of design and practical considerations, like suitability and cost.


When we come to financial models, what are the raw materials for our “building work”? In this case, our raw material obviously isn’t bricks and mortar, it’s data. We need some way of selecting which data to use in our “building work” before we get started. In The Book of Alternative Data, Alexander Denev and I discussed at length the types of considerations you might have when it comes to selecting datasets for a particular model. You might have various checklists to go through to help shortlist particular datasets for the later research phase. These checklists can include selecting datasets by frequency, asset class applicability and so on.
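To make the checklist idea concrete, here is a minimal sketch of how such a shortlisting step might look in code. The dataset names, fields and frequency ranking are all hypothetical, purely for illustration:

```python
# Hypothetical dataset metadata; in practice this would come from a data catalogue
datasets = [
    {"name": "credit_card_panel", "frequency": "daily", "asset_classes": {"equities", "macro"}},
    {"name": "satellite_parking", "frequency": "weekly", "asset_classes": {"equities"}},
    {"name": "fed_communications", "frequency": "ad hoc", "asset_classes": {"rates", "macro"}},
]

def shortlist(datasets, min_frequency, asset_class):
    # Rank frequencies so "at least weekly" style criteria can be compared
    freq_rank = {"ad hoc": 0, "monthly": 1, "weekly": 2, "daily": 3}
    return [
        d["name"] for d in datasets
        if freq_rank.get(d["frequency"], 0) >= freq_rank[min_frequency]
        and asset_class in d["asset_classes"]
    ]

# Shortlist datasets that are at least weekly and applicable to equities
print(shortlist(datasets, "weekly", "equities"))
```

The point is not the code itself, but that a few simple, mechanical filters can quickly whittle a large universe of candidate datasets down to a manageable shortlist before any number crunching begins.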


These broad-based checklists are a crucial part of the process. You might also consider narrowing your shortlist further by working backwards from your specific use case, using your domain knowledge of markets to help narrow your dataset choice. We can illustrate this with an example. Let’s say we are trying to model monetary policy, and in particular the Fed. Let’s first think about the Fed’s mandate. Clearly, the inflation rate and the unemployment rate are important.


If we take a step back, we can think about ways of modelling each of these economic variables, and in particular, we can consider what types of datasets would be useful, before we start number crunching. What proxies can we think of for these specific data points (other economic time series?), and would alternative data be helpful? When selecting alternative datasets, we need them to be reasonably representative of the broad economy, or at least usable as a bellwether. In many cases we start from a very macro idea, such as retail sales, and keep working backwards until we reach those responsible for that dataset, the consumers themselves, and the datasets they actually create, like credit card data. We also need to think more broadly about market stability. Which markets are most important when it comes to monitoring by the Fed? Clearly, the rates market is important for understanding the Fed, to see what is being priced for future monetary policy. Stocks and risk sentiment are also important, and so on.
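Once we have a candidate proxy, a first sanity check is simply how well it tracks the macro series we care about. The sketch below uses made-up toy numbers for year-on-year retail sales growth and aggregated credit card spending growth, purely to illustrate the kind of check you might run before investing research time in a dataset:

```python
import statistics

# Toy, made-up series: year-on-year growth of the official retail sales
# release (target) and of an aggregated credit card spending panel (proxy)
retail_sales = [1.2, 1.5, 0.9, -0.4, -2.1, 0.3, 1.1, 1.8]
card_spending = [1.0, 1.4, 1.1, -0.2, -1.8, 0.1, 0.9, 1.6]

def pearson(x, y):
    # Plain Pearson correlation: covariance over product of standard deviations
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (
        sum((a - mx) ** 2 for a in x) ** 0.5
        * sum((b - my) ** 2 for b in y) ** 0.5
    )

print(round(pearson(retail_sales, card_spending), 2))
```

In practice you would want far more than an in-sample correlation, for example checking whether the proxy is available ahead of the official release, and whether the relationship holds out of sample, but a quick check like this helps discard poor proxies early.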


It is also important to consider the datasets which the Fed itself produces, not purely third party datasets. These include its own forecasts of growth and inflation. There is also a plethora of Fed communications we can look at, to gauge sentiment using natural language processing (and as a bit of a plug, Cuemacro produces a Fed communications dataset, if you’re interested in purchasing that!).
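To give a flavour of what scoring Fed communications might involve, here is a deliberately toy dictionary-based sketch. The word lists and the sample sentence are my own invented examples; a real pipeline would use proper NLP models rather than keyword counting:

```python
# Toy hawkish/dovish word lists, hypothetical and far from exhaustive
HAWKISH = {"inflation", "tightening", "restrictive", "hikes", "elevated"}
DOVISH = {"accommodative", "easing", "cuts", "softening", "patient"}

def hawkish_score(text):
    # Score in [-1, 1]: +1 fully hawkish, -1 fully dovish, 0 neutral/no signal
    words = [w.strip(".,").lower() for w in text.split()]
    hawk = sum(w in HAWKISH for w in words)
    dove = sum(w in DOVISH for w in words)
    total = hawk + dove
    return 0.0 if total == 0 else (hawk - dove) / total

statement = "Inflation remains elevated and further tightening may be appropriate."
print(hawkish_score(statement))
```

Even a crude score like this, tracked over successive statements, hints at how a time series of Fed sentiment could be constructed, which is essentially what more sophisticated NLP-based datasets do at scale.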


Ok, so my Fed example is quite specific. But the key point is that we need to start from the problem we want to solve, e.g. the Fed, modelling a specific earnings number etc. From there, we keep working backwards to identify the variables we want to model as part of that. If we keep going further back, we’ll quickly be able to identify datasets that can help us, which may not initially have been obvious. Clearly, having an understanding of the types of datasets out there is also helpful, but I would argue the key factor in all of this is good knowledge of the markets. Without that, we might end up picking totally inappropriate datasets, and wasting time unnecessarily in our research phase.