Combining different data sources

I used to like Lego when I was a kid, and perhaps a part of me still does. Nostalgia aside, the Lego sets available these days are much better than when I was a kid. Some sets are exceptionally detailed replicas of major buildings, cars etc. They are also considerably more expensive than I remember. But the key point with all these Lego sets is that they are all made of Lego, by the same company. There aren’t random metal pieces thrown in; it’s all one material. The fact that everything is a Lego block makes it easier to build.

 

With financial data, if only all the data snapped together easily! In practice, when researching financial markets, you might be handling datasets from many different sources, which you need to join together. The way you download data will probably vary, because each vendor has a different API (my findatapy open source library tries to alleviate this problem by creating a common API for many different vendors, including Bloomberg, Quandl etc.).
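
As a rough illustration, here is the kind of workflow findatapy encourages, based on the usage pattern in its README (the exact argument names may differ between versions, and each vendor requires its own credentials): you describe the data you want once, and switch vendors by changing a single parameter.

```python
# Sketch based on findatapy's documented usage pattern; argument names
# may vary between versions, and vendor API keys/credentials are required.
from findatapy.market import Market, MarketDataGenerator, MarketDataRequest

market = Market(market_data_generator=MarketDataGenerator())

# Describe the data once: daily EUR/USD close prices over a fixed window
md_request = MarketDataRequest(
    start_date="01 Jan 2021", finish_date="01 Jun 2021",
    category="fx", freq="daily", data_source="bloomberg",
    tickers=["EURUSD"], fields=["close"])

df_bloomberg = market.fetch_market(md_request)

# Switching vendor is (ideally) just a change of one parameter
md_request.data_source = "quandl"
df_quandl = market.fetch_market(md_request)
```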

 

Even if the data is structured, the format will vary between different datasets. Price data might appear simple, but different vendors will likely have different tickers for the same asset, and field names will vary too. Hence, you will have to normalize to a common ticker, either one you define yourself (if it’s a relatively small number of assets), or more likely a common standard like FIGI. You’ll also want common field names.
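
As a small sketch (the vendor tickers, field names and mappings below are all made up for illustration), normalizing to a common ticker and common field names can be as simple as a couple of mapping dictionaries applied to each vendor’s output:

```python
import pandas as pd

# Hypothetical raw downloads from two vendors, each with its own
# ticker convention and field names (all names here are illustrative)
df_vendor_a = pd.DataFrame(
    {"PX_LAST": [1.10, 1.11]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05"]))
df_vendor_a["ticker"] = "EURUSD Curncy"

df_vendor_b = pd.DataFrame(
    {"close": [1.10, 1.11]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05"]))
df_vendor_b["ticker"] = "EUR/USD"

# Map each vendor's ticker to one internal (or FIGI-style) identifier,
# and each vendor's field name to a common field name
ticker_map = {"EURUSD Curncy": "EURUSD", "EUR/USD": "EURUSD"}
field_map = {"PX_LAST": "close", "close": "close"}

def normalize(df, ticker_map, field_map):
    df = df.rename(columns=field_map)
    df["ticker"] = df["ticker"].map(ticker_map)
    return df

combined = pd.concat([normalize(df_vendor_a, ticker_map, field_map),
                      normalize(df_vendor_b, ticker_map, field_map)])
```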

 

With economic data you’ll have to deal with more challenging point-in-time issues: the same data point can be revised several times, and you need to keep track of each release. Then when we get to alternative data, the challenge of structuring the data is even more difficult, such as tagging data points with the appropriate asset ticker, or understanding text. Having said that, these days there are many open source libraries to help in areas such as NLP, and many data vendors will sell more structured data which is easier to consume.
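
A minimal sketch of the point-in-time problem, using entirely made-up numbers: store every release of a data point alongside its release date, so you can reconstruct what was actually known on any given day rather than accidentally using revised figures.

```python
import pandas as pd

# Illustrative vintages of a monthly economic release: each reference
# period can be revised, so we keep every (reference period, release date) pair
vintages = pd.DataFrame({
    "reference_period": pd.to_datetime(["2021-01-31", "2021-01-31", "2021-02-28"]),
    "release_date": pd.to_datetime(["2021-02-15", "2021-03-15", "2021-03-15"]),
    "value": [0.5, 0.7, 0.4]})   # first print, revision, first print

def as_of(vintages, date):
    """Return the latest value for each reference period known on `date`."""
    known = vintages[vintages["release_date"] <= pd.Timestamp(date)]
    return (known.sort_values("release_date")
                 .groupby("reference_period")["value"].last())

print(as_of(vintages, "2021-02-20"))  # only January's first print is known
print(as_of(vintages, "2021-03-20"))  # includes January's revision and February
```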

 

Once the data has been cleaned and structured, there can be other preprocessing steps that might be required. The various datasets can also be of different frequencies, which you’ll need to take into account and which may require resampling. You might also need to transform the data as appropriate before you begin to analyze it, asking questions such as the following (a short sketch of these steps follows the list):

 

  • do you need to calculate returns?
  • do you need to smooth the data?
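
Here is a short sketch of these preprocessing steps, using synthetic data purely for illustration: resampling a daily series to the frequency of a monthly one, calculating returns, and applying a simple rolling-mean smoothing.

```python
import numpy as np
import pandas as pd

# Toy example: a daily price series and a monthly economic series,
# aligned to a common monthly frequency before any analysis
dates = pd.bdate_range("2021-01-01", "2021-06-30")
prices = pd.Series(100 + np.cumsum(np.random.randn(len(dates))), index=dates)

monthly_econ = pd.Series([0.5, 0.7, 0.4, 0.6, 0.8, 0.3],
                         index=pd.date_range("2021-01-31", periods=6, freq="M"))

monthly_prices = prices.resample("M").last()   # downsample to month-end
returns = monthly_prices.pct_change()          # calculate returns
smoothed = returns.rolling(3).mean()           # simple 3-month smoothing

combined = pd.concat({"econ": monthly_econ,
                      "ret": returns,
                      "ret_smoothed": smoothed}, axis=1)
```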

 

Working with many datasets can be tricky, but it is often necessary. If you don’t do it, you are limiting the datasets that you can use in your models. Often there isn’t one “killer” dataset; rather, it is the combination of datasets which will yield a signal. All the steps I’ve described so far are very time consuming, but necessary. Only once you get through all the initial steps, like collecting/downloading the data, harmonizing tickers, preprocessing it etc., can you begin to analyze the data, feed it into any regressions and so on.
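
To make that last step concrete, here is a sketch (with random, synthetic data) of what the end of the pipeline can look like: joining the cleaned, aligned datasets on a common index and running a simple regression of returns on a couple of candidate signals.

```python
import numpy as np
import pandas as pd

# Purely illustrative: once the cleaned datasets share an index and
# common identifiers, combining them is a straightforward join, and only
# then can you regress returns on the candidate signals
idx = pd.date_range("2020-01-31", periods=24, freq="M")
returns = pd.Series(np.random.randn(24) * 0.02, index=idx, name="ret")
signal_a = pd.Series(np.random.randn(24), index=idx, name="signal_a")
signal_b = pd.Series(np.random.randn(24), index=idx, name="signal_b")

data = pd.concat([returns, signal_a, signal_b], axis=1).dropna()

# Simple OLS via numpy: ret ~ const + signal_a + signal_b
X = np.column_stack([np.ones(len(data)), data["signal_a"], data["signal_b"]])
beta, *_ = np.linalg.lstsq(X, data["ret"].values, rcond=None)
print(dict(zip(["const", "signal_a", "signal_b"], beta)))
```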