Mince pie data


The sun is slowly receding, the sky is darkening earlier each day, the leaves have fallen: signs that Christmas is drawing near. When I was younger, there was something which puzzled me most of all about Christmas. That something was the humble mince pie. I could not quite understand where precisely the mince was, since it tasted of anything but minced meat and was more a myriad of sweetness than savoury. Indeed, it seems I was not alone in my youthful mince pie bewilderment, judging from a Twitter exchange on the subject just a few days ago initiated by @moyeenislam. Whilst they might be called mince pies, in the UK, as I found out later, they no longer contain any meat (although somewhat irrelevant to the rest of the article, when it comes to Christmas pastries, I’d pick stollen over mince pie!). So mince pie as a term is basically a bit of a misnomer.


When it comes to data, I often wonder whether alternative data is a term which can be misunderstood, just like mince pies, because it might actually cover datasets which at origin are not really “alternative” at all. For example, let’s think about communications from the FOMC. They aren’t really “alternative” in one sense, given we have had access to FOMC speeches, statements and so on for generations. Traders and economists have mused upon what the FOMC has been saying for years. In this instance, what is “alternative” is that we can use new techniques to understand FOMC communications in a systematic manner, most notably using NLP (natural language processing) to analyse them. Indeed, I’ve spent a lot of time seeking to quantify FOMC communications using NLP. In effect, what’s alternative in this context is the conversion of unstructured data (FOMC communications) to structured data (time series of sentiment), rather than the underlying data input.
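To make the idea of turning unstructured text into a structured time series a little more concrete, here is a minimal sketch in Python. It is not the approach I use in my own research: the word lists, the scoring rule, the dates and the example sentences are all invented placeholders (none of it is real FOMC text), and a serious NLP approach would be far more involved. It simply counts "hawkish" and "dovish" words in each dated document and returns one score per date.

```python
import re
from datetime import date

# Hypothetical word lists, invented purely for illustration.
HAWKISH = {"tighten", "tightening", "inflation", "raise", "hike"}
DOVISH = {"accommodative", "easing", "cut", "patient", "slack"}


def sentiment_score(text):
    """Return (hawkish - dovish) / matched words, in [-1, 1]; 0.0 if nothing matches."""
    words = re.findall(r"[a-z]+", text.lower())
    hawk = sum(w in HAWKISH for w in words)
    dove = sum(w in DOVISH for w in words)
    total = hawk + dove
    return (hawk - dove) / total if total else 0.0


def to_time_series(documents):
    """Map (date, text) pairs to a date-sorted list of (date, score) pairs."""
    return sorted((d, sentiment_score(t)) for d, t in documents)


# Made-up placeholder sentences and dates, NOT real central bank communications.
docs = [
    (date(2000, 2, 1), "We may raise rates if inflation persists."),
    (date(2000, 1, 1), "Policy remains accommodative and we can be patient."),
]

series = to_time_series(docs)
for d, score in series:
    print(d.isoformat(), score)
```

The output is an ordinary time series (date, number), which can then be treated like any other structured dataset, even though the input was free text.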


Of course, there are many datasets which really are alternative by the very nature of the way they’re collected. Recently, I did a project looking at using proprietary data associated with internet searches related to Investopedia. Whilst the focus of the research was an unusual dataset, the approach to using it was structured around a very simple idea: can we understand investor anxiety based on internet searches? (Indeed, Investopedia have created the Investopedia Anxiety Index to do just that!)


The key point with all these datasets, alternative or not, is that we should try to use those where we have a hypothesis. Just because a dataset is alternative and exotic in nature doesn’t necessarily make it magically useful. We need to have a hypothesis of how to use it, why it is relevant for traders and why it is worth spending time investigating it. Let us for a moment return to our humble mince pie. Let’s say a chef wants to make an even better dessert than a mince pie. One approach would be simply to throw the most expensive ingredients possible into a mixture and ignore anything cheap. I somehow think a truffle and caviar dessert is likely to taste awful… it’s the everyday ingredients like sugar and butter that matter more for a dessert! In the same way, randomly mixing together exotic datasets without much thought beforehand is unlikely to result in a robust out-of-sample result (even if we might be able to use data mining to get a nice in-sample result).
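The point about data mining can be illustrated with a small experiment, sketched below in Python under some loud assumptions: the "target" and all the candidate "datasets" are pure random noise generated on the spot, and the sample sizes and number of candidates are arbitrary choices of mine. The sketch picks whichever noise series has the strongest in-sample correlation with the target, then checks the same series on data it has not seen.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
N_IN, N_OUT, N_CANDIDATES = 30, 30, 200  # arbitrary illustrative sizes

# Everything here is noise: there is no genuine relationship to find.
target = [rng.gauss(0, 1) for _ in range(N_IN + N_OUT)]
candidates = [
    [rng.gauss(0, 1) for _ in range(N_IN + N_OUT)] for _ in range(N_CANDIDATES)
]

# "Data mine": score every candidate on the in-sample window only.
in_sample = [abs(pearson(c[:N_IN], target[:N_IN])) for c in candidates]
best = max(range(N_CANDIDATES), key=in_sample.__getitem__)
best_in = in_sample[best]

# Evaluate the chosen candidate on the held-out window.
best_out = abs(pearson(candidates[best][N_IN:], target[N_IN:]))

print(f"best in-sample |corr|:        {best_in:.2f}")
print(f"same series out-of-sample:    {best_out:.2f}")
```

With this many candidates and so few observations, one would typically expect the best in-sample figure to look far stronger than the out-of-sample one, despite there being nothing real to find. A hypothesis formed beforehand narrows the list of candidates, which is one way of reducing this risk.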


So, yes, do take a look at alternative data (I certainly do!). However, don’t forget all the other data sources you’ve always used, which can sit alongside alternative data. Sometimes, you might have overlooked those datasets you’ve always had access to, but never really thought about analysing more closely (perhaps in a novel way). Most of all, a hypothesis is just as important with alternative data as it is with any sort of data you choose to use during trading. If you want to read more about my thoughts on alternative data (and open source), take a look at my interview with International Business Times and my short clip on Bloomberg TV, where I discussed this recently.