People usually ask me several types of questions. The most common one is where I think you can find the best burger. I do have an answer to this (which may or may not be a secret), but in order to validate this response, I’ll need to do a lot more burger-based research in the coming years. There’s obviously a financial (and possibly health-related) cost to this research, but I’m willing to make the sacrifice to find the perfect burger.
When it comes to work-related questions, the subjects I usually get asked about revolve around FX, alt data, Python and so on. When it comes to alt data, a pretty common question is some variation on the following: what dataset should I look at when trading? There is no generic answer that works for every use case. In The Book of Alternative Data, Alexander Denev and I went through quite a few different use cases across a number of asset classes and datasets to give a flavour. At the current time, there are thousands of alt datasets from many different vendors, so picking a single dataset from this list is extremely difficult. In particular, doing it properly would require a lot of due diligence and research time. Large funds have dedicated teams of data strategists sifting through datasets to shortlist appropriate ones, and then research teams to do the number crunching on a number of those.
The answer I usually give is that it depends on how much structuring work you want to do on a dataset. If you are willing to spend the time (and have a budget), text-based datasets can be a nice place to start. One reason is that there are a lot of text datasets out there. In many cases vendors will have already structured the text for you, adding useful labels to help you make sense of it. You can also augment these with your own web scraping, if you have enough time, bearing in mind that you’ll have to do the additional structuring yourself using NLP, which can be challenging. Python does have lots of NLP libraries to help with this.
Each text dataset requires a different approach. Social media has a different tone to, say, newswire data, and the use cases can be different. Social media might be good at gauging the buzz around certain keywords, but newspaper articles will have more depth and catch more nuances around the subject. There’s also an issue of trust when it comes to social media. With a well-respected financial organisation, customers pay to get news that is as accurate as possible. With social media, that’s not the case: the more outlandish a claim, the more likely it is to be retweeted! Social media can still be a valuable dataset. All I’m suggesting is that by combining multiple datasets (e.g. news with social media, which can include having more than one news or social media dataset too), you’ll probably get a better signal. Indeed, this is a general approach we discussed in the book: it’s not usually the case that you’ll use one dataset in isolation. Instead, you’ll often have many that you use together in a trading model. This can also include combining different types of alt data, e.g. consumer transaction data with car counts derived from satellite images to gauge EPS for a retailer.
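To make the idea of combining datasets a bit more concrete, here is a minimal sketch in Python. It is not the method from the book, just one simple way of doing it: put each series on a comparable scale using z-scores, then take an equal-weighted average. The function names, the equal weighting and the sample numbers are all made up for illustration, and the two series are assumed to be already aligned on the same dates.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardise a series so different datasets are on a comparable scale."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series carries no information, so map it to zeros
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combine_signals(*series):
    """Equal-weighted average of z-scored series, aligned by position."""
    standardised = [zscores(s) for s in series]
    return [mean(points) for points in zip(*standardised)]


# Hypothetical daily readings for the same five dates (illustrative numbers only)
news_sentiment = [0.10, 0.30, -0.20, 0.40, 0.05]   # e.g. a score in [-1, 1]
social_sentiment = [5.0, 9.0, -3.0, 12.0, 1.0]     # e.g. a net count of positive posts

combined = combine_signals(news_sentiment, social_sentiment)
print(combined)
```

In practice you would probably want to weight the datasets by how much you trust them, and compute the z-scores on a rolling window so that you are not using future information.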
There’s also a lot of flexibility in a text dataset. You can aggregate sentiment metrics to get an idea of market positioning. There’s also the possibility of using news counts to get an idea of news supply. Sometimes you might also be able to get an idea of the demand for news if you have readership statistics, or things like retweet numbers. (In the book, we go into a bit more detail about the difference between news demand and supply.) Text is used in many different asset classes, primarily in equities, but it can also be used in FX and macro (although admittedly, I still think not enough folks look at it in macro, which is probably good if you actually are!).
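As a rough illustration of the aggregation described above, the sketch below groups article-level records by day and computes an average sentiment, a news count (as a proxy for supply) and a total reader count (as a proxy for demand). The record layout, the field names and the numbers are hypothetical; a real vendor dataset will have its own schema and its own way of scoring sentiment.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical article records: (publication date, sentiment score, reader count)
articles = [
    (date(2024, 1, 2), 0.6, 1200),
    (date(2024, 1, 2), -0.2, 300),
    (date(2024, 1, 3), 0.1, 800),
    (date(2024, 1, 3), 0.5, 1500),
    (date(2024, 1, 3), -0.4, 200),
]


def aggregate_daily(records):
    """Group article-level records by day and compute simple daily metrics."""
    by_day = defaultdict(list)
    for day, sentiment, readers in records:
        by_day[day].append((sentiment, readers))

    daily = {}
    for day, items in sorted(by_day.items()):
        daily[day] = {
            "avg_sentiment": mean(s for s, _ in items),   # aggregated sentiment
            "news_count": len(items),                     # proxy for news supply
            "total_readers": sum(r for _, r in items),    # proxy for news demand
        }
    return daily


daily_metrics = aggregate_daily(articles)
for day, metrics in daily_metrics.items():
    print(day, metrics)
```

The same pattern works for retweet numbers in place of reader counts, or for grouping by keyword or ticker rather than by day.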
There is no perfect answer as to which alternative dataset you should try. The exact dataset depends on your use case. Use cases can vary massively between investors, depending on their asset class, trade horizon and so on. However, as a place to start, I think text can be a nice choice. There are lots of ways you can utilise text datasets, and many possibilities, ranging from social media to newspapers, which can complement one another.