A few ideas from natural language processing

We’re all smarter than computers! One area where we have a particular advantage is understanding human language. For computers, human language is perhaps (excuse the pun) somewhat inhumane. After all, computers like numbers. Humans (well, some of us) also like numbers, but to communicate, unsurprisingly, we stick to using human language. Its sheer ambiguity is what makes it human. If language were wonderfully precise, literature would be very dull! Over the past few years I’ve become increasingly interested in natural language processing (NLP), the area of study which looks at how computers can understand human language, and in how such analysis can help to forecast markets.


I’ve done a number of projects looking at text from newswire organisations, examining how it can be used to trade FX systematically, including most recently a project for Bloomberg, examining text from Bloomberg News (a summary of the project is on Bloomberg’s website). Alexander Denev and I are also co-authoring “The Book of Alternative Data”, which will be published by Wiley next year. Perhaps unsurprisingly, part of the book will be devoted to text in its various forms, such as news and social media, and how it can be used for trading, alongside a lot of specific use cases. I recently started adding to this section, and as such I thought it would be useful to do a very (very) quick summary of the topic. I also recently went to a talk by Flavia Poma at Quant Invest Europe, which gave a very nice introduction to natural language processing.


In order to make text usable for trading (or indeed for most purposes), it first needs to be structured. The basic issue with text is that it’s typically not in a standardised form, i.e. it’s unstructured. Structuring it typically involves adding descriptive fields (metadata), which give a bit of insight into what the text is about. This might involve tagging entities (e.g. countries, tradable assets, well-known people). Often data vendors will do a lot of the heavy lifting for you in terms of structuring text. If we want to trade EUR/USD based upon news articles, we first need to have an idea of which articles are important for trading it. Clearly, expecting every news article to impact EUR/USD is not realistic!
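To make the idea of tagging and filtering a little more concrete, here is a minimal sketch in Python. It is only an illustration: the keyword lists, the tag names and the function names (tag_entities, is_relevant) are all made up for this example, and a real vendor feed would use far more sophisticated entity recognition than simple keyword matching.

```python
# Hypothetical lookup from entity tag to the keywords that signal it.
# A real vendor feed would supply far richer, curated tags.
ENTITY_KEYWORDS = {
    "EUR": {"euro", "ecb", "eurozone"},
    "USD": {"dollar", "fed", "fomc"},
}


def tag_entities(text):
    """Return the sorted entity tags whose keywords appear in the text."""
    # Lower-case and strip basic punctuation so "ECB," matches "ecb".
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    return sorted(tag for tag, kws in ENTITY_KEYWORDS.items() if words & kws)


def is_relevant(text, required_tags=("EUR", "USD")):
    """Keep an article if it mentions at least one of the required entities."""
    return any(tag in tag_entities(text) for tag in required_tags)


# Made-up headlines, purely for illustration.
headlines = [
    "ECB signals a rate cut as eurozone growth slows",
    "Fed holds rates steady, dollar rallies",
    "Local bakery wins national pastry award",
]

# Attach metadata to each headline, then keep only the relevant ones.
structured = [{"text": h, "entities": tag_entities(h)} for h in headlines]
relevant = [row["text"] for row in structured if is_relevant(row["text"])]
```

Here the first two headlines are kept as potentially relevant for EUR/USD, while the third is dropped because it carries no matching tags.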


It can also often involve identifying the sentiment: is it generally positive or negative in tone? Before any of this is done though, we’ll need to work on tasks such as word segmentation, identifying what are actually words. It isn’t always as easy as simply splitting on spaces, given “words” such as “Burger King”, which need to be viewed together to have a specific meaning. It also depends on the language, e.g. Chinese needs different word segmentation algorithms.
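As a rough sketch of the “Burger King” problem, the snippet below splits on whitespace and then glues known two-word expressions back together. The list of expressions and the function name segment are hypothetical, chosen just for this example; libraries such as NLTK offer far more complete tokenizers, and this approach would not work at all for a language like Chinese.

```python
# Hypothetical list of two-word expressions that should stay together.
MULTI_WORD = {("burger", "king"), ("new", "york")}


def segment(text):
    """Split on whitespace, then merge known two-word expressions."""
    raw = text.split()
    tokens = []
    i = 0
    while i < len(raw):
        # Look at the current word and the next one (if there is one).
        pair = tuple(w.lower() for w in raw[i:i + 2])
        if len(pair) == 2 and pair in MULTI_WORD:
            tokens.append(" ".join(raw[i:i + 2]))
            i += 2
        else:
            tokens.append(raw[i])
            i += 1
    return tokens


naive = "Burger King opened in New York".split()
better = segment("Burger King opened in New York")
```

The naive split gives six tokens, treating “Burger” and “King” as unrelated words, whereas segment returns four tokens with “Burger King” and “New York” kept whole.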


We generally need to convert the text into some sort of numerical form to make it easier to deal with. One of the simplest representations is a vector which gives you the frequency of each word in a piece of text. There are more involved representations, known as word embeddings, which can be constructed using algorithms such as word2vec. Rather than counting words, word2vec learns a vector for each word by training a model to predict which words tend to appear near one another. The word2vec representation can help us to understand the relationships between words. One example you see very often is word2vec applied to all the text in Wikipedia. These vectors can then be used in further processing work, such as the creation of a sentiment classifier.
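The simplest of these representations, a vector of word frequencies, can be sketched in a few lines of Python. This is just an illustrative sketch: the two example sentences and the function names (build_vocabulary, frequency_vector) are invented here, and the tokenisation is a plain whitespace split.

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of every distinct lower-cased word across the documents."""
    return sorted({w for doc in documents for w in doc.lower().split()})


def frequency_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    # Counter returns 0 for words that never appear in this document.
    return [counts[word] for word in vocabulary]


# Made-up sentences, purely for illustration.
docs = ["euro rises as dollar falls", "dollar falls again and again"]
vocab = build_vocabulary(docs)
vectors = [frequency_vector(d, vocab) for d in docs]
```

Each document becomes a list of counts with one position per vocabulary word, so texts of different lengths end up as vectors of the same size, which is what most downstream models expect.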


You find famous results such as king – man + woman = queen (explanation here), which have simply been learnt from the text, as opposed to being taught as a specific rule. Indeed, this is an example of a more data-driven, machine learning approach to NLP, as opposed to a more traditional rules-based approach. The difficulty with a rules-based approach is that the rules of language can become very complicated and require a lot of specialised work.
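The arithmetic behind king – man + woman = queen can be illustrated with a toy example. Note the big assumption: the two-dimensional vectors below are made up by hand (one axis loosely for “royalty”, one for “gender”), whereas real word2vec vectors are learnt from text and typically have hundreds of dimensions. The function names cosine and analogy are likewise just for this sketch.

```python
import math

# Toy 2-d vectors, hand-made for illustration: (royalty, gender).
# Real word2vec vectors are learnt from text, not written by hand.
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two non-zero 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(positive_a, negative, positive_b):
    """Nearest word to positive_a - negative + positive_b, excluding the inputs."""
    target = tuple(
        a - n + b
        for a, n, b in zip(
            VECTORS[positive_a], VECTORS[negative], VECTORS[positive_b]
        )
    )
    # As is conventional, the three input words are not allowed as answers.
    candidates = {
        w: v for w, v in VECTORS.items()
        if w not in (positive_a, negative, positive_b)
    }
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


result = analogy("king", "man", "woman")
```

With these toy vectors the result is “queen”. The interesting part in practice is that a trained model arrives at vectors with this kind of structure purely from the text it has seen.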


The great thing is that these days, we typically don’t need to write all our own NLP libraries from scratch. There are open source libraries for Python, such as NLTK, to get us started. Even so, we do need to think carefully about the task we wish to solve. In particular, having domain understanding is very important, such as understanding how news impacts your specific market.


If you’re interested in NLP, hopefully you’ll take a look at our new book next year, The Book of Alternative Data! In the meantime, if you’re interested in the subject, and would be interested in Cuemacro working on a project for you in the NLP space (or indeed anything FX/alternative data related), feel free to drop me a message!