Machine learning the mundane

20171119 New York

Try to think of the “buzziest” buzzwords you can. Support vector machines? No. Random forests? No. However, group these together (and much more) under the umbrella of “machine learning”, and suddenly we have created a buzzword! The basic idea of many of the techniques which underpin machine learning is to find relationships between variables. In particular, we do not need to specify the form of the relationship beforehand (e.g. linear). In fact, the relationships might be highly non-linear. We need to be careful, though, that we are not data mining too much and ending up fitting to noise. For example, let’s say that there’s a generally linear relationship between two variables. If we have a very small number of observations, the “best” fit found using a machine learning technique could involve actually joining up the points, as opposed to drawing a straight line of best fit. This will obviously fit nicely in our sample, but when we throw new data in, we will likely find that a straight line of best fit works better. The difficulty with finance is that relationships tend to be less stable (financial time series are not stationary), and often we don’t have sufficient data. There is a trade-off between optimal solutions in sample and robustness out of sample when we create a trading rule. In other words, we want to have at least some fitting, but equally, if we overdo it, the rule will just look good on paper and not when we’re actually running real money.
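To make the join-the-dots example concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the "true" relationship (y = 1 + 2x), the noise level, the six training points and the function names are all made up, and piecewise-linear interpolation simply stands in for an overly flexible model. It is not any particular machine learning technique.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x


def join_the_dots(xs, ys):
    """Piecewise-linear interpolation through every training point.

    Assumes xs is sorted; outside the training range it stays flat.
    """
    def predict(x):
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        for i in range(len(xs) - 1):
            if xs[i] <= x <= xs[i + 1]:
                w = (x - xs[i]) / (xs[i + 1] - xs[i])
                return (1 - w) * ys[i] + w * ys[i + 1]
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate(n_trials=200, n_test=200, noise=1.0, seed=0):
    """Average in-sample and out-of-sample errors over many noisy samples."""
    rng = random.Random(seed)
    truth = lambda x: 1.0 + 2.0 * x  # hypothetical linear relationship
    train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # deliberately tiny sample
    totals = {"line_in": 0.0, "dots_in": 0.0, "line_out": 0.0, "dots_out": 0.0}
    for _ in range(n_trials):
        train_y = [truth(x) + rng.gauss(0, noise) for x in train_x]
        test_x = [rng.uniform(0, 5) for _ in range(n_test)]
        test_y = [truth(x) + rng.gauss(0, noise) for x in test_x]
        line = fit_line(train_x, train_y)
        dots = join_the_dots(train_x, train_y)
        totals["line_in"] += mse(line, train_x, train_y)
        totals["dots_in"] += mse(dots, train_x, train_y)
        totals["line_out"] += mse(line, test_x, test_y)
        totals["dots_out"] += mse(dots, test_x, test_y)
    return {name: total / n_trials for name, total in totals.items()}


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

By construction, joining the dots has zero error in sample, because it passes through every training point. Out of sample it carries the noise of the nearest training points into each prediction, so under these assumptions the straight line should come out ahead on average.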


A funky way to use machine learning might be to infer trading rules directly, and this is often the most obvious question to ask. Throw in many, many features based on market data and similar datasets, and see if it can infer a trading rule which maximises P&L. However, this is very difficult to get to work in a live out-of-sample setting, for the reasons described above. In practice, we have a better chance of success (out-of-sample) if we create features which we think have an intuitive rationale. Some of these might be based on alternative datasets. We might also use machine learning to classify different types of market, to help us overweight or underweight signals.


Perhaps a less glamorous way to use machine learning in finance is to help with the mundane tasks. Many of the most successful applications of machine learning outside finance have been to problems like image classification. The great thing with this is that the problem doesn’t change (unlike financial time series). A car looks like a car in an image, and this doesn’t really change from year to year! So perhaps if we want to use machine learning in finance, rather than directly trying to tackle problems such as the creation of trading rules, what about the bits around them, which take up large amounts of our time? One area is natural language processing to identify the sentiment of a news article, which can be used as an input into a trading rule. I’m doing a lot of work on this at present, as part of a project for Bloomberg on articles from their Bloomberg News wire.


Another area where we can attempt to use machine learning is the preprocessing and cleaning of data. Cuemacro has created an index for measuring the sentiment of Fed communications, which involves natural language processing. The end index has a statistically significant relationship with the change in UST 10Y yields, which seems intuitive. What was the most time-consuming step? A large amount of time was spent on the very nontrivial step of collecting together all the Fed communications and speeches and sorting them, before doing any sort of natural language processing or index construction. Cleaning and structuring data can be very time-consuming for many datasets in finance, whether we are looking at market data or more complicated datasets such as text.


Text data can be sourced from many areas including databases, but also increasingly from the web and social media. What about trying to read numerical tables from PDFs or webpages? We can of course try to create solutions by hand, with lots of specific rules, for extracting the text we want from webpages or PDFs. However, in practice, maybe this is something we could try to do with machine learning in a more automated fashion, letting it discover the most important parts of the text and ignore those that are not, like the header menus on a website? (I’m not claiming to have solved this problem, by the way!) If we could do this effectively, we would free up a significant amount of time for the most valuable parts of research, namely thinking about ideas and implementing the trading rules themselves on top of these datasets.
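For a flavour of what the hand-written approach looks like, here is a minimal sketch using only Python's standard library `html.parser`. The list of tags to skip, the class and function names and the sample page are all hypothetical, made up for illustration. It is exactly the kind of brittle, rule-based extraction described above, not a machine learning solution.

```python
from html.parser import HTMLParser

# Hypothetical rule: treat anything inside these tags as page furniture.
SKIP_TAGS = {"nav", "header", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of a page with menus, footers and scripts removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page, purely for illustration.
SAMPLE_PAGE = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<p>The committee decided to leave rates unchanged.</p>
<footer>Contact us</footer>
</body></html>
"""

main_text = extract_main_text(SAMPLE_PAGE)
print(main_text)
```

The weakness is obvious: the rules only work for pages which mark up their menus with these particular tags. Every new website or PDF layout tends to need its own rules, which is why a learned approach is appealing.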


Sure, the most obvious idea for using machine learning is to discover trading rules (but this can be very challenging, for the reasons described above). However, perhaps it’s the more mundane ideas, such as cleaning and classifying data, where it’s worth thinking about machine learning too? OK, it’s not quite as funky, but generating nice and neat structured datasets from unstructured sources is often an important step in creating a trading strategy. Time saved there can be used elsewhere.