The gorilla & Radiohead in the backtest room

20170708 Backtest

A couple of years ago I remember visiting the opening of an art exhibition in Amsterdam with a few of my friends from university. I can’t really recall many of the paintings, which is probably more a function of my memory than of the art itself. One thing I do recall was a painting of a living room, which happened to have a primate, either a gorilla or a chimp, sitting in the middle of the room. For the sake of argument, let’s just assume that the primate in question was a gorilla. What was the meaning of a gorilla being in the room? Was it somehow related to the saying about there being a “gorilla in the room”? Luckily, the artist was there to answer. One of my friends asked him why there was a gorilla in the room. The artist simply replied that he just wanted to draw a gorilla. There was no profound meaning, no subtle subtext, no message. He simply wanted to draw a gorilla. Hence, whatever inferences an observer might be making about the meaning were, in all likelihood, wrong.


More recently, during the Glastonbury music festival, Radiohead was one of the headline acts. I never really used to like Radiohead when I was younger. Yet in recent years, the ethereal nature of their music has somehow chimed with me. There was an amusing story that, whilst Radiohead were tuning up their instruments at the start of their Glastonbury set, some of the audience supposedly mistook this for a new track. Whilst the story later turned out to be a total hoax (report from NME), the idea does illustrate my point about reading too much into something that is effectively random.


Why am I retelling this (slightly odd) story and talking about Radiohead? Just bear with me for a second, and I will attempt to answer! I recently wrote an article about errors you could make when backtesting a trading strategy. It was more related to coding problems you might encounter, which might make your backtest a poor representation of actual historical trading performance. However, I got a lot of comments asking why I hadn’t talked about data mining. Hence, I thought I’d tackle this issue here.

Let’s say your backtest is a “perfect” simulation of how the strategy would have performed historically. You have put in all the right transaction costs, calculated all the signals correctly etc. What you are probably most interested in understanding is whether this historical performance is likely to be replicated going forward. Data mining basically involves fitting the parameters of your trading strategy to work as well as possible historically. Doing this is likely to make the model less robust in the out-of-sample period (e.g. the future). Are we seeing historical performance because there is actually some sort of persistent behaviour, or because we have data mined the strategy to death?

Returning to our painting story: are we trying to come up with explanations for why the “gorilla” is in the painting when in practice there are none, or believing that random tuning notes are actually an intentional melody? How can you avoid data mining? What can you do to show that your backtest is not simply the result of data mining? Here are a few ideas below (this is by no means a full list!)
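To make the "reading meaning into noise" point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the market returns are made-up Gaussian noise, each "rule" is a daily coin flip, and the function names (sharpe, mine_noise) are hypothetical. It simply picks the rule with the best in-sample Sharpe ratio and then reports how that same rule did on the second half of the data.

```python
import random
import statistics


def sharpe(pnl, periods_per_year=252):
    """Annualised Sharpe ratio with a zero risk-free rate (0.0 if P&L never varies)."""
    sd = statistics.pstdev(pnl)
    if sd == 0:
        return 0.0
    return statistics.fmean(pnl) / sd * periods_per_year ** 0.5


def mine_noise(n_rules=200, n_days=500, seed=42):
    """Return (in-sample, out-of-sample) Sharpe ratios for random rules on random returns."""
    rng = random.Random(seed)
    # Hypothetical market: zero-mean Gaussian noise, so there is no edge to find.
    market = [rng.gauss(0.0, 0.01) for _ in range(2 * n_days)]
    results = []
    for _ in range(n_rules):
        # Each "rule" is just a coin flip between long (+1) and short (-1) every day.
        positions = [rng.choice((-1, 1)) for _ in range(2 * n_days)]
        pnl = [p * r for p, r in zip(positions, market)]
        # First half is the in-sample period, second half is out-of-sample.
        results.append((sharpe(pnl[:n_days]), sharpe(pnl[n_days:])))
    return results


results = mine_noise()
# "Data mining": keep whichever rule looked best in the first half of the data.
best_in, best_out = max(results, key=lambda pair: pair[0])
print(f"best in-sample Sharpe:    {best_in:.2f}")
print(f"same rule, out-of-sample: {best_out:.2f}")
```

With enough rules to choose from, the winner will tend to look attractive in-sample even though nothing here is predictable, and there is no reason for its out-of-sample figure to match. That is the gorilla in the painting.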


Do you have a hypothesis first for the trading rule?

As I have written many times before, having a hypothesis helps to avoid excessive data mining. If there’s a good rationale behind a strategy, it can help to give us confidence that the results we are seeing are not simply the result of data mining. Do we think that our hypothesis is likely to persist into the future? A strong hypothesis can also give us confidence about the performance of the strategy going forward. In practice, we can never precisely forecast the performance of a strategy in the future, but we can do our best to increase the likelihood that the performance persists.


Have you done some sensitivity analysis?

We will usually end up doing at least some element of data fitting. By simply choosing a certain hypothesis to test, we are already “cherry picking” (because we are not testing other strategies we could have chosen!). When we are examining our parameters, does only a very particular set of parameters work? Does the performance of the model suddenly fall off a cliff if we adjust the parameters slightly? If this is the case, maybe our strategy really isn’t that robust after all and we should avoid using it.
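One rough way to check for that cliff is to sweep a parameter and compare the best value against its immediate neighbours. The sketch below is only an illustration under stated assumptions: the trading rule (momentum_pnl) is a toy momentum rule I have made up for the example, the data is synthetic noise, and neighbour_gap is a hypothetical helper rather than any standard measure.

```python
import random
import statistics


def momentum_pnl(returns, lookback):
    """Toy rule: long if the trailing sum of returns is positive, otherwise short."""
    pnl = []
    for t in range(lookback, len(returns)):
        signal = 1 if sum(returns[t - lookback:t]) > 0 else -1
        pnl.append(signal * returns[t])
    return pnl


def sharpe(pnl, periods_per_year=252):
    """Annualised Sharpe ratio with a zero risk-free rate."""
    sd = statistics.pstdev(pnl)
    return 0.0 if sd == 0 else statistics.fmean(pnl) / sd * periods_per_year ** 0.5


def sensitivity(returns, lookbacks):
    """Sharpe ratio of the toy rule for each lookback (sample lengths differ slightly)."""
    return {lb: sharpe(momentum_pnl(returns, lb)) for lb in lookbacks}


def neighbour_gap(scores):
    """Best parameter, and how far its score sits above the mean of its adjacent values.

    Needs at least two parameter values; a large gap hints at a fragile optimum.
    """
    keys = sorted(scores)
    best = max(keys, key=scores.get)
    i = keys.index(best)
    neighbours = [scores[keys[j]] for j in (i - 1, i + 1) if 0 <= j < len(keys)]
    return best, scores[best] - statistics.fmean(neighbours)


rng = random.Random(1)
returns = [rng.gauss(0.0, 0.01) for _ in range(1000)]  # synthetic data, illustration only
scores = sensitivity(returns, [5, 10, 20, 40, 80])
best, gap = neighbour_gap(scores)
print(f"best lookback: {best}, gap to neighbours: {gap:.2f}")
```

If the best parameter stands far above both of its neighbours, that is a hint (not proof) that we have found a lucky spike rather than a broad region where the rule works.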


Do you really need this many parameters?

The more parameters you have, the more fitting you are likely to be doing. Ask yourself, do you really need to have this many parameters in the trading model? And what is each of those parameters actually doing?


Does the strategy work with other assets?

As well as holding back some data in time for out-of-sample testing, we can also choose to hold back some assets for later testing. This can be another way to check the robustness of our strategy. Of course, it’s not necessarily the case that a strategy will work with all assets in a specific asset class; there might well be specific reasons why not. Let’s say we have a really good way to estimate moves in the crude oil price. It is likely that this will be more useful for energy-related stocks than for, say, some random group of equities. For general factors (for example “trend”), we would hope that a strategy would work across a wide variety of asset classes as well. In practice, though, we might still need to tweak our model to account for differing levels of liquidity. Also, some generalised factors will just need to be expressed differently in different asset classes.
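Holding back assets can be as simple as deciding, before any research starts, which names you will not look at until the end. Here is a small sketch of that idea. The function split_universe and the FX universe are hypothetical and only there for illustration. A random split is also just one choice: for a strategy tied to specific assets (like the crude example above), you would pick the held-back names by hand.

```python
import random


def split_universe(assets, holdout_fraction=0.3, seed=0):
    """Randomly split asset names into a development set and a held-back set."""
    names = sorted(set(assets))  # sort so the split is reproducible for a given seed
    n_holdout = max(1, round(len(names) * holdout_fraction))
    rng = random.Random(seed)
    holdout = set(rng.sample(names, n_holdout))
    develop = [a for a in names if a not in holdout]
    return develop, sorted(holdout)


# Hypothetical FX universe, purely for illustration.
universe = ["EURUSD", "USDJPY", "GBPUSD", "AUDUSD", "USDCAD", "USDCHF"]
develop, holdout = split_universe(universe, holdout_fraction=0.5)
print("develop on:", develop)
print("hold back: ", holdout)
```

Fixing the seed and writing the split down up front matters more than the code itself: the held-back assets are only useful if you genuinely have not peeked at them.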


Can just a simple rule work?

If we use simple models for our strategy (let’s say linear regressions, for the sake of argument), does the strategy work? If it doesn’t work with simpler models, have we got a plausible explanation for why it doesn’t? Are we using something more complicated for a good reason, or are we just spending far too long tweaking the strategy? In practice, we want to use the simplest model possible that captures the behaviour we are studying (i.e. Occam’s Razor). I have never found a correlation between the profitability of a trading strategy and how complicated it is…! Complexity does not necessarily mean P&L!
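As a sketch of what such a simple baseline might look like, the snippet below fits a one-variable linear regression on the first part of the data and trades the sign of its prediction on the rest. The names (ols_fit, baseline_pnl), the indicator and the returns are all made up for illustration. The point is only that a baseline like this takes minutes to build, so a more complicated model has something to beat.

```python
import statistics


def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x (single regressor)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    var = sum((xi - mx) ** 2 for xi in x)
    if var == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / var
    return slope, my - slope * mx


def baseline_pnl(signal, next_returns, n_fit):
    """Fit on the first n_fit points, then trade the sign of the prediction afterwards."""
    slope, intercept = ols_fit(signal[:n_fit], next_returns[:n_fit])
    pnl = []
    for s, r in zip(signal[n_fit:], next_returns[n_fit:]):
        predicted = intercept + slope * s
        position = 1 if predicted > 0 else -1
        pnl.append(position * r)
    return pnl


# Made-up indicator values and the returns that followed them.
signal = [1, 2, 3, 4, 5, 6]
next_returns = [0.01, 0.02, 0.03, 0.04, -0.01, 0.02]
print(baseline_pnl(signal, next_returns, n_fit=3))
```

If the fancier model cannot clearly beat something like this out-of-sample, that is worth knowing before spending more time tweaking it.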


Is all the performance coming from one event (or a very small number of trades)?

Very often, when there are big market events, such as 2008, we might find that if we were on the “right side” in our backtest, it makes our strategy perform a lot better. Is our backtest basically just long stocks, when stocks were rallying in our sample? We should look at the performance in different sub-samples as well, to avoid this. To use a musical analogy, we don’t want our model performance to be simply a one (trade) hit wonder.
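Two quick checks along these lines are the share of total P&L coming from the best few trades, and the P&L split across sub-periods. The sketch below shows one way to compute both. The helper names (top_trade_share, subperiod_totals) and the trade P&L figures are hypothetical, chosen only to illustrate the idea.

```python
def top_trade_share(trade_pnls, k=1):
    """Fraction of total P&L contributed by the k most profitable trades.

    The result can exceed 1.0 when the remaining trades lose money overall.
    """
    total = sum(trade_pnls)
    if total <= 0:
        raise ValueError("total P&L must be positive for a share to be meaningful")
    best = sorted(trade_pnls, reverse=True)[:k]
    return sum(best) / total


def subperiod_totals(pnl, n_chunks):
    """Total P&L in each of n_chunks consecutive sub-periods (last one takes any remainder)."""
    size = len(pnl) // n_chunks
    if size == 0:
        raise ValueError("not enough observations for that many sub-periods")
    totals = []
    for i in range(n_chunks):
        end = (i + 1) * size if i < n_chunks - 1 else len(pnl)
        totals.append(sum(pnl[i * size:end]))
    return totals


# Made-up trade P&L where a single trade dominates.
trades = [10, 1, 1, -2, 90]
print(top_trade_share(trades, k=1))        # share of P&L from the single best trade
print(subperiod_totals(trades, n_chunks=2))  # P&L by sub-period
```

A strategy where one trade, or one sub-period, accounts for nearly all of the P&L deserves a much closer look before we trust the headline numbers.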


How much time have you spent developing the trading rule?

It can take a lot of time to create infrastructure for backtesting, and to collect and clean the data. I’m not really talking about that stage. If anything, it’s worth spending a lot of time there! I’m mainly referring to the period of actually developing a trading rule (I’m trying to exclude problems which by their nature need more time, like natural language processing or hard mathematical techniques). Essentially, have you spent a lot of time on what appears to be a simple trading rule? I’ve found that the more time I’ve spent on a trading rule historically, the greater the chance that I’m basically just data mining my way to a solution.