Nobody likes doing the boring stuff. We all want to go on holiday, but packing our bags is not the fun bit. We all want to have a nice burger (well, I do), but queuing for ages at an ever popular burger joint is not what we want. Traders all want to have positive P&L, but negative P&L happens sometimes. When it comes to data science, in whichever industry, cleaning data is going to take up a large amount of time. The statistic I often hear is that cleaning and loading data usually takes up around 80% of the time on a typical data science problem, leaving around 20% for doing the fun analytical stuff. Having a good, clean dataset is obviously a prerequisite for doing good analysis (we want to avoid the garbage in, garbage out scenario). If you are developing a trading strategy with poorly recorded price data, it will be more difficult to make inferences.
Precisely which cleaning/preparation steps you take depend on the type of data you’re dealing with. Price data is going to have very different properties compared to data derived from news, for example. If you’re cleaning and preparing data to develop trading strategies, there are a few basic steps you can take. This isn’t an exhaustive list, more a few quick approaches which I sometimes use when I get a new dataset.
Calculate intraday rolling realised vol for price data
If you use a range-based volatility measure with relatively short windows on a large time series, it can often be a good initial step for identifying unusual behaviour. It’s useful to use a range-based metric, rather than, for example, the close of every minute, since the latter basically ignores any points within the minute (which you also want to check for tick data). In pandas it’s fairly easy to resample data into regular intervals so that it gives you the open/high/low/close for each interval. You can try comparing your intraday rolling vol with the same metric calculated for price data for the same asset but from another source (eg. in FX, this could be a different ECN). You can also try comparing whole days of realised vol with overnight implied vol in FX, at least for the more liquid assets where this is available – the match is unlikely to be as close in this instance though.
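As a rough sketch of the resampling step, the snippet below builds 1-minute OHLC bars in pandas and computes a rolling range-based vol. I’ve used the Parkinson estimator here as one example of a range-based measure (the article doesn’t specify which one); the synthetic random-walk series is purely for illustration, standing in for real tick data.

```python
import numpy as np
import pandas as pd

# Synthetic 1-second price series, standing in for real tick data
rng = np.random.default_rng(42)
idx = pd.date_range("2024-01-02 08:00", periods=8 * 3600, freq="s", tz="UTC")
price = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 1e-4, len(idx)))), index=idx)

# Resample to 1-minute bars: gives open/high/low/close for each interval
bars = price.resample("1min").ohlc()

# Parkinson range-based variance over a rolling 30-minute window,
# using the high/low of each bar rather than just the close
hl_sq = np.log(bars["high"] / bars["low"]) ** 2
park_var = hl_sq.rolling(30).mean() / (4 * np.log(2))
park_vol = np.sqrt(park_var)
```

Plotting `park_vol` (or the same series computed from a second data source) is then a quick way to eyeball unusual behaviour.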
Do vol spikes in prices occur at times where you’d expect them?
We can also use our intraday rolling vol to assess whether spikes occur where we might expect, for example around central bank meetings or big economic data releases. This can also be a useful trick for identifying the timezone of data, particularly where there are less well defined closing times (eg. developed market FX, which is basically open 24/5 from Sunday evening to 5pm New York time on Friday). It might also be one way of working out whether there’s excessive latency in the feed (obviously, we need to make sure our events are properly timestamped to determine this). Since we know when regularly scheduled events occur, we can attempt to align the data so that the spikes coincide. Whilst it would be nice if everyone just recorded price data in UTC, I’ve often ended up getting data timestamped in many other timezones, like Chicago or New York time. I always save down any data on disk in UTC for consistency.
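The timezone housekeeping itself is straightforward in pandas. A minimal sketch, assuming (hypothetically) that a feed arrives with naive timestamps recorded in Chicago time:

```python
import pandas as pd

# Hypothetical feed: naive timestamps that we happen to know are Chicago time
idx = pd.date_range("2024-03-08 07:30", periods=3, freq="1h")

# First localise to the source timezone, then convert to UTC for storage
chicago = idx.tz_localize("America/Chicago")
utc = chicago.tz_convert("UTC")

# Chicago is UTC-6 before the March DST change, so 07:30 local is 13:30 UTC
```

Localising (rather than naively shifting by a fixed offset) means daylight saving transitions are handled for you, which matters when the data spans a DST change.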
If there are massive vol spikes at times we don’t really expect, we might want to take a further look at the spot data. I’ve noticed this sometimes, when spot data just goes back and forth between two values for a couple of hours, presumably as a result of a data recording error.
Is vol too low at times for prices?
Vol can also help us identify periods where the price action isn’t moving much. This can help us identify periods where data could be missing, and flag them. For example, was the price not moving because the market was quiet, or because the value just got “filled down” by a data provider?
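One quick way to flag suspiciously flat stretches is to look for long runs of identical consecutive values, which often indicate fill-forwarded or stale quotes rather than a genuinely quiet market. The helper below is a hypothetical sketch, not from the original article; the run-length threshold is something you’d tune per dataset.

```python
import pandas as pd

def flag_flat_runs(prices: pd.Series, min_len: int = 10) -> pd.Series:
    """Flag points belonging to a run of >= min_len identical consecutive values.

    Hypothetical helper: long flat runs in intraday data are often a sign
    of filled-down or stale quotes rather than a quiet market.
    """
    # Label each run of equal consecutive values, then measure run lengths
    run_id = prices.ne(prices.shift()).cumsum()
    run_len = prices.groupby(run_id).transform("size")
    return run_len >= min_len

# Example: 12 repeated values in the middle get flagged
s = pd.Series([1.0, 1.1] + [1.2] * 12 + [1.3, 1.4])
flags = flag_flat_runs(s, min_len=10)
```

Flagged runs can then be cross-checked against known quiet periods (eg. lunchtime in some markets) before deciding whether to drop them.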
Remove price data outside expected trading hours
An obvious thing to do is to remove data outside trading hours. For FX, for example, we’d remove any data recorded on Saturdays. For many markets there should be no trading on holidays, but in FX you still usually get trading even on holidays, the exception being a day such as New Year’s Day. We should be careful about data recorded on less liquid days/holidays in any case. At less liquid times, there could be far fewer market makers submitting quotes, so an anomalous one could end up actually impacting an aggregated feed.
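A first-pass filter for FX along these lines might look like the sketch below, assuming a Sunday 5pm to Friday 5pm New York trading week (the exact cut-off is an assumption; it varies by venue, and holidays would need separate handling):

```python
import pandas as pd

# Hypothetical hourly FX data, indexed in New York time
idx = pd.date_range("2024-01-04", "2024-01-09", freq="1h", tz="America/New_York")
df = pd.DataFrame({"price": range(len(idx))}, index=idx)

dow = df.index.dayofweek   # Monday=0 ... Sunday=6
hour = df.index.hour

# Drop Saturdays entirely, plus Friday after 5pm NY and Sunday before 5pm NY
mask = (
    (dow == 5)
    | ((dow == 4) & (hour >= 17))
    | ((dow == 6) & (hour < 17))
)
df_clean = df[~mask]
```

Working in New York time for the mask (then converting back to UTC for storage) keeps the 5pm cut-off stable across daylight saving changes.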
What about data which isn’t price?
We are likely to need other techniques. Take, for example, news data, which I’ve done a lot of analysis work on. Cleaning the dataset can also be time consuming. If we are using a structured news source, eg. Bloomberg News, the text will be a lot nicer to deal with. If we are web scraping, we need to be careful to remove HTML tags etc. and to extract the body text which we’re interested in (rather than headers/footers).
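As a minimal illustration of the tag-stripping step, the sketch below uses only Python’s standard library parser. In practice you’d more likely reach for a dedicated library such as BeautifulSoup, and isolating the body text from headers/footers usually needs site-specific selectors; this just shows the basic idea.

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script/style blocks.

    Minimal stdlib-only sketch; real scraping pipelines would typically
    use a library like BeautifulSoup plus per-site rules.
    """
    def __init__(self):
        super().__init__()
        self._skip = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside script/style blocks
        if not self._skip and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

html_doc = ("<html><head><style>p{color:red}</style></head>"
            "<body><p>ECB holds rates.</p></body></html>")
p = BodyTextExtractor()
p.feed(html_doc)
```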
Don’t reinvent the wheel
At least once you’ve cleaned your dataset, you won’t need to repeat the exercise in future when looking at the same historical sample – save down the cleaned version so you can reuse it.