First steps when exploring data

It’s exciting travelling to a new place and exploring it. No matter how prepared you might be, it’s always different when you’re actually there. Photos are useful as points of reference, as are travel guides, yet they always miss parts of the travel experience. Revisiting a place after many years adds memories to the mix, along with trying to understand how the place has changed.


Quants are basically explorers too, albeit through the world of data (yes, that is perhaps the cheesiest line I’ve written in a blog before). There are several steps to building a model. The very first step involves identifying a question we seek to answer and a hypothesis, and also trying to allocate a budget to the exercise, in terms of both time and money (whether that is data costs, people etc.), as well as trying to ascertain its priority. No matter where you work, you are likely to have many more questions than time or budget to answer them all!


Our question could be to forecast a particular market variable. It could be to create a trading strategy around a specific idea (eg. trend). Or it might be to explain something, such as identifying different regimes. Below I’ve outlined a few of the initial steps that would be involved in developing a model. Obviously, this is somewhat of an oversimplification, and it can end up being a more complicated process. In particular, you will likely find some iterative element here, where you might need to return to earlier steps.


Collect the datasets – We next need to brainstorm the types of data that could be useful to answer our question. This will require an element of domain knowledge to narrow down the candidates to a workable list. We can get ideas from our own experience, discussing with our team, reading research papers which have addressed similar topics and so on. We can also examine alternative datasets which might be useful. In a fund this will often involve close collaboration between end users (eg. portfolio managers), quants/data scientists and data strategists.


Domain vs data driven approach – Having an understanding of the market can help us identify the datasets most likely to be useful when we begin our search. It can also help us eliminate what we believe might be spurious relationships at a later stage of the model building process, and gives us a more directed approach to selecting features to investigate. By contrast, a more data driven approach would be to include as much data as possible. In areas such as computer vision, techniques such as deep neural networks have been more successful than using hand crafted/intuitive features. If we are using lower frequency financial data, however, we have a lot fewer points, and there is also the difficulty that the data is not stationary (unlike in computer vision). There is also often a need in finance for models to be explainable.


Transforming our dataset – Once we have our datasets in place, we need to consider whether to apply any transformations to create specific variables to include in our model. This can involve applying transformations to make the dataset stationary, for example by computing returns. We also need to be careful to understand the timestamping of our datasets.
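As a rough illustration of the returns transformation mentioned above, here is a minimal Python sketch that turns a series of price levels into log returns. The function name log_returns and the sample prices are hypothetical, made up purely for this example, and log returns are just one common choice (simple percentage returns would be another).

```python
import math


def log_returns(prices):
    """Convert a list of price levels into log returns.

    The result has one fewer element than the input, because the first
    price has no previous observation to compare against.
    """
    if any(p <= 0 for p in prices):
        raise ValueError("log returns need strictly positive prices")
    # Pair each price with the one before it: (p0, p1), (p1, p2), ...
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


# Hypothetical daily closing prices, purely for illustration.
prices = [100.0, 101.0, 99.5, 102.0, 103.5]
returns = log_returns(prices)
```

One handy property of log returns is that they add up over time, so the sum of the daily values equals the log return over the whole period.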


Exploratory data analysis – This involves trying to understand your data on a preliminary basis:


  • understanding the quality of the data (and cleaning it), for example, how often is it updated, how volatile is it, what are the outliers etc.?
  • plotting the data, which can be very important, and choosing the right visualisation
  • quantifying relationships between your x/independent variables and whatever you are trying to estimate or forecast (ie. your y/dependent variable), for example through correlation, and, if we have a relatively small number of variables, checking whether these relationships fit with our domain knowledge/are explainable (obviously if we have a massive number of variables, this approach becomes trickier!)
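A few of the checks in the list above can be sketched in plain Python. This is only a minimal sketch under some assumptions: the helper names (summarise, flag_outliers, pearson), the outlier rule (distance from the median in units of the median absolute deviation, with an arbitrary threshold of 5) and the sample numbers are all hypothetical choices for illustration. In practice you would likely use a data analysis library and plot the data as well.

```python
import math
import statistics


def summarise(values):
    """Basic summary statistics for a first look at a series."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def flag_outliers(values, threshold=5.0):
    """Return indices of points far from the median, in units of the
    median absolute deviation (MAD). The threshold is an arbitrary choice."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values) if abs(v - med) / mad > threshold]


def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical return series; y contains one deliberately bad print.
x = [0.01, -0.02, 0.015, 0.03, -0.01, 0.005]
y = [0.012, -0.018, 0.01, 0.9, -0.008, 0.004]

stats_y = summarise(y)
suspect = flag_outliers(y)  # indices to inspect by hand before cleaning
clean = [(a, b) for i, (a, b) in enumerate(zip(x, y)) if i not in suspect]
corr = pearson([a for a, _ in clean], [b for _, b in clean])
```

Flagged points are only candidates for a closer look. Whether to drop or keep them is a judgement call that depends on what we know about the dataset.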


Once we’ve completed the initial steps above, the later steps will involve creating a model, whether it is a trading model, regression model etc., using variables we have identified from the EDA as being of interest. We can sometimes iterate back to the earlier steps if we think it will be productive. We will likely try a number of different model types across multiple parameters to see how well they fit the data, and we evaluate them using a metric. This metric will vary depending on the nature of our model (eg. for a trading model it might be returns, for a regression model it might be R squared etc.). In all these research steps we need to be careful to separate out our dataset, in particular retaining a test set for final evaluation of our models, to ensure that we haven’t overfit them.
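To make the idea of retaining a test set concrete, here is a minimal Python sketch which fits a one-variable linear regression on the earlier part of a sample and computes R squared on the later part. Splitting in time order (rather than shuffling) is my assumption here, chosen because financial data is ordered in time. The function names, the 20% test fraction and the data are all hypothetical.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split time-ordered rows into (train, test), keeping the most
    recent rows for the test set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    if len(rows) < 2:
        raise ValueError("need at least 2 rows to split")
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[:-n_test], rows[-n_test:]


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    var_x = sum((a - mx) ** 2 for a in x)
    if var_x == 0:
        raise ValueError("x must not be constant")
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / var_x
    return slope, my - slope * mx


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R squared is undefined for constant actual values")
    return 1 - ss_res / ss_tot


# Hypothetical time-ordered observations of (x, y).
x = [float(i) for i in range(10)]
y = [1.1, 2.9, 5.2, 7.1, 8.8, 11.3, 12.9, 15.2, 16.8, 19.1]

train, test = chronological_split(list(zip(x, y)))
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

# Fit only on the training rows, then score on rows the model never saw.
slope, intercept = fit_line(train_x, train_y)
predictions = [slope * v + intercept for v in test_x]
out_of_sample_r2 = r_squared(test_y, predictions)
```

If we keep tweaking the model after looking at the test set score, the test set effectively becomes part of the fitting process, so it is best touched only for the final evaluation.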


Our focus in this article has been on the initial steps of building a model, namely gathering the data and doing an exploratory data analysis. We have noted that some element of domain knowledge is important to help direct our initial search.