Quants like data, in the same way I like burgers. Quants consume data to come up with ways of forecasting the market. The most important dataset is usually market data, but there are all sorts of other interesting datasets that can be used to help forecast those market prices. In recent years, the area of alternative data has exploded, and there are now many datasets available from alternative data vendors. Obviously, a quant wants a dataset which can make them money! But before that stage, the question is: if you're a quant, which datasets are worth investigating for alpha? We need a way of pruning down our massive list of alternative datasets before we go anywhere near a Python script. There simply isn't enough time to investigate every dataset thoroughly for alpha (we can do fairly standardised tests, such as finding the correlation with price action, but these aren't necessarily going to tell us whether there's a specific profitable trading rule we can apply). And for alternative data vendors, what do they need to do to make their datasets usable by quants? Hence, there are many questions we need to ask about a dataset before we even start investigating whether it can provide us with any insight into forecasting markets. Below is a list of questions a quant might want to ask about a dataset they are considering using to find alpha.
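As an aside, the "standardised tests" mentioned above are often quick sanity checks rather than full strategies. Below is a minimal sketch of one such check, assuming we already have a daily alternative-data series and a daily price series for one asset; the column names and data source are hypothetical.

```python
import pandas as pd


def lagged_correlation(signal: pd.Series, prices: pd.Series, max_lag_days: int = 5) -> pd.Series:
    """Correlate a daily signal with forward returns over 1..max_lag_days days."""
    corrs = {}
    for lag in range(1, max_lag_days + 1):
        # forward return over the next `lag` days, aligned back to the signal's date
        fwd_ret = prices.pct_change(lag).shift(-lag)
        corrs[lag] = signal.corr(fwd_ret)
    return pd.Series(corrs, name="correlation_vs_forward_return")


# Usage (assuming a DataFrame `df` with daily 'signal' and 'price' columns):
# print(lagged_correlation(df['signal'], df['price']))
```

A non-zero correlation here is suggestive at best; as noted above, it does not tell us whether a specific, tradable rule exists.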
Historical data for a dataset
One of the most important considerations is the length of the dataset's history. If we want our dataset to forecast prices on a daily basis, and we only have a month of data, it isn't really going to be that useful. One month is simply not sufficient to gauge the robustness of a dataset. Ideally, we'd like several years of data to do a historical backtest. There is a balance, however: if the dataset does turn out to be valuable, do we really want to wait ten years until there's enough history to use it? This is tricky!
Timeliness and point-in-time
When is the data available? If it arrives with a very large lag, say many weeks, it's likely to be more difficult to trade off it. Furthermore, is the dataset properly timestamped, so we know the time it was released to users, not just when it was collected? This is crucial for trading purposes. The historical dataset should also not be continually revised after the fact, otherwise it is difficult to have any faith in a backtest. To state the obvious: we cannot trade on data which is released in the future (or changed in the future).
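To make the point-in-time idea concrete, here is a minimal sketch of filtering a dataset by its release timestamp rather than its reference (collection) date; the schema, with 'reference_date', 'release_time' and 'value' columns, is hypothetical.

```python
import pandas as pd


def observations_as_of(data: pd.DataFrame, decision_time: pd.Timestamp) -> pd.DataFrame:
    """Keep only the observations whose release_time is on or before the decision time."""
    return data[data["release_time"] <= decision_time]


# In a backtest loop, we would call this at each decision time, so that a figure
# collected in January but only published in March is usable from March onwards.
```

If the vendor only supplies reference dates, or silently restates history, this kind of filter cannot be built reliably, which is exactly why the backtest loses credibility.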
How many people use it?
If the capacity of strategies based on a dataset is limited, we might find it advantageous if fewer people use it (if everyone is trying to squeeze through a very tiny door, the results aren't good). Some data vendors will therefore limit the number of clients who can subscribe to specific datasets, so there isn't a lot of alpha decay. I would argue, though, that having very few people using a dataset is not that important if there are many strategies we can run on it. After all, there are strategies which use common datasets, and they are still profitable. There is also the question of whether a dataset should be fully exclusive. Over time, what was once a very unusual alternative dataset can often become mainstream. We should note that just because a dataset is "unusual" doesn't mean it will deliver alpha (even if it sounds "cool")!
What is the specific value provided by the data vendor?
There are several ways a data vendor can provide value for a specific dataset. The first is the raw data itself. Some data might be relatively commoditised, so we can get it from multiple vendors (which obviously drives the cost lower). In other cases, however, very few vendors offer that data, in which case the raw data can be valuable even without any additional structuring. The second way a data vendor can add value is by cleaning and structuring a dataset. No quant likes to spend ages cleaning a dataset, so if a vendor can do a lot of this job, quants will be grateful. Data vendors can also structure the data into more easily accessible forms, which make it easier for a quant to process. You could argue that some large quant funds might prefer to do all their own structuring, but this obviously requires a decent amount of resources.
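For illustration, the "structuring" step often means collapsing raw, event-level records into a clean time series per entity. Here is a minimal sketch of that, assuming a hypothetical schema with 'timestamp', 'ticker' and 'score' columns.

```python
import pandas as pd


def structure_raw_events(raw: pd.DataFrame) -> pd.DataFrame:
    """Aggregate event-level scores into one observation per ticker per day."""
    raw = raw.copy()
    raw["date"] = pd.to_datetime(raw["timestamp"]).dt.normalize()
    # average the scores for each ticker on each day, then pivot tickers into columns
    daily = (raw.groupby(["date", "ticker"])["score"]
                .mean()
                .unstack("ticker")
                .sort_index())
    return daily
```

Whether the vendor or the quant does this work is largely a question of resources, as noted above.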
How was the data collected? Is it anonymised?
This is a key question. Given data privacy issues, funds should not be given data which, in its current form, contains protected personal information. A fund wants to aggregate signals from the data, so anonymised data is fine. However, does the data vendor actually do a good job of making sure the data is properly anonymised before giving it to the fund or trading firm that uses it?
Has the data vendor provided us with clues of how to use this dataset with markets?
Some quants might disagree with this. However, if a data vendor has done research to show how their data can be used to trade markets, it at least provides some insights we can use as a starting point. We don't necessarily need to do exactly what they suggest, but I would argue that it adds to my confidence that there is something useful in a dataset. (And obviously, if you are a data vendor, you can commission Cuemacro to write such a research paper on your dataset, to show how it can be used to trade markets.)
What is the price of the alternative dataset?
OK, this is an obvious point, but the price of an alternative dataset is related to its value (and also, to some extent, to how widespread it is). If it's a very expensive dataset, then it will need to provide *a lot* of additional alpha to justify a purchase. Obviously, every vendor thinks they have a very valuable dataset; not all quants would agree with that notion, and it requires extensive testing to see whether the value is justified.
Only now, can we test it!
It's only once we've answered the various questions above (and many more) that we might want to actually backtest signals on a dataset, to see whether it can help us forecast markets. In particular, our task is made easier if we can come up with intuitive ideas to test, which is easier with some datasets than others. Having a rationale helps to reduce the chance of data mining too much. Some datasets might simply be too large and complex for us to come up with a specific rationale, in which case we might want to use machine learning to infer relationships. The downside of this is that our signal can become a bit of a black box. Alternatively, we can try a mix: using machine learning to structure the data into something simpler, and then applying more traditional techniques. This is the approach which can be used with something like news data.
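To give a flavour of this final step, here is a minimal sketch of backtesting a simple, intuitive rule on a structured dataset, for example going long when an aggregated sentiment score is positive and staying flat otherwise. The inputs (daily 'sentiment' and 'price' series) are hypothetical, and the sketch ignores transaction costs entirely; it is only meant to illustrate the workflow, not a recommended strategy.

```python
import pandas as pd


def backtest_sign_rule(sentiment: pd.Series, prices: pd.Series) -> pd.Series:
    """Cumulative return of a long/flat rule driven by the sign of a sentiment score."""
    returns = prices.pct_change()
    # lag the signal by one day, so today's position only uses yesterday's sentiment
    position = (sentiment > 0).astype(float).shift(1).fillna(0.0)
    strategy_returns = (position * returns).fillna(0.0)
    return (1.0 + strategy_returns).cumprod() - 1.0
```

Note the one-day lag on the signal: without it, we would be trading on data before it exists, which takes us straight back to the point-in-time discussion above.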
All that remains is for me to say, best of luck in your search for alpha in alternative datasets (and have a good burger whilst you’re on your search too!)