The one thing we really want to know when making trading decisions is the future! Alas, we don’t have that information. All we have is information up to the point at which we actually want to trade, and we somehow have to use it to make a prediction about the future price. OK, this is blatantly obvious stuff, as we all know. The difficulty, however, is that when we backtest a trading strategy on a historical dataset, data about the “future” (i.e. points in the past that come after our trading points) can sometimes slip into the backtest by mistake. It is important that our dataset is point-in-time consistent, so that it’s not polluted by the future. Otherwise, we are likely to introduce hindsight bias into our backtesting process. We might end up trading something which looks good on paper but fares poorly in real-life trading.
Let’s start with an example using price data. Say we are backtesting a daily trading strategy. Our data vendor gives us daily snapshots for FX, and we are generating our trading signal from that same price data. We are assuming that we can instantly generate a signal at the close and trade at the same time. In practice, there is likely to be a small lag between generating the signal and actually sending an order. We could argue that this is not a big deal (at least we would hope that the alpha of a daily or longer-term signal is not affected by such a small gap). Now, say we are running a strategy on high frequency data, where the alpha decays very quickly. Here the assumption of instant execution might be more problematic, depending on the strategy. Again, there will be a small difference between the timestamp that the data vendor records and the later time at which we receive that data and are able to act on it. We can of course do some sort of sensitivity analysis to understand the relationship between this time lag and the historical returns of a strategy.
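To make the idea of a lag sensitivity analysis concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function name `average_return_with_lag`, the toy one-bar momentum signal and the made-up price series are not from any real strategy. The point is simply to re-run the same backtest while delaying execution by a number of bars and to see how the returns change.

```python
def average_return_with_lag(prices, lag):
    """Average per-trade return of a toy momentum signal executed `lag` bars late.

    Hypothetical helper for illustration only. The signal at bar t uses
    nothing later than the price at bar t, so no future data leaks in.
    """
    # rets[i] is the simple return from bar i to bar i + 1.
    rets = [prices[i + 1] / prices[i] - 1.0 for i in range(len(prices) - 1)]
    total, trades = 0.0, 0
    for t in range(1, len(prices)):
        # Signal known at bar t: go long if the last bar was up, else short.
        signal = 1.0 if rets[t - 1] > 0 else -1.0
        entry = t + lag  # bar at which the order is actually filled
        if entry < len(rets):
            total += signal * rets[entry]  # hold from bar entry to entry + 1
            trades += 1
    return total / trades if trades else 0.0


# Made-up prices: a short uptrend followed by a downtrend.
toy_prices = [100.0, 101.0, 102.0, 103.0, 102.0, 101.0, 100.0]

for lag in (0, 1, 2):
    print(lag, average_return_with_lag(toy_prices, lag))
```

On this toy series the average return is positive with no lag and negative with a two-bar lag, which is the kind of decay we would want to know about before trusting an instant-execution assumption.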
More subtle problems can occur when we use other datasets to generate the signal for trading our asset (for which we still need price data). A particular problem here is that the timestamps of those other datasets might not be point-in-time. Take, for example, economic data such as non-farm payrolls. If we have data for January 2018, we might see a timestamp like 31 Jan 2018. However, we can’t trade on this data until it’s released, which is generally on the first Friday of the next month (e.g. 2 Feb 2018), and that will only be the first estimate of the figure. The release date/time is crucial for trading purposes, and we need to have this timestamp. If we look at the dataset several months later, it is likely that the value stamped “31 Jan 2018” has since been revised. This is something we need to be very careful about with all economic data. In fact, it applies to pretty much any dataset. For web-sourced data, when was the web page actually parsed? For news, when was the article first written, and is this an updated version of it?
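The release-timestamp point can be sketched in a few lines. The `Release` record, the `value_as_of` function and all of the numbers below are hypothetical, chosen only to mirror the payrolls example above: each value carries both the date it refers to and the date it became known, and a backtest only ever asks what was known as of a given date.

```python
from datetime import date
from typing import NamedTuple, Optional


class Release(NamedTuple):
    reference_date: date  # period the figure describes
    release_date: date    # when this figure (or revision) became public
    value: float


def value_as_of(releases, reference_date, as_of) -> Optional[float]:
    """Latest value for `reference_date` that was public on `as_of`, else None."""
    known = [
        r for r in releases
        if r.reference_date == reference_date and r.release_date <= as_of
    ]
    if not known:
        return None  # nothing had been released yet, so nothing to trade on
    return max(known, key=lambda r: r.release_date).value


# Made-up values; the revision date is also an assumption for illustration.
releases = [
    Release(date(2018, 1, 31), date(2018, 2, 2), 100.0),  # first estimate
    Release(date(2018, 1, 31), date(2018, 3, 9), 120.0),  # later revision
]

print(value_as_of(releases, date(2018, 1, 31), date(2018, 1, 31)))  # before release
print(value_as_of(releases, date(2018, 1, 31), date(2018, 2, 2)))   # first estimate
print(value_as_of(releases, date(2018, 1, 31), date(2018, 6, 1)))   # revised value
```

A dataset that stores only the reference date and the latest revised value cannot answer the first two queries correctly, which is exactly how hindsight bias creeps in.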
For data vendors, having timestamps for when data is snapped is crucial, not just timestamps for the period it relates to, particularly when there is a long lag between the two. Without these additional timestamps, we can end up unwittingly introducing hindsight bias into our backtesting process, which makes the data less valuable from a trading perspective.