Whenever we eat something, understanding the ingredients that have gone into a dish is important. Is it organic? Is it fresh or frozen? Have the fruit and vegetables been grown in a sustainable way? What is the quality of the produce? There are many questions we might want to ask.
When it comes to financial markets, our raw ingredient is data, and very similar questions come to mind. In The Book of Alternative Data, Alexander Denev and I have a checklist of the various questions you might ask about a dataset if you’re purchasing it from a vendor (or trying to create your own dataset). Indeed, many buy side firms have their own checklists for data, and these can vary between firms.
Many of these questions are technical, such as asking about the quality of the data, the types of assets for which it is applicable, the costs, etc. However, just as many questions relate to the legal aspects of the data. What might these legal questions be? They are important both for creators and sellers of alternative datasets and for those who are purchasing them.
First off, what are the sources of raw data, and does the data vendor have licences for these? Some vendors may be somewhat reticent to say precisely which raw datasets they are using in particular models, given that the precise composition of the models they use to create datasets is likely to be proprietary. However, data vendors still need to give some insight into the general types of data they are using and, in particular, the general methods of collection. If a data vendor refuses to answer any questions about the origins of the raw data in their model, it can be impossible for a client to ascertain whether the raw data is compliant.
For example, with web scraping, are the data vendors only scraping public webpages, and avoiding those which require a login or sit behind a paywall? There have been a number of legal cases recently around web scraping, most notably the case between LinkedIn and hiQ, where LinkedIn said that hiQ did not have the right to scrape public LinkedIn pages. At least so far, most of the court decisions have gone in favour of hiQ. If the raw data is purchased from a third party, what does the licence entail? In many cases, licences might prohibit the redistribution of the raw data itself.
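As a rough illustration of the kind of check a scraper might run before fetching a page, here is a minimal Python sketch that tests URLs against a site's robots.txt rules using the standard library. The robots.txt content, the `example.com` URLs and the `my-bot` user agent are all made up for the example. Note that robots.txt is only a convention that site owners use to signal which paths they want crawlers to avoid; passing this check says nothing definitive about whether scraping a page is legally compliant, and it does not replace reading a site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Disallow: /premium/
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A public page is allowed, a path under /premium/ is not.
public_ok = allowed_by_robots(ROBOTS_TXT, "my-bot", "https://example.com/public/page")
premium_ok = allowed_by_robots(ROBOTS_TXT, "my-bot", "https://example.com/premium/report")
```

In practice you would load the live robots.txt with `set_url()` and `read()` rather than a hard-coded string, and log the result alongside each scraped page so that the collection method can be evidenced later.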
When the data itself is not public and contains PII (personally identifiable information), has the data vendor secured consent from individuals to collect it? Furthermore, has the dataset been sufficiently blurred so that personal details cannot easily be reverse engineered by joining it with other datasets? Recently, there was an SEC case against an alternative data provider which did not aggregate data using a statistical model, despite claiming to do so (Matt Levine at Bloomberg explains the case between the SEC and App Annie far more clearly than I can, and I strongly recommend reading his piece).
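One simple way to get a feel for whether a dataset has been blurred enough is a k-anonymity style check: count how many records share each combination of "quasi-identifiers" (fields that are not names, but could be joined with other datasets to single someone out). The sketch below is a minimal Python illustration of that idea, not a description of what any particular vendor or regulator does. The field names (`zip3`, `age_band`, `spend`), the toy records and the threshold `k=3` are all assumptions for the example.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k):
    """Return the quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Toy records with hypothetical fields, for illustration only.
records = [
    {"zip3": "100", "age_band": "30-39", "spend": 120.0},
    {"zip3": "100", "age_band": "30-39", "spend": 80.5},
    {"zip3": "100", "age_band": "30-39", "spend": 42.0},
    {"zip3": "941", "age_band": "60-69", "spend": 310.0},
]

# Combinations that appear fewer than 3 times are candidates for re-identification.
risky = small_groups(records, ["zip3", "age_band"], k=3)
```

Here the single record in the `("941", "60-69")` group would be flagged, suggesting that the data needs coarser buckets or more aggregation before it is distributed. A check like this is only a starting point, since it does not cover every way of re-identifying individuals.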
I am not a lawyer, and this is a rapidly evolving area of the law. Hence, I strongly recommend consulting a lawyer if you want to know more about the legal questions associated with alternative data. Far too often, people are focused on whether data is “useful” in their process. However, even before we ingest a dataset, we need to check whether it has been collected in a compliant way.