I end up tweeting a lot. Possibly far too much of what I tweet is random, about burgers and so on, albeit with a modicum of tweets about markets and Python. Twitter inevitably acts like a sponge, absorbing your attention, which can often be a bad thing, but it can actually also be a good thing. A lot of what I have learnt about markets and coding in recent years has been a result of seeing tweets by smart people I follow. It can often be the case that a short tweet I read on Twitter really makes me think about a topic more broadly. I follow Thomas Wiecki (@twiecki) on Twitter, who is the VP of Data Science at Quantopian, the crowdsourced hedge fund. He recently tweeted the following:
“Seems like companies start to realize hiring data scientists without data engineers does not work. Thus, focus is rightly being placed on data engineering first. Once data pipelines are properly built out, it will get even more interesting for data scientists. HT @Springcoil”
This was definitely one of those tweets which got me thinking! Wherever you turn, it’s always data science which is fashionable. Python has become very popular as a tool for solving data science problems. It has lots of great libraries, such as pandas for working with time series, and machine learning tools such as scikit-learn and TensorFlow. However, what you tend to hear much less about is data engineering, preparing the all-important data pipeline, and software engineering more broadly, which are just as important as data science (perhaps even more so). No data means no data science!
Let’s take, for example, something which I’ve been working on for the past 18 months: developing a transaction cost analysis library in Python for currency markets. We are essentially trying to solve a data science problem, where we have as inputs our own trade data and market data. From those datasets we want to make inferences about the costs of our trades. The very simplest metric is slippage, the difference between our executed price and a benchmark market price (there is a small sketch of this calculation after the list below). So maybe we just need a simple Python script to do this? Obviously, some metrics can become more involved, but this one looks mathematically simple. If our datasets are very small and nicely regular, it seems like an easy problem to solve. In practice, solving what had originally seemed a very simple problem can get complicated very quickly (as I learnt!), for reasons including the fact that:
- tick data is very high frequency and irregularly spaced
- trade data is also irregularly spaced
- timestamps may not be unique
- slippage is not the only metric we’ll want to calculate
- we are likely to have multiple ways to specify benchmark price (mid price, TWAP, VWAP, arrival price etc.)
- different users will want to access the TCA computation in different ways (maybe via a GUI, Excel, API etc.)
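To give a flavour of the first few points, here is a minimal sketch (purely illustrative, not the actual library code) of how irregularly spaced trade data can be joined to irregularly spaced tick data with pandas, and slippage computed against a mid price benchmark. The column names and the sign convention (positive = cost) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical trade data: our own executions, irregularly timestamped
trades = pd.DataFrame({
    'date': pd.to_datetime(['2019-05-01 10:00:00.350', '2019-05-01 10:03:12.020']),
    'side': [1, -1],                         # 1 = buy, -1 = sell
    'executed_price': [1.1207, 1.1213],
})

# Hypothetical tick data: much higher frequency, also irregularly spaced
ticks = pd.DataFrame({
    'date': pd.to_datetime(['2019-05-01 10:00:00.100', '2019-05-01 10:00:00.400',
                            '2019-05-01 10:03:11.900', '2019-05-01 10:03:12.500']),
    'bid': [1.1205, 1.1206, 1.1211, 1.1212],
    'ask': [1.1207, 1.1208, 1.1213, 1.1214],
})
ticks['mid'] = (ticks['bid'] + ticks['ask']) / 2.0

# merge_asof picks, for each trade, the most recent tick at or before it
# (both frames must be sorted by the merge key)
joined = pd.merge_asof(trades.sort_values('date'), ticks.sort_values('date'), on='date')

# Slippage: signed difference between executed price and the mid benchmark
# (with this convention, positive values represent a cost to us)
joined['slippage'] = joined['side'] * (joined['executed_price'] - joined['mid'])
print(joined[['date', 'executed_price', 'mid', 'slippage']])
```

Even this toy version hides decisions we have to make: which tick to match against, what to do when timestamps collide, and which benchmark to use.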
Suddenly, we need to think about how we store very large amounts of data. Furthermore, the sheer volume of market data means that the whole process can become very slow, not only when calculating our metrics, but also when loading the data. We need to think of ways to speed this up, understanding how we can cache data in memory and also how we can distribute the computation. Potentially, we need to think about deploying our software to the cloud, if we want it to scale.
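As one small example of the caching idea, below is a rough sketch of keeping recently loaded tick data in memory so repeated requests do not hit the disk every time. The file layout and function names are assumptions, not the actual API of any library.

```python
from functools import lru_cache
import pandas as pd

def _load_ticks_from_disk(ticker: str, date: str) -> pd.DataFrame:
    # Hypothetical on-disk layout: one Parquet file per ticker per day
    return pd.read_parquet(f'ticks/{ticker}/{date}.parquet')

@lru_cache(maxsize=32)
def _load_ticks_cached(ticker: str, date: str) -> pd.DataFrame:
    # First call for a (ticker, date) pair reads from disk; subsequent
    # calls with the same arguments are served straight from memory
    return _load_ticks_from_disk(ticker, date)

def load_ticks(ticker: str, date: str) -> pd.DataFrame:
    # Hand back a copy so callers cannot accidentally mutate the cached frame
    return _load_ticks_cached(ticker, date).copy()

# Usage (second call is served from the in-memory cache):
# df = load_ticks('EURUSD', '2019-05-01')
```

In practice you would likely reach for something more robust, such as a shared cache (e.g. Redis) or a proper tick database, but the principle is the same: avoid paying the data loading cost repeatedly.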
We also need our code to be flexible, so we are not restricted to purely calculating slippage: it should be easy to add new metrics and benchmarks. How we present the data to users also needs to be flexible; creating a web GUI is very different from the back-end computation.
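One possible shape for that flexibility (again, purely illustrative, not the library’s actual classes) is to give every metric the same small interface, so adding a new one does not touch the data loading, the benchmarks or the GUI layer.

```python
from abc import ABC, abstractmethod
import pandas as pd

class Metric(ABC):
    @abstractmethod
    def calculate(self, trades: pd.DataFrame) -> pd.Series:
        """Return one value per trade, given trades already joined with benchmark prices."""

class SlippageMetric(Metric):
    def calculate(self, trades: pd.DataFrame) -> pd.Series:
        # Assumes 'side', 'executed_price' and a 'mid' benchmark column exist
        return trades['side'] * (trades['executed_price'] - trades['mid'])

def run_metrics(trades: pd.DataFrame, metrics: dict) -> pd.DataFrame:
    # Apply each metric and attach its output as a new column
    out = trades.copy()
    for name, metric in metrics.items():
        out[name] = metric.calculate(out)
    return out

# Usage: run_metrics(joined_trades, {'slippage': SlippageMetric()})
```

The point is less about this particular design and more that separating the metric definitions from the plumbing is an engineering decision, not a data science one.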
What was originally a data science problem has basically morphed into a problem that requires data engineering and careful software engineering to ensure our code is modular and easy to maintain (and no, CTRL-C and CTRL-V do not constitute code reuse). Engineering skills are very different from data science skills. At least in my TCA example, the engineering part of the project was extremely time consuming, probably more so than any of the data science bits.
So next time you try to solve a data science problem, remember that ultimately getting a robust solution which will also scale requires significant time and effort in terms of software engineering. Software engineering should not be reduced to a last-minute consideration in data science, but should be an ever-present part of the solution.