What’s the most important thing we need in order to follow financial markets? Burgers… ok, I made that up, the real answer is market data! Whilst there are many other datasets, including those drawn from alternative data, which are becoming increasingly important, having a handle on asset prices is still key. When it comes to daily data, storage is pretty easy. We can use a SQL database, or even just flat files such as CSV or Parquet.
However, when you get into the realm of high frequency tick data, we could still use Parquet files, but another solution is to use a time series database. Of these, perhaps the best known is kdb+/q. It is very fast and I would guess that most sell side eTrading desks use kdb+/q (and those that don’t probably want to buy a licence). Aside from that, there are also open source alternatives like InfluxDB, Arctic (which sits on MongoDB) and ClickHouse. There’s also Shakti, a new time series database by Arthur Whitney, who was also the creator of kdb+/q, as well as Timescale, which sits on top of Postgres. In tcapy, my open source library for transaction cost analysis, I’ve written wrappers that allow it to read/write tick data to kdb+/q, InfluxDB and Arctic (and hopefully I’ll add more). I’m also thinking of adding wrappers for ClickHouse and Timestream in tcapy, if there’s enough interest from folks.
AWS Timestream – serverless time series database
So what’s AWS Timestream? It was announced a long time ago, but it was finally released publicly a few weeks ago. It’s serverless and fully managed, and it’s designed to scale up and down with your requirements. You only pay for the data you store and the queries you run. It should be noted that other databases are also available on a managed basis, such as MongoDB via their Atlas service, which I’ve used in the past and which is supported by tcapy (you just need to type in the Atlas URL). kdb+/q is also available on AWS on a managed basis through the AWS Marketplace.
Timestream, AWS’s new time series database, is designed for recording time series generated by IoT devices. It also has an SQL-like language for querying it. However, given that financial data is also largely time series, I thought it would be worth coding up a simple example to show it can be used to store and query financial market tick data, instead of AWS’s IoT use case. I’ve written up a full Jupyter notebook with Python code here for this example, but I’ll try to summarise the notebook (excluding the code) and talk more broadly about Timestream here.
Memory for newer data, magnetic storage for older data
You can store data in memory and on magnetic storage in Timestream. Typically, you’d specify the most recent day or so of data to be kept in the memory store (which is optimised for rapid point-in-time queries) and older data to be put on magnetic storage (optimised for fast analytical queries). To the user, however, this all appears as one continuous data store when you query it. Hence, you don’t need to manage the movement of data from memory to magnetic storage. SSD storage is also coming soon.
From the perspective of a quant, you’d likely use a long data history for backtesting a trading strategy, whilst the shorter term data might be used for supporting live trading.
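To make that concrete, here’s a minimal sketch (not from my notebook) of how the memory and magnetic retention periods might be configured when creating a table with boto3. The database/table names and retention values are just made-up illustrations.

```python
import boto3

# Hypothetical names purely for illustration
DATABASE_NAME = "market_data"
TABLE_NAME = "fx_tick"

# "timestream-write" is the boto3 service name for ingestion/table management
write_client = boto3.client("timestream-write", region_name="eu-west-1")

write_client.create_database(DatabaseName=DATABASE_NAME)

# Keep the most recent 24 hours in the memory store (fast point-in-time queries),
# then roll older data onto the magnetic store for ~1 year (cheaper, analytical queries)
write_client.create_table(
    DatabaseName=DATABASE_NAME,
    TableName=TABLE_NAME,
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 24,
        "MagneticStoreRetentionPeriodInDays": 365,
    },
)
```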
How much does it cost?
AWS charges you for usage of Timestream (current pricing here), rather than a flat licence fee or an hourly charge for renting a server. At the time of writing (October 2020), if you’re using the AWS Europe (Ireland) region, the pricing breaks down as follows:
- 1 million writes of 1KB size costs $0.5654
- Memory store costs $0.0407 per hour per GB
- Magnetic store costs $0.0339 per month per GB (cheaper than memory!)
- Queries cost $0.011308 per GB of data scanned
Cost on a simple example…
Let’s do a simple example to illustrate the costs. For simplicity, I’ve assumed units of 1000 for GB to TB conversions etc., rather than 1024. Let’s say you write 1 GB of data a day, in 1KB chunks. Writing that day to Timestream would cost $0.5654, so $142.48 per year (assuming roughly 252 working days). Keeping a day’s worth of this in memory would cost $0.0407 per hour, which over a year is $356.53 (including weekends). If we assume we have roughly 1 TB in magnetic storage, that would cost $0.0339 per GB per month, so $406.80 per year. Obviously, this is a bit of a simplification, because the amount of data we hold would be continually increasing.
If we want to do a backtest on 1 TB of data, given that scanning 1 GB in a query costs $0.011308, that backtest would cost $11.31. If we assume we do 50 such queries in a year, that comes to $565.40. If we also assume that we scan through 1 TB of data in queries every working day, that’s another $2849.62 per year.
If we add that all up, it’s $4320.83 per year for this specific use case. Obviously this use case is going to be different to yours (and perhaps mine isn’t that realistic?), so you’d need to calculate the cost yourself, based on your expected usage, and compare it to other cloud solutions/licence costs. If you were to run things locally, you’d have to factor in costs like hardware, support and so on. There’s also quite a big convenience element in having a managed service like Timestream monitored and run by someone else!
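If you want to play with the numbers, here’s a back-of-the-envelope reproduction of the above in a few lines of Python, using the same assumptions (1000-based units, roughly 252 working days a year, Europe (Ireland) pricing as of October 2020).

```python
# Back-of-the-envelope reproduction of the cost example above
WRITE_PER_MILLION_1KB = 0.5654   # $ per 1 million writes of 1KB
MEMORY_PER_GB_HOUR = 0.0407      # $ per GB per hour in the memory store
MAGNETIC_PER_GB_MONTH = 0.0339   # $ per GB per month on magnetic storage
QUERY_PER_GB_SCANNED = 0.011308  # $ per GB of data scanned by queries

working_days = 252

writes = WRITE_PER_MILLION_1KB * working_days             # 1 GB/day written in 1KB chunks
memory = MEMORY_PER_GB_HOUR * 24 * 365                    # 1 GB held in memory all year
magnetic = MAGNETIC_PER_GB_MONTH * 1000 * 12              # ~1 TB on magnetic storage
backtests = QUERY_PER_GB_SCANNED * 1000 * 50              # 50 backtests scanning 1 TB each
daily_scans = QUERY_PER_GB_SCANNED * 1000 * working_days  # 1 TB scanned per working day

total = writes + memory + magnetic + backtests + daily_scans
print(f"${total:.2f} per year")  # roughly $4320.83
```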
I would love to hear any feedback about costs and how it might compare to other solutions!
Record, Measure and Dimensions
As with other databases, you have the concept of a Database and a Table. A Record consists of a single data point: it has a Time associated with it and a Measure, which is like a database field with a value and a type. We could, for example, have a Measure to record the mid price. For attributes which don’t change much, we have Dimensions, which are name/value pairs, such as the ticker name.
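To illustrate, this is roughly what a single tick could look like as a Record in the boto3 write format. The ticker and venue values here are just hypothetical examples, not anything specific from my notebook.

```python
# A single tick as a Timestream Record: the mid price is the Measure, while
# attributes that rarely change (ticker, venue) go into Dimensions.
# Time is an epoch timestamp passed as a string.
record = {
    "Dimensions": [
        {"Name": "ticker", "Value": "EURUSD"},     # hypothetical ticker
        {"Name": "venue", "Value": "dukascopy"},
    ],
    "MeasureName": "mid",
    "MeasureValue": "1.1842",
    "MeasureValueType": "DOUBLE",
    "Time": "1602806400000",
    "TimeUnit": "MILLISECONDS",
}
```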
Recording old times
One thing to note is that you can’t write a Record with a Time outside of the memory retention period (or indeed in the future). What if you wanted to record older historical points though? One way to do this would be to keep the Time field for the time at which the data was written to Timestream, and have another concept, venue time, for the time at which the point was recorded at the exchange. Typically, tick data will have multiple timestamps like this anyway, and it also enables us to measure the latency between the exchange/venue and our own systems.
A trick to enable multiple fields
You can’t have multiple Measure fields in the same record (see here). However, in my example I worked out a simple way around this in practice. Each Record can also carry a list of Dimensions. You could write multiple Measure fields, e.g. bid, ask, mid etc., as separate Records that all point to the same venue time stored in a Dimension. When querying the data you can then stitch the various Measure fields back into a single row for each venue time. At this stage, I’m not sure about the performance implications of doing all this though.
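As a rough sketch of what I mean (with made-up field names rather than the exact ones from my notebook), the bid/ask/mid could be written as separate Records sharing a venue_time Dimension, and then pivoted back together with a query along these lines.

```python
# Bid/ask/mid are written as separate Records, all carrying the same
# (hypothetical) venue_time Dimension so they can be stitched back together later
common = {
    "Dimensions": [
        {"Name": "ticker", "Value": "EURUSD"},
        {"Name": "venue_time", "Value": "1602806400123"},  # time recorded at the venue
    ],
    "MeasureValueType": "DOUBLE",
    "Time": "1602806400456",  # time the point was written to Timestream
    "TimeUnit": "MILLISECONDS",
}

records = [
    dict(common, MeasureName="bid", MeasureValue="1.1841"),
    dict(common, MeasureName="ask", MeasureValue="1.1843"),
    dict(common, MeasureName="mid", MeasureValue="1.1842"),
]

# One way to stitch the fields back into a single row per venue_time when querying
# (Timestream exposes the measure value as a typed column, e.g. measure_value::double)
PIVOT_QUERY = """
SELECT venue_time,
       MAX(CASE WHEN measure_name = 'bid' THEN measure_value::double END) AS bid,
       MAX(CASE WHEN measure_name = 'ask' THEN measure_value::double END) AS ask,
       MAX(CASE WHEN measure_name = 'mid' THEN measure_value::double END) AS mid
FROM "market_data"."fx_tick"
WHERE ticker = 'EURUSD'
GROUP BY venue_time
"""
```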
Python Timestream example with financial market tick data
In my Python Jupyter notebook example (view it here on my GitHub teaching page), which is based heavily on an AWS GitHub example, I downloaded tick data from Dukascopy using my findatapy library. I then showed how we can use AWS services from our own machine, by first installing the AWS CLI and setting up the credentials.
I then showed how to use boto3, AWS’s SDK for Python, to create a session, and also to create a Timestream database and table. boto3 makes it fairly easy to interact with Timestream from Python. One of the main complications was converting a Pandas DataFrame into the records/dictionary-like format boto3 expects. If I had more time, I’d spend it on optimising this conversion code, which was pretty slow owing to a for loop.
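For anyone curious, a simplified version of this kind of conversion might look something like the following. The column names (ticker, mid) and the database/table names are assumptions for illustration, and the batching reflects the cap on Records per write_records call at the time of writing.

```python
import boto3
import pandas as pd

write_client = boto3.client("timestream-write", region_name="eu-west-1")

def dataframe_to_records(df: pd.DataFrame) -> list:
    """Convert a tick DataFrame (DatetimeIndex plus 'ticker' and 'mid' columns
    assumed) into the Records format that write_records expects."""
    records = []

    # A plain for loop like this is simple but slow for large DataFrames
    for timestamp, row in df.iterrows():
        records.append({
            "Dimensions": [{"Name": "ticker", "Value": str(row["ticker"])}],
            "MeasureName": "mid",
            "MeasureValue": str(row["mid"]),
            "MeasureValueType": "DOUBLE",
            "Time": str(timestamp.value // 10**6),  # ns epoch -> ms epoch
            "TimeUnit": "MILLISECONDS",
        })

    return records

def write_in_batches(records: list, batch_size: int = 100):
    """write_records only accepts a limited number of Records per call
    (100 at the time of writing), so send the list in chunks."""
    for i in range(0, len(records), batch_size):
        write_client.write_records(
            DatabaseName="market_data",  # hypothetical names, as before
            TableName="fx_tick",
            Records=records[i:i + batch_size],
        )
```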
In my Jupyter notebook, I showed how to dump these records into Timestream and later how to query them back into Python. One thing I would say is that it was still a bit trickier than doing the same thing with other tools like InfluxDB, and it would likely have been easier if you could specify multiple Measure fields in the same record. It could be that I missed some additional helper libraries that make the process easier (and if there aren’t any, I’m sure AWS will likely write them). I did find the AWS Data Wrangler library, specifically for using Pandas with AWS, which I didn’t have time to try.
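For the read side, a minimal sketch of querying Timestream back into a DataFrame with boto3 might look like this. Again, the database/table/ticker names are just placeholders.

```python
import boto3
import pandas as pd

query_client = boto3.client("timestream-query", region_name="eu-west-1")

def query_to_dataframe(query_string: str) -> pd.DataFrame:
    """Run a Timestream query, following NextToken pagination, and flatten
    the rows into a Pandas DataFrame of strings."""
    columns, rows, next_token = None, [], None

    while True:
        kwargs = {"QueryString": query_string}
        if next_token:
            kwargs["NextToken"] = next_token

        response = query_client.query(**kwargs)

        if columns is None:
            columns = [col["Name"] for col in response["ColumnInfo"]]

        for row in response["Rows"]:
            # Scalar values come back under 'ScalarValue' (as strings)
            rows.append([datum.get("ScalarValue") for datum in row["Data"]])

        next_token = response.get("NextToken")
        if not next_token:
            break

    return pd.DataFrame(rows, columns=columns)

# e.g. pull back the last day of (hypothetical) EURUSD ticks
df = query_to_dataframe(
    'SELECT time, measure_name, measure_value::double '
    'FROM "market_data"."fx_tick" '
    "WHERE ticker = 'EURUSD' AND time BETWEEN ago(1d) AND now() "
    'ORDER BY time'
)
```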
Conclusion and next steps
I only tried using AWS Timestream for a pretty basic example of reading/writing tick data, without many rows of data. However, the example was pretty easy to do. Also, given it’s so new, I expect we’ll see improvements to the functionality as time passes. I might also have missed some obvious ways of doing things more quickly. I’m sure the folks at AWS must have some tips on how to optimise queries, writes etc. in Timestream!
As a next step, it’s likely that I’d need to try it with a bigger dataset to take advantage of Timestream’s scalability. I was also using Python on my local machine. Next time, I’d likely spin up an EC2 instance in the same AWS region as the Timestream database I was using, to reduce latency issues. There are also some costs for moving data out of an AWS region.
Furthermore, I’d need to spend more time on the code which does the writes, as well as the reads, converting it into parallel code. Cutting these requests up into small parallel chunks is often much quicker than trying to fetch/write very large chunks. This is something I’ve done in my tcapy library when interacting with tick data, and it sped things up considerably.
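As an illustration of what that parallelisation might look like (this isn’t the tcapy implementation, just a sketch), you could split the records into small chunks and fan the writes out across a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

# boto3 clients (unlike resources) are documented as thread-safe, so one client
# can be shared across the worker threads
write_client = boto3.client("timestream-write", region_name="eu-west-1")

def parallel_write(records: list, chunk_size: int = 100, max_workers: int = 8):
    """Split the Records into small chunks and write them in parallel threads."""
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

    def _write_chunk(chunk):
        write_client.write_records(
            DatabaseName="market_data",  # hypothetical names, as before
            TableName="fx_tick",
            Records=chunk,
        )

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() forces any exceptions raised in the workers to surface here
        list(executor.map(_write_chunk, chunks))
```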
I think having a Python wrapper for quickly writing/reading Pandas DataFrames directly to/from Timestream would probably encourage folks to use it (if such a thing doesn’t already exist?). If enough people are interested and would like to sponsor the idea, I’d be happy to spend time writing a nice Python wrapper for Timestream and including it in tcapy.