When I hear S3, I still think of Audi's S and RS models, those high performance variants that are Audi's answer to BMW's M series. I've got to admit, there is something a bit more special about a BMW M5 compared to an RS6, but I've never driven either to confirm that (comments welcome from folks who have driven both). However, I'm probably in the minority: say S3 these days, and most people will immediately think of S3 on Amazon Web Services.
What exactly is AWS S3? It's essentially storage in the cloud, managed by AWS, which you can make accessible from anywhere on the web. You don't need to worry about scaling your storage or backing it up; AWS does all that for you. Throw whatever data at it, and it sticks! Data is stored in S3 buckets, which are similar in concept to folders on a local disk. Other cloud providers have their own equivalents, such as Google Cloud Storage (GCS) and Azure Blob Storage. As with most things on AWS (or indeed any cloud provider), it's probably not surprising that S3 is not the only storage service available. You'll find a full list of AWS storage products at https://aws.amazon.com/products/storage/, offered at different price and performance levels. It is important to use the storage service whose performance/cost balance is right for your specific use case. This will depend on what performance/latency is required, as well as how frequently the data will be accessed.
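To make the bucket idea concrete, here is a minimal sketch using boto3, the official AWS SDK for Python. The bucket name below is a hypothetical placeholder (bucket names must be globally unique), and you need AWS credentials configured locally for the calls to actually run, so the AWS-touching part is shown only as a usage comment.

```python
def s3_round_trip(s3, bucket: str, key: str, payload: bytes) -> bytes:
    """Upload an object to a bucket and read it straight back.

    `s3` is any client exposing boto3's put_object/get_object calls.
    """
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()


# Usage with real credentials (e.g. set up via `aws configure`):
#
#   import boto3
#   s3 = boto3.client("s3")
#   s3.create_bucket(Bucket="my-example-bucket-20201101")  # hypothetical name;
#       # outside us-east-1 you must also pass CreateBucketConfiguration
#   s3_round_trip(s3, "my-example-bucket-20201101", "hello.txt", b"hello s3")
```

The client is passed in as a parameter rather than created inside the function, which makes the sketch easy to exercise without touching AWS at all.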
The cost of S3 depends upon factors like:
- how much data you store
- which storage class you use (S3 Standard Storage, for example, or S3 Infrequent Access Storage)
- how many requests you make for the data (PUT/GET etc.)
- which region your data is in
- how much data you transfer from S3 out to the internet
An article on cloudhealthtech.com goes through the various ins and outs of the pricing, which I found quite useful. They note that the storage cost for S3 Standard (Nov 2020) is around 0.021 to 0.026 USD per GB per month. So for 1 TB that's around 21-26 USD per month, or roughly 250-310 USD per year. This excludes the various request costs, for example, which you also need to take into account. The hardware itself is a lot cheaper (a quick browse of 1 TB drives online suggested a cost of around 50 USD), but if we manage our own hardware, we need to take into account the time cost of managing it, including things like backups, convenience of access, replacing failed drives etc. The cost of losing data is likely to be significant if we choose to host our data locally and we haven't got sufficient redundancy.
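The arithmetic above is easy to sketch. The per-GB prices are the Standard-tier figures quoted above; request and data-transfer charges are deliberately ignored, so treat this as a lower bound on the real bill.

```python
def monthly_storage_cost(gb: float, usd_per_gb_month: float) -> float:
    """S3 storage cost per month, ignoring request and transfer charges."""
    return gb * usd_per_gb_month


# 1 TB (~1000 GB) at the low and high ends of the quoted range (Nov 2020)
low = monthly_storage_cost(1000, 0.021)   # ~21 USD per month
high = monthly_storage_cost(1000, 0.026)  # ~26 USD per month
annual = (low * 12, high * 12)            # roughly 252 to 312 USD per year
```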
Is S3 easy to use? It is fairly easy to use once it is all set up and you have the right permissions. This involves creating a bucket when you log in to AWS's console. If you want to use it from Python, there are a few steps you'll need to go through to make it accessible. In my findatapy library, I've made it relatively straightforward to download market data and store it in S3 as Parquet files, so that it's as easy as accessing data from a local drive. As with any data process, you need to make sure you follow the data licences associated with the data you use and store, in particular when giving different users access permissions.
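As a rough illustration of that workflow (a generic pandas sketch, not findatapy's actual API), pandas can write Parquet straight to an s3:// path when the s3fs package is installed and AWS credentials are configured. The bucket name and key layout here are hypothetical.

```python
import pandas as pd


def s3_parquet_path(bucket: str, ticker: str, freq: str) -> str:
    """A tidy, ticker-keyed layout, e.g. s3://my-bucket/daily/EURUSD.parquet"""
    return f"s3://{bucket}/{freq}/{ticker}.parquet"


def store_daily_bars(df: pd.DataFrame, bucket: str, ticker: str) -> str:
    """Write a daily bar DataFrame to S3 as Parquet (needs s3fs + credentials)."""
    path = s3_parquet_path(bucket, ticker, "daily")
    df.to_parquet(path)  # pandas hands the s3:// URL to s3fs under the hood
    return path


# Usage with real credentials:
#
#   df = pd.DataFrame({"close": [1.164, 1.172]},
#                     index=pd.to_datetime(["2020-11-02", "2020-11-03"]))
#   store_daily_bars(df, "my-market-data", "EURUSD")
#   df_back = pd.read_parquet("s3://my-market-data/daily/EURUSD.parquet")
```

Reading back with `pd.read_parquet` on the same path then feels much like reading a local file, which is the convenience the paragraph above is describing.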
As you might expect, there is some latency when accessing S3 from a local machine, and I expect the main use case is to use S3 whilst working within AWS itself, in the same region. However, findatapy also has a handy feature to cache market data requests using Redis, which helps to speed things up: if you make multiple requests for the same data, it checks the Redis cache first. If you're interested in precisely how to use findatapy together with S3 to store market data in a nicely organised way, with the ability to have ticker mappings, I've put together a Jupyter notebook on my finmarketpy GitHub, with examples of storing Dukascopy tick data and Quandl market data. There are also some other Python libraries worth looking at if you are using AWS with Python, such as AWS Data Wrangler and smart_open.
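The caching idea is the classic cache-aside pattern. Here is a minimal self-contained sketch of it; the key scheme and fetch function are hypothetical stand-ins, not findatapy's actual implementation.

```python
import json


def cache_key(ticker: str, start: str, end: str) -> str:
    """Deterministic cache key for one data request (illustrative scheme)."""
    return f"mktdata:{ticker}:{start}:{end}"


class DictCache(dict):
    """In-memory stand-in exposing the same get/set calls as a redis-py client."""
    def set(self, key, value):
        self[key] = value


def get_with_cache(cache, ticker, start, end, fetch):
    """Cache-aside: return cached data if present, else fetch it and cache it."""
    key = cache_key(ticker, start, end)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)           # cache hit: skip the slow fetch
    data = fetch(ticker, start, end)     # e.g. a slow S3 or data-vendor call
    cache.set(key, json.dumps(data))     # store for the next identical request
    return data
```

With redis-py you could swap `DictCache()` for `redis.Redis()`, since it exposes the same `get`/`set` interface (values come back as bytes, which `json.loads` also accepts).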
There are obviously good reasons to store market data locally. However, for redundancy and for usage in the cloud, tools like S3 are especially useful. S3 allows us to scale our storage up and down quite easily, without having to worry about managing hard drives etc. Furthermore, it is possible to access S3 in a way very similar to how you would access market data locally, using, for example, a tool like findatapy.