AWS Athena vs your own Presto cluster on AWS
I just published Easily deploying Presto on AWS with Terraform, but ignored a very important question: AWS offers Athena for SQL over S3, which is essentially a Presto deployment managed by AWS. Why not just use AWS Athena instead of going through the trouble of deploying your own cluster?
I probably should have addressed this in the original blog post, but since I didn't - let's do the complete reasoning for when you should (and shouldn't) consider your own Presto cluster.
Cost per query
The main reason to prefer your own Presto cluster is significant cost savings.
With Athena cost is per query with a price of $5 per TB scanned. If your organization has a lot of data but only few queries per day, then Athena is definitely the economic choice. However, if your company is data-driven and has a team of analysis and BI users then it's a completely different story. Their dashboards and queries will be in the dozens if not hundreds or thousands per day, and possibly scanning many TBs of data each.
By deploying Presto yourself you can drastically lower the cost-per-query. Here are the economics of SQL over S3 with Presto:
You pay a constant fee for the compute instances you are running (EC2 instances cost). The more machines you run and the bigger they are - the higher the fee, yes. But Presto is very efficient and if your data is correcly stored, a few commodity machines will do a great job.
If you are running your Presto cluster on the same region as your S3 bucket, and within one AZ, then there is no network or data transfer costs at all. We highly recommend this setup - multi-AZ and multi-region are pretty much irrelevant for MPP engines like Presto.
S3 reads will be billed by API calls, but those are priced a fraction of a cent per 10k calls. Presto might leverage S3 Select in queries, which costs a bit more but will boost queries performance and reduce the compute units required.
The compute costs can be further optimized by using spot instances for worker nodes, and completely shutting them down off-hours (where applicable). Presto can deal with a lost worker node - which might slow down some queries but spot instances come at a great discount.
Cost per query can be order of magnitude cheaper, and even several orders of magnitue cheaper, with your own Presto cluster. This will be mostly true if you run queries extensively and each query scans significant amounts of data.
Obviously, this doesn't account S3 Storage costs - but those would be the same also for the data Athena queries.
Either way, make sure you store your data with proper partitions, and use a columnar file format like Parquet or ORC.
Not as high DevOps costs
Usually the argument pro-SaaS solutions is DevOps costs. That is a true argument, but as always - it depends how much the SaaS will really cost compared to the DevOps costs you will evantually pay.
With out presto-cloud-deploy Terraform solution, it's really easy to set-up your Presto cluster on AWS. After the initial setup and tuning, which could take several days at most, you are done. Except from occasional massages to the cluster, you will hardly even remember you deployed a cluster of machines.
Better visibility on query performance
The Presto web UI is a great query monitoring tool, showing you all executed (and failed) queries, along with performance statistics which let you fine-tune your cluster for faster and cheaper queries. Athena doesn't give you anything even remotely close to that.
Presto has an impressive set of Connectors out of the box, with some connectors you can find on the net and plug-in to your Presto deployment.
Those connectors let you query not just data on S3 and MySQL instances (via JDBC), but also non-relational datastores like MongoDB, Redis, Elasticsearch and even Kafka (KSQL anyone?).
If you want to execute queries against those stores is when you want your own Presto cluster. Presto also lets you use multiple data sources in one query (e.g. join S3 data with MySQL data), and that is a killer feature you wouldn't want to miss.