Hive Tables and What’s Next for Modern Data Platforms

12 min read

In our introductory post we discussed the typical structure and usual components of a modern data platform.

A very common component of any Data Lake and Data Warehouse implementation is what we often call the “Cold Storage” tier. This is where, or rather how, the vast majority of data is persisted in a cheaper storage solution, ideally in a way that still allows it to be queried. Putting data in “cold storage” assists in reducing the total costs of the data platform and trading off query speed with significantly lower costs.

In this post we are going to focus on that layer of cold storage, understand how it usually works, and try to map the gaps that exist today and offer our wishlist for more modern implementations which will serve the future versions of data platforms.

The best place to start is with the convention most large data platforms in use today are using, often referred to as “Hive Tables”, or the Hive Table Format.

The Hive Table Format

Apache Hive is a SQL engine to query Big Data. It was originally built to translate SQL statements into Hadoop MapReduce jobs, and contin...

SQL on Kafka with Presto (Video)

Presto is a state of the art Distributed SQL Query Engine for BigData, enabling efficient querying on cold data and various data sources. With extended SQL language and features like geospatial queries, joins between different data sources (SQL to join data from HDFS, Elasticsearch, and Kafka anyone...

Showing 10 posts out of 135 total, page 1

Next page