In our introductory post we discussed the typical structure and usual components of a modern data platform.
A very common component of any Data Lake and Data Warehouse implementation is what we often call the “Cold Storage” tier. This is where, or rather how, the vast majority of data is persisted in a cheaper storage solution, ideally in a way that still allows it to be queried. Putting data in “cold storage” reduces the total cost of the data platform by trading query speed for significantly cheaper storage.
In this post we focus on that cold storage layer: how it usually works, which gaps exist today, and our wishlist for more modern implementations that will serve future versions of data platforms.
The best place to start is with the convention that most large data platforms use today, often referred to as “Hive Tables” or the Hive Table Format.
The Hive Table Format
Apache Hive is a SQL engine to query Big Data. It was originally built to translate SQL statements into Hadoop MapReduce jobs, and contin...