Introduction to Apache Hudi

11 min read

Apache Hudi is a data lake technology that has been in use since 2016. Originally built by Uber, Hudi is the first of the three data platforms that we’re going to examine in detail. The series also includes in-depth reviews of Delta Lake and Apache Iceberg, and will end with a comparison of these three popular technologies.

This is the third post in our series, Architectures Of A Modern Data Platform.

In this post, we are going to provide an overview of Apache Hudi, before diving in and examining its key strengths and how they match up against the requirements we set out in our previous article.

A short history of Apache Hudi

Hudi was originally developed at Uber in the early 2010s and went into production in 2016. Uber needed a cloud-based data lake to power and manage the 100 PB of data that underpin key business functions for trips, riders, and customers. Uber’s data engineering team had a specific need to feed changes from relational SQL databases into a data lake, via either change data capture (CDC) or binlogs.

Uber submitted Hudi to the Apache Software Foundation in 2...

Pulse for Elasticsearch and OpenSearch - Product Updates January 2023

5 min read

Let’s cut to the chase. Every Elasticsearch and OpenSearch user and administrator, whether on a managed platform or self-hosted, knows this feeling - endlessly hoping the cluster keeps up and doesn’t crash, and dreading the on-call alert in the middle of the night that demands action - which is ofte...

SQL on Kafka with Presto (Video)

Presto is a state-of-the-art distributed SQL query engine for big data, enabling efficient querying on cold data and various data sources. With extended SQL language and features like geospatial queries, joins between different data sources (SQL to join data from HDFS, Elasticsearch, and Kafka anyone...
