| Description: |
Textbook in PDF format
The “lakehouse” data architecture is a powerful way to combine the flexibility of data lakes with the management features of data warehouses. The open source Apache Iceberg framework delivers the scalability, reliability, and performance you want from a lakehouse without the expense and vendor lock-in of platforms like Snowflake, BigQuery, and Redshift.
Apache Iceberg is an open source table format perfect for massive analytic datasets. Iceberg enables ACID transactions, schema evolution, and high-performance queries on data lakes using multiple compute engines like Spark, Trino, Flink, Presto, and Hive. An Iceberg data lakehouse enables fast, reliable analytics at scale while retaining the observability you need for compliance audits, governance, and provable data security.
Data warehouses delivered performance and governance but locked data behind high costs, proprietary formats, and rigid schemas that made change slow. Data lakes reduced storage costs and improved flexibility but sacrificed reliability, performance, and consistency, turning analytics into an engineering project. Hybrid approaches tried to bridge the gap but often added complexity, duplication, and operational overhead. The result was a fragmented data landscape where teams spent more time moving, copying, and fixing data than using it. We’ll see how these shortcomings led to new ways to define and manage datasets on the data lake—approaches that combine the key benefits of both data warehouses and data lakes to create the data lakehouse.
One of the newer approaches is Apache Iceberg. It’s an open table format that lets you treat groups of files on distributed storage systems like traditional database tables, so the data lake can truly be the center of your analytics platform. With Iceberg, multiple tools can efficiently access analytics datasets stored in your data lake. This open access makes it easier for teams to work together and cuts down on unnecessary extract, transform, and load (ETL) work and data replication by keeping a single, canonical copy of the data.
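To make the "groups of files as tables" idea concrete, here is a toy Python sketch of the core concept: a table is an append-only log of immutable snapshots, each listing the data files that make up one version of the table. This is purely illustrative — real Iceberg metadata (manifest lists, manifests, schemas, partition specs) is far richer, and the paths and class names below are invented for the example.

```python
from dataclasses import dataclass

# Toy model of an open table format's central idea: the "table" is an
# ordered log of immutable snapshots, each naming the data files that
# make up that version. Conceptual sketch only -- not Iceberg's actual
# metadata layout.

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # immutable set of file paths in this version

class TableMetadata:
    def __init__(self):
        self.snapshots = []

    def commit(self, data_files):
        """Atomically publish a new table version (append-only log)."""
        snap = Snapshot(len(self.snapshots) + 1, tuple(data_files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def current(self):
        return self.snapshots[-1]

    def snapshot(self, snapshot_id):
        """Time travel: read any previously committed version."""
        return self.snapshots[snapshot_id - 1]

table = TableMetadata()
table.commit(["s3://lake/orders/file-a.parquet"])
table.commit(["s3://lake/orders/file-a.parquet",
              "s3://lake/orders/file-b.parquet"])

# Readers only ever see committed snapshots, so any engine reading the
# same snapshot id gets the same consistent view of the table.
print(table.current().data_files)
print(table.snapshot(1).data_files)
```

Because every engine resolves the table through the same committed snapshots, Spark, Trino, Flink, and others can share one canonical copy of the data without stepping on each other — which is what lets the data lake behave like a warehouse table.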
The Apache Iceberg lakehouse is a modular, scalable, and cost-effective architecture that combines the best aspects of data lakes and warehouses while staying open and flexible. Why are companies like Netflix, Apple, Dremio, AWS, Snowflake, and Databricks using Iceberg and building tools around it? One reason is that Iceberg offers a community-led standard format for storing analytical datasets. It works across a wide range of tools while still providing ACID (atomicity, consistency, isolation, durability) guarantees and the performance you expect from proprietary data warehouse systems.
This book will show you how Iceberg works and how you can make the right architectural choices for an Iceberg lakehouse to meet your use cases. We’ll cover data lakehouses in general and Apache Iceberg in particular, with hands-on exercises you can run locally. You’ll ingest data from databases into your lakehouse and build business intelligence dashboards on top of it. Along the way, I’ll help you assess your data platform needs and explore the ecosystem around each component so you can understand your options for building the ideal platform.
In this book, data guru Alex Merced shows you:
How to create a modular, scalable Iceberg lakehouse architecture
Where Spark, Flink, Dremio, and Polaris fit into your design
How to build reliable batch and streaming ingestion pipelines
Strategies for governance, security, and performance at scale
About the book:
Architecting an Apache Iceberg Data Lakehouse teaches you to design a complete data platform with Iceberg. The book carefully guides you through the architecture of your platform—from storage to governance. Each layer is fully illustrated and includes hands-on examples that connect theory with practical implementation. You’ll ingest sales and marketing data from PostgreSQL into Iceberg tables using Apache Spark, build interactive dashboards in Apache Superset, design and compare ingestion pipelines, and much more. Author Alex Merced’s expert guidance helps you understand the important tradeoffs you’ll need to weigh in real-world implementations. You’ll soon have a scalable and maintainable data platform that can handle petabytes of data!
About the reader:
This book is for data architects, platform engineers, and senior data professionals responsible for modernizing data infrastructure or designing new analytical platforms. You should be familiar with the general concepts of data lakes, warehouses, and processing tools such as Apache Spark or Flink. No prior experience with Apache Iceberg is required, but familiarity with cloud storage, distributed systems, and SQL will help you get the most out of the material.
About the technology
Apache Iceberg is an open table format that lets data lake files work like database tables. It helps turn a data lake into a more reliable and capable lakehouse.
What's inside
Create a modular, scalable Iceberg lakehouse architecture
Fit Spark, Flink, Dremio, Polaris, and more into your design
Batch and streaming ingestion pipelines
Governance, security, and performance at scale
About the reader
For data architects familiar with the basics of a data lakehouse.
About the author
Alex Merced is Head of Developer Relations at Dremio. He shares his expertise through videos, podcasts, and articles, and leads the DataLakehouseHub.com community.
Table of Contents
Part 1 The world of the data lakehouse
Apache Iceberg and the lakehouse
Hands-on with Apache Iceberg
Part 2 Preparing for your move to Apache Iceberg
Selecting the storage layer
Architecting the ingestion layer
Implementing the catalog layer
Designing the federation layer
Understanding the consumption layer
Part 3 Operating your Apache Iceberg lakehouse
Maintaining an Iceberg lakehouse
Operationalizing Apache Iceberg
A The metadata tables
B Python for Apache Iceberg
C The Apache Iceberg specification
|
Discussion