by Bill Zhang | Posted in Technical | October 10, 2024 | 3 min read
The open data lakehouse is quickly becoming the standard architecture for unified multifunction analytics on large volumes of data. It combines the flexibility and scalability of data lake storage with the data analytics, data governance, and data management functionality of the data warehouse. Open table formats are a key component of this architecture, as they provide many of the capabilities of traditional data warehousing directly on data lake storage, and Apache Iceberg is quickly becoming the standard format for vendors and customers alike.
Iceberg has many features that drastically reduce the work required to deliver a high-performance view of the data, but many of these features create overhead and require manual job execution to optimize for performance and costs. To make the data lakehouse even easier to manage, Cloudera is introducing Cloudera Lakehouse Optimizer, which intelligently automates Iceberg table maintenance so many of these jobs automatically run in the background. Let’s take a look at some of the features in Cloudera Lakehouse Optimizer, the benefits they provide, and the road ahead for this service.
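For context, the manual job execution mentioned above usually means running Apache Iceberg's built-in Spark procedures by hand or on a schedule. The sketch below is only an illustration of that manual approach, not a description of how Cloudera Lakehouse Optimizer works internally. It builds the SQL for three of those procedures (`rewrite_data_files`, `expire_snapshots`, and `remove_orphan_files`). The catalog name, table name, and 7-day retention window are hypothetical placeholders.

```python
from datetime import datetime, timedelta, timezone


def maintenance_statements(catalog: str, table: str, now: datetime,
                           retention_days: int = 7) -> list[str]:
    """Build Spark SQL CALL statements for Iceberg's maintenance procedures.

    `catalog`, `table`, and `retention_days` are illustrative assumptions;
    choose values that match your own environment and retention policy.
    """
    # Anything older than this cutoff is considered safe to clean up.
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Compaction: rewrite many small data files into fewer, larger ones.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # Table cleanup: expire snapshots older than the cutoff.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        # Table cleanup: delete files no longer referenced by table metadata.
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
    ]


if __name__ == "__main__":
    # Hypothetical catalog and table names, for illustration only.
    for stmt in maintenance_statements("my_catalog", "db.events",
                                       datetime.now(timezone.utc)):
        print(stmt)
```

With a SparkSession configured for an Iceberg catalog, each statement could be passed to `spark.sql(...)`. Deciding when to run these jobs, and on which tables, is the part that someone otherwise has to schedule and tune manually.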
Cloudera Lakehouse Optimizer Features
Cloudera Lakehouse Optimizer runs automatic, policy-based Iceberg table optimization tasks based on user configurations and Iceberg table statistics. Automatic optimization jobs include:
Compaction: Companies often ingest many small files, such as with micro-batching or streaming ingestion, and reading many small files can hurt query performance. Compaction rewrites small files into larger ones to improve performance. Cloudera Lakehouse Optimizer autonomously determines the best time to compact data files so users always get the best performance from their tables. It also prioritizes which tables to optimize based on usage patterns, so optimization runs only where there is real ROI.
Table Cleanup: As tables grow, they often accumulate data files, manifest files, and snapshots that are no longer needed. Users may want to perform table maintenance functions, like expiring snapshots, removing old metadata files, and deleting orphan files, to optimize storage utilization and improve performance. Cloudera Lakehouse Optimizer autonomously determines the best time to perform these maintenance tasks and ensures tables always use storage efficiently.
In addition to optimization and policy-based controls, Cloudera Lakehouse Optimizer features observability for optimization jobs, so data teams can see and understand how their policies are impacting the health and performance of their tables and storage.
The Benefits
Cloudera Lakehouse Optimizer provides several benefits for companies managing Iceberg tables:
They experience lower Total Cost of Ownership (TCO) as a result of optimizing their storage footprint and reducing query runtimes.
They can deliver high-performance access to their data by reducing the number of files that need to be read in a query.
They reduce data management effort and overhead by automating some of the most tedious lakehouse maintenance tasks.
Fig 1. Cloudera internal benchmarks demonstrate significant cost savings using Cloudera Lakehouse Optimizer to maintain Iceberg tables. Actual results will vary depending on actual usage.
The Road Ahead
The features we are launching in Cloudera Lakehouse Optimizer solve two very important challenges for companies that want to move to an open data lakehouse architecture. This is just the first step in advancing Cloudera’s vision of making it easier than ever to deliver a high-performance view of your data. Down the road, we plan to add support for more optimization features, including query optimization and reorganizing partitions to solve data distribution problems that can impact query performance. The goal for all of these features is to ensure that Cloudera is the best platform for managing and delivering access to Iceberg tables, and that the path to adopting an open data lakehouse is easier than ever.
Our Open Data Lakehouse is Free to Try
You can try Cloudera’s open data lakehouse on AWS for free today. Go sign up for our 5-day trial here to see for yourself.
Bill Zhang
Senior Director Product Management, Data Warehousing