XBaith commented on code in PR #11966:
URL: https://github.com/apache/iceberg/pull/11966#discussion_r1915887251
##########
docs/docs/amoro.md:
##########
@@ -0,0 +1,90 @@
+---
+title: "Apache Amoro"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ - http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Apache Amoro With Iceberg
+
+**[Apache Amoro(incubating)](https://amoro.apache.org/docs/latest/)** is a Lakehouse management system built on open data lake formats. Working with compute engines including Flink, Spark, and Trino, Amoro brings pluggable and
+**[Table Maintenance](https://amoro.apache.org/docs/latest/self-optimizing/)** features for Lakehouse to provide out-of-the-box data warehouse experience, and helps data platforms or products easily build infra-decoupled, stream-and-batch-fused and lake-native architecture.
+
+
+# Auto Self-optimizing
+
+Lakehouse is characterized by its openness and loose coupling, with data and files maintained by users through various engines. While this
+architecture appears to be well-suited for T+1 scenarios, as more attention is paid to applying Lakehouse to streaming data warehouses and real-time
+analysis scenarios, challenges arise. For example:
+
+- Streaming writes bring a massive amount of fragment files
+- CDC ingestion and streaming updates generate excessive redundant data
+- Using the new data lake format leads to orphan files and expired snapshots.
+
+These issues can significantly affect the performance and cost of data analysis. Therefore, Amoro has introduced a Self-optimizing mechanism to
+create an out-of-the-box Streaming Lakehouse management service that is as user-friendly as a traditional database or data warehouse. The new table
+format is used for this purpose. Self-optimizing involves various procedures such as file compaction, deduplication, and sorting.

Review Comment:
I believe it is a streaming-enhanced table format based on Iceberg. The table format essentially extends Iceberg by combining two Iceberg tables. We can remove this statement as it lacks clarity in the current context.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org