Re: [PR] Docs: add apache amoro(incubating) with iceberg (#11965) [iceberg]

via GitHub Tue, 14 Jan 2025 19:13:21 -0800


XBaith commented on code in PR #11966:
URL: https://github.com/apache/iceberg/pull/11966#discussion_r1915887251



##########
docs/docs/amoro.md:
##########
@@ -0,0 +1,90 @@
+---
+title: "Apache Amoro"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Apache Amoro With Iceberg
+
+**[Apache Amoro(incubating)](https://amoro.apache.org/docs/latest/)** is a Lakehouse management system built on open data lake formats. Working with compute engines including Flink, Spark, and Trino, Amoro brings pluggable and
+**[Table Maintenance](https://amoro.apache.org/docs/latest/self-optimizing/)** features for Lakehouse to provide out-of-the-box data warehouse experience, and helps data platforms or products easily build infra-decoupled, stream-and-batch-fused and lake-native architecture.
+
+
+# Auto Self-optimizing
+
+Lakehouse is characterized by its openness and loose coupling, with data and files maintained by users through various engines. While this
+architecture appears to be well-suited for T+1 scenarios, as more attention is paid to applying Lakehouse to streaming data warehouses and real-time
+analysis scenarios, challenges arise. For example:
+
+- Streaming writes bring a massive amount of fragment files
+- CDC ingestion and streaming updates generate excessive redundant data
+- Using the new data lake format leads to orphan files and expired snapshots.
+
+These issues can significantly affect the performance and cost of data analysis. Therefore, Amoro has introduced a Self-optimizing mechanism to
+create an out-of-the-box Streaming Lakehouse management service that is as user-friendly as a traditional database or data warehouse. The new table
+format is used for this purpose. Self-optimizing involves various procedures such as file compaction, deduplication, and sorting.

Review Comment:
   I believe "the new table format" here refers to a streaming-enhanced table
format based on Iceberg, which essentially extends Iceberg by combining two
Iceberg tables.
   
   We can remove the sentence "The new table format is used for this purpose."
since it is unclear in the current context.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


