Re: [PR] Docs: add apache amoro(incubating) with iceberg (#11965) [iceberg]

via GitHub Mon, 17 Feb 2025 20:06:30 -0800


manuzhang commented on code in PR #11966:
URL: https://github.com/apache/iceberg/pull/11966#discussion_r1959029226



##########
docs/docs/amoro.md:
##########
@@ -0,0 +1,89 @@
+---
+title: "Apache Amoro"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Apache Amoro With Iceberg
+
+**[Apache Amoro(incubating)](https://amoro.apache.org)** is a Lakehouse management system built on open data lake formats. Working with compute engines including Flink, Spark, and Trino, Amoro brings pluggable and
+**[Table Maintenance](https://amoro.apache.org/docs/latest/self-optimizing/)** features for a Lakehouse to provide out-of-the-box data warehouse experience, and helps data platforms or products easily build infra-decoupled, stream-and-batch-fused and lake-native architecture.
+**AMS(Amoro Management Service)**  provides Lakehouse management features, like self-optimizing, data expiration, etc. It also provides a unified catalog service for all compute engines, which can also be combined with existing metadata services like HMS(Hive Metastore).
+
+# Auto Self-optimizing
+
+Lakehouse is characterized by its openness and loose coupling, with data and files maintained by users through various engines. While this
+architecture appears to be well-suited for T+1 scenarios, as more attention is paid to applying Lakehouse to streaming data warehouses and real-time
+analysis scenarios, challenges arise. For example:
+
+- Streaming writes bring a massive amount of fragment files
+- CDC ingestion and streaming updates generate excessive redundant data
+- Using the new data lake format leads to orphan files and expired snapshots.
+
+These issues can significantly affect the performance and cost of data analysis. Therefore, Amoro has introduced a Self-optimizing mechanism to
+create an out-of-the-box Streaming Lakehouse management service that is as user-friendly as a traditional database or data warehouse. Self-optimizing involves various procedures such as file compaction, deduplication, and sorting.
+
+The architecture and working mechanism of Self-optimizing are shown in the figure below:
+
+![Self-optimizing architecture](https://github.com/apache/amoro/blob/master/docs/images/concepts/self-optimizing_arch.png)
+
+The Optimizer is a component responsible for executing Self-optimizing tasks. It is a resident process managed by [AMS](https://amoro.apache.org/docs/latest/#architecture). AMS is responsible for

Review Comment:
   Why is AMS not linked where it's first introduced above?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
