Re: [PR] Docs: add apache amoro(incubating) with iceberg (#11965) [iceberg]

via GitHub Mon, 17 Feb 2025 20:11:19 -0800


manuzhang commented on code in PR #11966:
URL: https://github.com/apache/iceberg/pull/11966#discussion_r1959031682



##########
docs/docs/amoro.md:
##########
@@ -0,0 +1,89 @@
+---
+title: "Apache Amoro"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Apache Amoro With Iceberg
+
+**[Apache Amoro(incubating)](https://amoro.apache.org)** is a Lakehouse 
management system built on open data lake formats. Working with compute engines 
including Flink, Spark, and Trino, Amoro brings pluggable and
+**[Table Maintenance](https://amoro.apache.org/docs/latest/self-optimizing/)** 
features for a Lakehouse to provide an out-of-the-box data warehouse experience, 
and helps data platforms or products easily build infra-decoupled, 
stream-and-batch-fused and lake-native architecture.
+**AMS(Amoro Management Service)**  provides Lakehouse management features, 
like self-optimizing, data expiration, etc. It also provides a unified catalog 
service for all compute engines, which can also be combined with existing 
metadata services like HMS(Hive Metastore).
+
+# Auto Self-optimizing
+
+Lakehouse is characterized by its openness and loose coupling, with data and 
files maintained by users through various engines. While this
+architecture appears to be well-suited for T+1 scenarios, as more attention is 
paid to applying Lakehouse to streaming data warehouses and real-time
+analysis scenarios, challenges arise. For example:
+
+- Streaming writes produce a massive number of small, fragmented files
+- CDC ingestion and streaming updates generate excessive redundant data
+- Using the new data lake format leads to orphan files and expired snapshots
+
+These issues can significantly affect the performance and cost of data 
analysis. Therefore, Amoro has introduced a Self-optimizing mechanism to
+create an out-of-the-box Streaming Lakehouse management service that is as 
user-friendly as a traditional database or data warehouse. Self-optimizing 
involves various procedures such as file compaction, deduplication, and sorting.
+
+The architecture and working mechanism of Self-optimizing are shown in the 
figure below:
+
+![Self-optimizing 
architecture](https://github.com/apache/amoro/blob/master/docs/images/concepts/self-optimizing_arch.png)
+
+The Optimizer is a component responsible for executing Self-optimizing tasks. 
It is a resident process managed by 
[AMS](https://amoro.apache.org/docs/latest/#architecture). AMS is responsible 
for
+detecting and planning Self-optimizing tasks for tables, and then scheduling 
them to Optimizers for distributed execution in real-time. Finally, AMS
+is responsible for submitting the optimizing results. Amoro achieves physical 
isolation of Optimizers through the Optimizer Group.
+
+The core features of [Amoro's 
Self-optimizing](https://amoro.apache.org/docs/latest/self-optimizing/) are:
+
+- Automated, Asynchronous and Transparent — Continuous background detecting of 
file changes, asynchronous distributed execution of optimizing tasks,
+  transparent and imperceptible to users
+- Resource Isolation and Sharing — Allow resources to be isolated and shared 
at the table level, as well as setting resource quotas
+- Flexible and Scalable Deployment — Optimizers support various deployment 
methods and convenient scaling
+
+
+# Iceberg Format

Review Comment:
   * I'd suggest adding a second-level header `Table Formats` with a 
third-level header for each format
   * The first two paragraphs below are not related to "Format"




