osscm commented on code in PR #16961: URL: https://github.com/apache/iceberg/pull/16961#discussion_r3509591538
##########
format/index.md:
##########
@@ -0,0 +1,310 @@
+---
+title: "Index Spec"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ - http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Iceberg Index Specification
+
+## Background and Motivation
+
+Indexes enable query engines to locate relevant rows without scanning entire datasets.
+They can accelerate point lookups, range predicates, and other retrieval patterns
+while preserving Iceberg's table format, snapshot isolation, and interoperability.
+
+Indexes are optional. Engines may choose to create, maintain, consume, or ignore them.
+
+## Goals
+
+- Define a portable metadata format for indexes
+- Provide a common storage architecture for index data
+- Allow indexes to evolve independently from table metadata as catalog-managed objects
+- Enable index sharing across engines
+- Provide a framework for defining new index types and transform functions
+
+## Overview
+
+Indexes are stored as a collection of files with some Iceberg table like semantics. At a high level they consist of a tracking file (similar to a root manifest file) which contains listings for a defined set of leaf files (similar to data files.) Leaf files store an ordered set of rows containing at least a key and the path of a Iceberg Table data file and the position within that file where the row where that key is stored. The organization of leaf files is defined by an Indexing Transform which varies based on the type of index. This structure is recorded in an Index metadata.json file which contains a set of snapshots, each of which points to a single tracking file mapping to the complete state of an Iceberg table at a given Iceberg table snapshot.
+
+Like Iceberg tables, views, and functions:
+
+- Metadata files (index metadata and tracking files) and data files (leaf files) are immutable
+- Updates create new metadata files
+- Catalogs perform atomic metadata swaps
+
+Each index snapshot references a tracking file which describes the leaf files belonging to the snapshot.
+
+```text
+Index Metadata
+  |
+  +-- Index Snapshot
+        |
+        +-- Tracking File

Review Comment:
I believe both have pros and cons. Two levels can be a problem when analyzing the metadata, and the files need to be listed.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
