anoopj commented on code in PR #16518: URL: https://github.com/apache/iceberg/pull/16518#discussion_r3305614802
########## format/mumbling-spec.md: ########## @@ -0,0 +1,316 @@ +--- +title: "Mumbling Bitmap Spec" +--- +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Mumbling Bitmap Spec + +This is a specification for the Mumbling compressed bitmap format designed for +use cases with bounded total size, like a deletion vector. This is based on +[Roaring Bitmap][roaring]. + +This spec is for version 1. + +[roaring]: https://roaringbitmap.org/ + + +## Overview + +Mumbling bitmaps are based on the same idea as Roaring bitmaps: a bitmap is +divided into fixed-size (256 bit) regions, called _containers_. Each container +is either _sparse_ to store offsets (0-255) or _dense_ to store a bit set. The +main difference from Roaring is that Mumbling bitmaps use a smaller scale where +each container is at most 32 bytes, and are limited to at most 2,097,152 +values. + +Containers are tracked by an array of descriptor bytes, one per container. The +position of the descriptor byte in the array encodes the most significant bits +of the positions stored in its corresponding container (the _key_ in Roaring +bitmap). The descriptor encodes the container size for sparse containers (0-31 +values), or that a container is dense (32). Because the format uses a +descriptor array instead of keys and offsets, descriptor bytes are stored +as a PFOR-encoded array. + + +## Design choices + +Unlike Roaring, this format uses a descriptor array instead of storing a key, +cardinality, and offset for each container. + +The Iceberg use case is for small, embedded deletion vectors. These vectors +will be small, likely less than 100,000 entries. But, limiting the +per-container overhead by using a single byte for the key (256 containers) +would be too limiting (only 65,536 total entries). As a result, this would +require 2-byte keys and 2-byte offsets (8192 containers * 32 bytes / +container). + +The descriptor array avoids 4 bytes of overhead for each 32-byte container. The +key is implicit and is the offset into the descriptor array. Descriptors encode +the length of a container instead of its offset, requiring only one byte. + +The first trade-off of this approach is that each descriptor must be present, +even for 0-length (empty) containers. And because an empty state is needed, +descriptors cannot encode the container cardinality (up to 256). PFOR encoding +is used to reduce this overhead. + +The second trade-off is that offsets are not directly stored. However, offsets +can be computed from the relatively small descriptor array and finding the +descriptor for a key uses direct array indexing. + +An alternative to the descriptor array is to store the offsets, and use the +difference between offsets to find container length. This approach was not +chosen because array encoding would be worse (values are increasing), and the +remaining descriptor bits cannot be used. + + +## Format + +A Mumbling bitmap consists of 3 sections: + +* Header +* Descriptor array +* Containers + +Throughout the foramt, integers are unsigned and stored as little endian. + + +### Header + +The Mumbling header is made up of the following fields: + +| Field | Size | Description | +|-----------------|---------|-------------| +| Format version | 1 byte | `0x01` for version 1 | +| Cardinality | 3 bytes | The number of set bits | +| Container count | 2 bytes | The number of containers in the bitmap | + +Because the container count is limited to 8,192, cardinality is limited to +2,097,152 (8,192 containers of 256 bits). + + +### Descriptor array + +The descriptor array contains one descriptor byte per container. Its length is +the container count. + +The 3 most significant bits of the descriptor byte determine the container type +and how to interpret the remaining least significant bits of the descriptor. + +| MSB pattern | Container type | Description | Remaining bits | +|-------------|----------------|----------------------------------|----------------| +| `000` | Sparse | A sparse container of 0-31 bytes | Container length / number of bits set | +| `001` | Dense | A dense container of 32 bytes | Must be 0 | + +In v1, the descriptor byte encodes the size of its corresponding container. +The two most significant bits are reserved for future use and implementations +must ignore the least-significant bits for a dense container if they are set. + +Example descriptors: + +| Hex | Binary | Interpretation | +|------|-------------|----------------| +| `00` | `0000 0000` | Sparse container of 0 values | +| `05` | `0000 0101` | Sparse container of 5 values | +| `1F` | `0001 1111` | Sparse container of 31 values | +| `20` | `0010 0000` | Dense container stored in 32 bytes | + +The descriptor array is encoded using patched frame of reference (PFOR) Review Comment: Is the intuition that PFOR would be effective at the low-density end of the the scale? (ie the descriptors are mostly zeroes?) PFOR adds some complexity and I was trying to reason about the benefit of adding it. ########## format/mumbling-spec.md: ########## @@ -0,0 +1,316 @@ +--- +title: "Mumbling Bitmap Spec" +--- +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Mumbling Bitmap Spec + +This is a specification for the Mumbling compressed bitmap format designed for +use cases with bounded total size, like a deletion vector. This is based on +[Roaring Bitmap][roaring]. + +This spec is for version 1. + +[roaring]: https://roaringbitmap.org/ + + +## Overview + +Mumbling bitmaps are based on the same idea as Roaring bitmaps: a bitmap is +divided into fixed-size (256 bit) regions, called _containers_. Each container +is either _sparse_ to store offsets (0-255) or _dense_ to store a bit set. The +main difference from Roaring is that Mumbling bitmaps use a smaller scale where +each container is at most 32 bytes, and are limited to at most 2,097,152 +values. + +Containers are tracked by an array of descriptor bytes, one per container. The +position of the descriptor byte in the array encodes the most significant bits +of the positions stored in its corresponding container (the _key_ in Roaring +bitmap). The descriptor encodes the container size for sparse containers (0-31 +values), or that a container is dense (32). Because the format uses a +descriptor array instead of keys and offsets, descriptor bytes are stored +as a PFOR-encoded array. + + +## Design choices + +Unlike Roaring, this format uses a descriptor array instead of storing a key, +cardinality, and offset for each container. + +The Iceberg use case is for small, embedded deletion vectors. These vectors +will be small, likely less than 100,000 entries. But, limiting the +per-container overhead by using a single byte for the key (256 containers) +would be too limiting (only 65,536 total entries). As a result, this would +require 2-byte keys and 2-byte offsets (8192 containers * 32 bytes / +container). + +The descriptor array avoids 4 bytes of overhead for each 32-byte container. The +key is implicit and is the offset into the descriptor array. Descriptors encode +the length of a container instead of its offset, requiring only one byte. + +The first trade-off of this approach is that each descriptor must be present, +even for 0-length (empty) containers. And because an empty state is needed, +descriptors cannot encode the container cardinality (up to 256). PFOR encoding +is used to reduce this overhead. + +The second trade-off is that offsets are not directly stored. However, offsets +can be computed from the relatively small descriptor array and finding the +descriptor for a key uses direct array indexing. + +An alternative to the descriptor array is to store the offsets, and use the +difference between offsets to find container length. This approach was not +chosen because array encoding would be worse (values are increasing), and the +remaining descriptor bits cannot be used. + + +## Format + +A Mumbling bitmap consists of 3 sections: + +* Header +* Descriptor array +* Containers + +Throughout the foramt, integers are unsigned and stored as little endian. + + +### Header + +The Mumbling header is made up of the following fields: + +| Field | Size | Description | +|-----------------|---------|-------------| +| Format version | 1 byte | `0x01` for version 1 | +| Cardinality | 3 bytes | The number of set bits | +| Container count | 2 bytes | The number of containers in the bitmap | + +Because the container count is limited to 8,192, cardinality is limited to +2,097,152 (8,192 containers of 256 bits). + + +### Descriptor array + +The descriptor array contains one descriptor byte per container. Its length is +the container count. + +The 3 most significant bits of the descriptor byte determine the container type +and how to interpret the remaining least significant bits of the descriptor. + +| MSB pattern | Container type | Description | Remaining bits | +|-------------|----------------|----------------------------------|----------------| +| `000` | Sparse | A sparse container of 0-31 bytes | Container length / number of bits set | +| `001` | Dense | A dense container of 32 bytes | Must be 0 | + +In v1, the descriptor byte encodes the size of its corresponding container. +The two most significant bits are reserved for future use and implementations +must ignore the least-significant bits for a dense container if they are set. + +Example descriptors: + +| Hex | Binary | Interpretation | +|------|-------------|----------------| +| `00` | `0000 0000` | Sparse container of 0 values | +| `05` | `0000 0101` | Sparse container of 5 values | +| `1F` | `0001 1111` | Sparse container of 31 values | +| `20` | `0010 0000` | Dense container stored in 32 bytes | + +The descriptor array is encoded using patched frame of reference (PFOR) +documented in [Appendix A](pfor). PFOR was chosen because it can efficiently +store mostly uniform container sizes along with occasional larger values. The +binary representation for descriptors also allows saving at least 2 bits per +value. + +[pfor]: #appendix-a-pfor-encoding-for-unsigned-bytes + + +### Containers + +The containers section consists of concatenated containers. Each container +stores a 256 bit region of the bitmap. The number of containers is limited to +65,536 to fit into an unsigned 16-bit integer. + +Positions in the bitmap are split into the first 16 bits that identify the +container and the last 8 bits that identify the corresponding position within +the container. + +Containers may be sparse or dense. This type is encoded by the container's +corresponding descriptor byte. Containers with less then 32 bits set must be +sparse and containers with 32 or more bits set must be dense. + + +#### Sparse containers + +A sparse container encodes up to 31 set positions. All other positions are +unset. Each set position is stored as an unsigned byte (0-255), relative to the +start of the container. + +The list of positions must be sorted in ascending order. This allows checking +whether a specific position is set to stop when a higher position is reached or +the last position is reached. + +A sparse container's length is not stored in the container. It is the value of +the corresponding descriptor byte. + +Examples: + +| Descriptor | Container bytes | Set positions | +|------------|-------------------|---------------| +| 0 | (zero bytes) | None | +| 3 | `00 22 FF` | 0, 34, 255 | +| 31 | `00 01 02 ... 1F` | 0, 1, 2, ..., 31 | + + +### Dense containers + +A dense container encodes each bit of the container as 0 (unset) or 1 (set) in +a 32-byte array. + +The first byte in the byte array contains bits 0-7, the next byte contains +8-15, etc. Bit positions are ordered from most significant to least. As a +result, the 0th position in the container is the most significant bit of the +first byte, and the 255th position in the container is the least-significant +bit of the last byte. + +Examples: + +| Descriptor | Container bytes | Set positions | +|------------|-------------------------|---------------| +| 32 | `FF FF FF FF 00 ... 00` | 0-31 | +| 32 | `FF FF FF FF 80 ... 00` | 0-32 | +| 32 | `FF FF 00 ... 00 FF FF` | 0-15, 240-255 | +| 32 | `AA AA ... AA AA` | Even positions: 0, 2, 4, ... | + + +## Working with bitmaps + +Mumbling bitmaps use a descriptor array to track containers. For quick lookups, +implementations should decode the descriptor array and use it to produce an +offset array. + +The container for a position is accessed by finding its type and offset using +the descriptor array. The index into the descriptor array is the position +divided by 256: + +``` +let container_index: u16 = (pos >> 8) as u16 +``` + +The offset of the container is the sum of the lengths of previous containers, +as determined by container descriptors. + +``` +let descriptor: u8 = descriptors[container_index] +let offset: u16 = offsets[container_index] Review Comment: `offset` is a byte offset into the concatenated containers payload. So this doesn't fit into a `u16`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
