scheler opened a new issue, #15704: URL: https://github.com/apache/pinot/issues/15704
Requesting comments on the following proposal to add `quantileIndex` support for fast percentile queries.

**Description**: Currently, Apache Pinot does not provide a native, automatic way to optimize percentile and quantile queries. Users often resort to manually creating and ingesting sketches (such as KLL or t-digest) outside of Pinot, which results in redundant data, complex workflows, and inefficient query execution. The sketches must be merged at query time, leading to full scans of the data, which defeats the purpose of using sketches for optimization.

This proposal suggests adding a `quantileIndex` to Pinot. This index type would allow the system to automatically build and store quantile sketches (e.g., KLL sketches) during segment creation, enabling fast percentile queries without requiring full scans or manual sketch creation.

**Motivation**:
- Simplifies user experience: No need to manage sketch columns or manual aggregation.
- Improves query performance: Sketches are merged once at the segment level, reducing the need for row-level processing at query time.
- Aligns with Pinot's indexing model: Like existing indexes (inverted, bloom, range), the `quantileIndex` can be treated as an optional, transparent optimization.

**Proposed Solution**: Implement the `quantileIndex` as a new index type in Pinot that supports the creation and querying of KLL sketches (initially), with potential for supporting other sketch types in the future. During segment generation, Pinot would generate a single sketch for each quantile-indexed column. At query time, the planner can leverage this index to quickly return approximate percentile values without scanning all rows.

**Storage Impact**:
- Sketch Storage: Each segment will contain the merged KLL sketch for the column. This will add some storage overhead, but the sketch is compact compared to raw data.
- Segment-level Storage: The sketch will increase storage per segment, but the space is used efficiently to provide significant performance benefits during percentile queries.
- Selective Indexing: Users can choose which columns to apply the `quantileIndex` to, allowing for targeted optimization and minimal storage overhead for less critical columns.

**Benefits**:
- Faster percentile queries by using pre-computed, merged sketches.
- Reduced ingestion and storage overhead compared to storing a sketch per row.
- No change to existing query syntax: Users can continue using `PERCENTILE` or `PERCENTILEEST` functions as usual.

**Interface Definitions**:

1. Setting Up the Index in the Schema: To apply the `quantileIndex` on a column, the user would define the index type in the schema configuration. The index can be configured with optional parameters, like `k`, which controls the accuracy of the KLL sketch.

Example Schema Configuration:
```
{
  "columns": [
    {
      "name": "latency_ms",
      "dataType": "FLOAT",
      "indexTypes": ["quantileIndex"],
      "quantileIndexConfig": {
        "k": 200
      }
    }
  ]
}
```
In this example, the `quantileIndex` is applied to the `latency_ms` column with the parameter `k` set to 200. `k` is the KLL size parameter: a larger `k` lowers the rank error at the cost of a larger sketch.

2. Querying Percentiles: The query syntax for percentiles would remain the same as it is in Pinot today. When a `quantileIndex` is present for the column, Pinot will automatically use the pre-merged sketch to speed up the percentile calculation.

Example SQL Query: `SELECT PERCENTILE(latency_ms, 95) FROM my_table`

The query planner will recognize the `quantileIndex` and efficiently return the 95th percentile using the precomputed KLL sketch data.

**Supported Primitive Data Types for quantileIndex:** The `quantileIndex` can be used for numeric columns where percentile and quantile calculations are meaningful.
- INT
- BIGINT
- FLOAT
- DOUBLE
- DECIMAL

These data types represent continuous numeric values, which are ideal for computing percentiles.
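To make the configuration rules above concrete, here is a minimal Python sketch of how a column entry might be validated. It is illustrative only and not Pinot code: the keys `indexTypes` and `quantileIndexConfig` and the type list are taken from this proposal, while the function name `effective_quantile_k` and the default of 200 for `k` are assumptions made for the example.

```python
from typing import Any, Dict

# Data types listed as supported in this proposal.
SUPPORTED_TYPES = frozenset({"INT", "BIGINT", "FLOAT", "DOUBLE", "DECIMAL"})
# Assumption for illustration: the proposal does not state a default for k.
DEFAULT_K = 200


def effective_quantile_k(column: Dict[str, Any]) -> int:
    """Validate one column entry and return the k that would be used.

    Hypothetical helper: raises ValueError when the column does not
    request a quantileIndex, has an unsupported type, or has a bad k.
    """
    name = column.get("name")
    if "quantileIndex" not in column.get("indexTypes", []):
        raise ValueError(f"column {name!r} does not request a quantileIndex")
    data_type = column.get("dataType")
    if data_type not in SUPPORTED_TYPES:
        raise ValueError(f"column {name!r}: unsupported dataType {data_type!r}")
    k = column.get("quantileIndexConfig", {}).get("k", DEFAULT_K)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(k, bool) or not isinstance(k, int) or k <= 0:
        raise ValueError(f"column {name!r}: k must be a positive integer")
    return k
```

Rejecting unsupported types when the schema is loaded, rather than at segment build time, would surface a misconfigured column before any data is ingested.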
The KLL sketch (or similar sketches) will be computed based on these columns during segment creation, allowing fast approximate percentile queries at query time.
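The intended flow (build one summary per segment at creation time, then merge the per-segment summaries at query time) can be illustrated with a small Python sketch. This is a deliberately naive stand-in, not KLL and not Pinot code: it keeps at most `k` evenly spaced order statistics per segment, each carrying a weight. All names here (`SegmentSummary`, `build_summary`, `merge`, `percentile`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class SegmentSummary:
    """Hypothetical stand-in for a per-segment quantile sketch (not KLL)."""
    points: List[Tuple[float, float]]  # (value, weight) pairs sorted by value
    count: int                         # number of rows summarized


def build_summary(values: Sequence[float], k: int = 200) -> SegmentSummary:
    """Run once at segment creation: keep at most k order statistics."""
    ordered = sorted(values)
    n = len(ordered)
    if n <= k:
        # Small segment: keep every value with weight 1.
        return SegmentSummary([(v, 1.0) for v in ordered], n)
    weight = n / k
    # Keep the value at the middle rank of each of k equal-sized rank buckets.
    points = [(ordered[int((i + 0.5) * weight)], weight) for i in range(k)]
    return SegmentSummary(points, n)


def merge(summaries: Sequence[SegmentSummary]) -> SegmentSummary:
    """Run at query time: combine summaries without touching raw rows."""
    points = sorted(p for s in summaries for p in s.points)
    return SegmentSummary(points, sum(s.count for s in summaries))


def percentile(summary: SegmentSummary, pct: float) -> float:
    """Approximate percentile: first value whose cumulative weight reaches the target rank."""
    if not summary.points:
        raise ValueError("empty summary")
    target = pct / 100.0 * summary.count
    running = 0.0
    for value, weight in summary.points:
        running += weight
        if running >= target:
            return value
    return summary.points[-1][0]


# Usage: two segments summarized independently, then merged for one query.
seg_a = build_summary(range(0, 1000), k=100)
seg_b = build_summary(range(1000, 2000), k=100)
p95 = percentile(merge([seg_a, seg_b]), 95)
```

The query touches at most `k` points per segment instead of every row, which is the saving the proposal is after. A real KLL sketch would additionally give probabilistic rank-error guarantees and bounded size under repeated merging, which this simplified summary does not.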