scheler opened a new issue, #15704:
URL: https://github.com/apache/pinot/issues/15704

   Requesting comments on the following proposal to add `quantileIndex` support 
for fast percentile queries.
   
   **Description**:
   Currently, Apache Pinot does not provide a native, automatic way to optimize 
percentile and quantile queries. Users often resort to manually creating and 
ingesting sketches (such as KLL or t-digest) outside of Pinot, which results in 
redundant data, complex workflows, and inefficient query execution. Because the 
per-row sketches must still be merged at query time, each query scans the full 
data set, which defeats the purpose of using sketches for optimization.
   
   This proposal suggests adding a quantileIndex to Pinot. This index type 
would allow the system to automatically build and store quantile sketches 
(e.g., KLL sketches) during segment creation, enabling fast percentile queries 
without requiring full scans or manual sketch creation.
   
   **Motivation**:
   Simplifies user experience: No need to manage sketch columns or manual 
aggregation.
   
   Improves query performance: Sketches are pre-merged at the segment level, 
reducing the need for row-level processing at query time.
   
   Aligns with Pinot's indexing model: Like existing indexes (inverted, bloom, 
range), the quantileIndex can be treated as an optional, transparent 
optimization.
   
   **Proposed Solution**:
   Implement the quantileIndex as a new index type in Pinot that supports the 
creation and querying of KLL sketches (initially), with potential for 
supporting other sketch types in the future.
   
   During segment generation, Pinot would generate a single sketch for each 
quantile-indexed column.
   
   At query time, the planner can leverage this index to quickly return 
approximate percentile values without scanning all rows.
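
The build-then-merge flow described above can be sketched as follows. This is a toy illustration only: `ToyQuantileSummary` is a hypothetical stand-in for a real KLL sketch (it compacts deterministically, whereas KLL compacts randomly and carries formal error guarantees), and none of these names are Pinot APIs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyQuantileSummary:
    """Hypothetical stand-in for a KLL sketch: at most `k` weighted points."""
    k: int = 200
    points: list = field(default_factory=list)  # sorted (value, weight) pairs

    def __post_init__(self):
        if self.k < 1:
            raise ValueError("k must be >= 1")

    @classmethod
    def from_values(cls, values, k=200):
        # "Segment creation": summarize every row of one column once.
        summary = cls(k=k, points=[(v, 1) for v in sorted(values)])
        summary._compact()
        return summary

    def merge(self, other):
        # "Query time": combine per-segment summaries without touching rows.
        merged = ToyQuantileSummary(k=self.k,
                                    points=sorted(self.points + other.points))
        merged._compact()
        return merged

    def _compact(self):
        # Halve the point count until it fits in k: each adjacent pair
        # collapses into its larger value, carrying the summed weight.
        while len(self.points) > self.k:
            pts = self.points
            paired = [(pts[i + 1][0], pts[i][1] + pts[i + 1][1])
                      for i in range(0, len(pts) - 1, 2)]
            if len(pts) % 2:
                paired.append(pts[-1])
            self.points = paired

    def quantile(self, q):
        # Walk the cumulative weight until the target rank is reached.
        # Assumes the summary is non-empty.
        target = q * sum(w for _, w in self.points)
        running = 0
        for value, weight in self.points:
            running += weight
            if running >= target:
                return value
        return self.points[-1][0]


# Two "segments" of the same column, summarized independently, then merged.
segment_a = ToyQuantileSummary.from_values(range(1, 501))
segment_b = ToyQuantileSummary.from_values(range(501, 1001))
approx_p95 = segment_a.merge(segment_b).quantile(0.95)
```

The point of the sketch is the shape of the work: the per-row cost is paid once when the segment is built, and a query only merges one small summary per segment.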
   
   **Storage Impact**:
   Sketch Storage: Each segment will contain the merged KLL sketch for the 
column. This will add some storage overhead, but the sketch is compact compared 
to raw data.
   
   Segment-level Storage: The sketch will increase storage per segment, but the 
space is used efficiently to provide significant performance benefits during 
percentile queries.
   
   Selective Indexing: Users can choose which columns to apply the 
quantileIndex to, allowing for targeted optimization and minimal storage 
overhead for less critical columns.
   
   **Benefits**:
   Faster percentile queries by using pre-computed, merged sketches.
   
   Reduced ingestion and storage overhead compared to storing a sketch per row.
   
   No change to existing query syntax: Users can continue using PERCENTILE or 
PERCENTILEEST functions as usual.
   
   **Interface Definitions**:
   1. Setting Up the Index in the Schema:
   To apply the quantileIndex on a column, the user would define the index type 
in the schema configuration. The index can be configured with optional 
parameters, like k, which controls the accuracy of the KLL sketch.
   
   Example Schema Configuration:
   
   ```
   {
     "columns": [
       {
         "name": "latency_ms",
         "dataType": "FLOAT",
         "indexTypes": ["quantileIndex"],
         "quantileIndexConfig": {
           "k": 200
         }
       }
     ]
   }
   ```
   In this example, the quantileIndex is applied to the latency_ms column with 
the parameter k set to 200. In a KLL sketch, k controls the trade-off between 
sketch size and accuracy: a larger k retains more items and lowers the rank 
error.
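
As a rough illustration of how such a configuration might be consumed, the snippet below reads the proposed JSON and extracts k for each quantile-indexed column. The function name `quantile_indexed_columns` and the fallback `DEFAULT_K` are assumptions made for this example; the proposal does not define a default or a validation rule.

```python
import json

DEFAULT_K = 200  # hypothetical fallback; the proposal does not specify one


def quantile_indexed_columns(schema_json: str) -> dict:
    """Map each quantile-indexed column name to its sketch parameter k."""
    schema = json.loads(schema_json)
    result = {}
    for column in schema.get("columns", []):
        if "quantileIndex" not in column.get("indexTypes", []):
            continue  # column does not request the proposed index
        k = column.get("quantileIndexConfig", {}).get("k", DEFAULT_K)
        # Reject non-integers (including booleans) and non-positive values.
        if isinstance(k, bool) or not isinstance(k, int) or k < 1:
            raise ValueError(
                f"column {column['name']!r}: k must be a positive integer, "
                f"got {k!r}"
            )
        result[column["name"]] = k
    return result


EXAMPLE_SCHEMA = """
{
  "columns": [
    {
      "name": "latency_ms",
      "dataType": "FLOAT",
      "indexTypes": ["quantileIndex"],
      "quantileIndexConfig": {"k": 200}
    },
    {"name": "host", "dataType": "STRING"}
  ]
}
"""

indexed = quantile_indexed_columns(EXAMPLE_SCHEMA)
```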
   
   2. Querying Percentiles:
   The query syntax for percentiles would remain the same as it is in Pinot 
today. When a quantileIndex is present for the column, Pinot will automatically 
use the pre-merged sketch to speed up the percentile calculation.
   
   Example SQL Query:
   
   `SELECT PERCENTILE(latency_ms, 95) FROM my_table`
   
   The query planner will recognize the quantileIndex and efficiently return 
the 95th percentile using the precomputed KLL sketch data.
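
One way to picture the planner's decision is the sketch below. It rests on an assumption the proposal does not spell out: since each segment stores a single sketch covering all of its rows, that sketch can only answer a query whose filter matches every row in the segment, and anything else falls back to scanning. `SegmentInfo` and `choose_percentile_plan` are hypothetical names, not Pinot classes.

```python
from dataclasses import dataclass


@dataclass
class SegmentInfo:
    has_quantile_index: bool  # was a sketch built for the queried column?
    total_docs: int           # number of rows in the segment


def choose_percentile_plan(segment: SegmentInfo, matching_docs: int) -> str:
    """Pick a per-segment execution path for a percentile aggregation."""
    # The stored sketch summarizes all rows, so it is only valid when the
    # query's filter (if any) selects the whole segment.
    if segment.has_quantile_index and matching_docs == segment.total_docs:
        return "USE_SEGMENT_SKETCH"
    return "SCAN_ROWS"


segment = SegmentInfo(has_quantile_index=True, total_docs=1000)
unfiltered_plan = choose_percentile_plan(segment, matching_docs=1000)
filtered_plan = choose_percentile_plan(segment, matching_docs=400)
```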
   
   **Supported Primitive Data Types for quantileIndex:**
   The quantileIndex can be used for numeric columns where percentile and 
quantile calculations are meaningful.
   
   - INT
   - LONG
   - FLOAT
   - DOUBLE
   - BIG_DECIMAL
   
   These data types represent ordered numeric values, for which percentiles 
are well defined. The KLL sketch (or a similar sketch) will be computed from 
these columns during segment creation, allowing fast approximate percentile 
queries at query time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

