cbb330 opened a new issue, #49360:
URL: https://github.com/apache/arrow/issues/49360

   ### Summary
   
   Part 1 of ORC predicate pushdown (#48986).
   
   Add a public API to `ORCFileReader` for accessing stripe-level and file-level column statistics as Arrow types. This is the foundation that the dataset layer will consume for predicate evaluation.
   
   ### Changes
   
   **New struct** in `adapter.h`:
   ```cpp
   struct OrcColumnStatistics {
       bool has_null;
       int64_t num_values;
       bool has_min_max;
       std::shared_ptr<Scalar> min;  // Arrow scalar
       std::shared_ptr<Scalar> max;  // Arrow scalar
   };
   ```
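
   To illustrate how a consumer might use these fields for predicate evaluation, here is a minimal sketch. It is not part of the proposed change: `IntColumnStats`, `MayMatchGreaterThan`, and `MayMatchEqual` are hypothetical names, and plain `int64_t` bounds stand in for the `std::shared_ptr<Scalar>` members so the example compiles without Arrow.

   ```cpp
   #include <cstdint>

   // Hypothetical stand-in for OrcColumnStatistics: int64_t bounds replace the
   // Arrow scalars so the sketch has no Arrow dependency.
   struct IntColumnStats {
     bool has_null;
     int64_t num_values;
     bool has_min_max;
     int64_t min;
     int64_t max;
   };

   // Returns true when the stripe may contain a row with value > threshold.
   // Without min/max nothing can be proven, so the stripe must be read.
   bool MayMatchGreaterThan(const IntColumnStats& stats, int64_t threshold) {
     if (!stats.has_min_max) return true;
     return stats.max > threshold;
   }

   // Returns true when the stripe may contain a row with value == target.
   bool MayMatchEqual(const IntColumnStats& stats, int64_t target) {
     if (!stats.has_min_max) return true;
     return stats.min <= target && target <= stats.max;
   }
   ```

   A stripe is skipped only when the helper returns false. Returning true whenever `has_min_max` is false keeps the evaluation conservative.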
   
   **New methods** on `ORCFileReader`:
   - `GetColumnStatistics(int column_index)`: file-level statistics
   - `GetStripeColumnStatistics(int64_t stripe_index, int column_index)`: stripe-level statistics
   
   **Internal helper** `ConvertColumnStatistics()` that downcasts liborc `ColumnStatistics` to typed subclasses and produces the appropriate Arrow scalar:
   
   | ORC Statistics Type | Arrow Scalar | Notes |
   |---------------------|--------------|-------|
   | IntegerColumnStatistics | Int64Scalar | Covers BYTE, SHORT, INT, LONG |
   | DoubleColumnStatistics | DoubleScalar | NaN guard: has_min_max=false if NaN |
   | StringColumnStatistics | StringScalar | |
   | BooleanColumnStatistics | (no min/max) | Populate num_values from true+false counts |
   | DateColumnStatistics | Date32Scalar | Days since epoch |
   | TimestampColumnStatistics | TimestampScalar (NANO) | millis * 1_000_000 + sub-millis nanos |
   | DecimalColumnStatistics | Decimal128Scalar | Scale consistency check |
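
   The timestamp arithmetic in the table can be sketched as follows. `TimestampStatToNanos` is a hypothetical name, and the overflow checks are an extra assumption that the issue does not mention. They are included because multiplying by 1,000,000 can exceed the `int64_t` range for extreme millisecond values.

   ```cpp
   #include <cstdint>
   #include <limits>
   #include <optional>

   // Combines a millisecond timestamp with its sub-millisecond nanoseconds as
   // millis * 1'000'000 + sub_milli_nanos. Returns std::nullopt when the
   // result would not fit in int64_t (assumed behavior, not from the issue).
   std::optional<int64_t> TimestampStatToNanos(int64_t millis,
                                               int32_t sub_milli_nanos) {
     constexpr int64_t kNanosPerMilli = 1000000;
     constexpr int64_t kMax = std::numeric_limits<int64_t>::max();
     constexpr int64_t kMin = std::numeric_limits<int64_t>::min();

     // Reject inputs whose product would overflow before multiplying.
     if (millis > kMax / kNanosPerMilli || millis < kMin / kNanosPerMilli) {
       return std::nullopt;
     }
     const int64_t nanos = millis * kNanosPerMilli;

     // Guard the final addition in both directions.
     if (sub_milli_nanos > 0 && nanos > kMax - sub_milli_nanos) return std::nullopt;
     if (sub_milli_nanos < 0 && nanos < kMin - sub_milli_nanos) return std::nullopt;
     return nanos + sub_milli_nanos;
   }
   ```

   For example, 1500 ms with 250000 sub-millisecond nanoseconds gives 1500250000 ns.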
   
   **Validation:**
   - Bounds checking on `column_index` and `stripe_index` before casting to `uint32_t`
   - NaN guard on double statistics
   - Decimal scale consistency (`min.scale` must equal `max.scale`)
   - Uses `ORC_BEGIN_CATCH_NOT_OK` / `ORC_END_CATCH_NOT_OK` for exception handling
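
   The first three checks could look roughly like the sketch below. All three function names are hypothetical. The real code would presumably report failures through Arrow's `Status`, but `std::optional` and `bool` are used here so the example stands alone.

   ```cpp
   #include <cmath>
   #include <cstdint>
   #include <limits>
   #include <optional>

   // Validates a signed index against an element count and only then narrows
   // it to uint32_t, so a negative or oversized index never reaches the cast.
   std::optional<uint32_t> CheckedIndex(int64_t index, uint64_t count) {
     if (index < 0) return std::nullopt;
     const auto unsigned_index = static_cast<uint64_t>(index);
     if (unsigned_index >= count) return std::nullopt;
     if (unsigned_index > std::numeric_limits<uint32_t>::max()) return std::nullopt;
     return static_cast<uint32_t>(unsigned_index);
   }

   // NaN guard: double bounds are usable only when neither of them is NaN.
   bool DoubleBoundsUsable(double min, double max) {
     return !std::isnan(min) && !std::isnan(max);
   }

   // Decimal guard: min and max must share one scale to be comparable.
   bool DecimalScalesConsistent(int32_t min_scale, int32_t max_scale) {
     return min_scale == max_scale;
   }
   ```

   Checking the range before narrowing matters because a negative `int64_t` cast directly to `uint32_t` wraps to a large value that could pass a later comparison.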
   
   ### Tests
   
   Unit tests in `adapter_test.cc`:
   - Integer column statistics (file-level and stripe-level)
   - String column statistics
   - Boolean column statistics (no min/max)
   - Date column statistics
   - Timestamp column statistics
   - Decimal column statistics
   - Out-of-range column/stripe index → error
   - Columns with nulls
   
   ### Component(s)
   
   C++

