mapleFU opened a new issue, #43382:
URL: https://github.com/apache/arrow/issues/43382

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ## The problem
   
   The min-max statistics may be dropped (not truncated) during write when they exceed a size limit, as in the code below:
   
   ```c++
       EncodedStatistics chunk_statistics = GetChunkStatistics();
       chunk_statistics.ApplyStatSizeLimits(
           properties_->max_statistics_size(descr_->path()));
       chunk_statistics.set_is_signed(SortOrder::SIGNED == descr_->sort_order());
   ```
   
   `ApplyStatSizeLimits` drops the min or the max, each independently, if it is larger than `properties_->max_statistics_size(descr_->path())`, which defaults to 4096 bytes:
   
   ```c++
     // From parquet-mr
     // Don't write stats larger than the max size rather than truncating. The
     // rationale is that some engines may use the minimum value in the page as
     // the true minimum for aggregations and there is no way to mark that a
     // value has been truncated and is a lower bound and not in the page.
     void ApplyStatSizeLimits(size_t length) {
       if (max_.length() > length) {
         has_max = false;
         max_.clear();
       }
       if (min_.length() > length) {
         has_min = false;
         min_.clear();
       }
     }
   ```
   
   This code is correct.
   
   But on the read side, where these statistics are consumed, the code is:
   
   ```c++
   template <typename DType>
   static std::shared_ptr<Statistics> MakeTypedColumnStats(
       const format::ColumnMetaData& metadata, const ColumnDescriptor* descr) {
     // If ColumnOrder is defined, return max_value and min_value
     if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER) {
       return MakeStatistics<DType>(
           descr, metadata.statistics.min_value, metadata.statistics.max_value,
           metadata.num_values - metadata.statistics.null_count,
           metadata.statistics.null_count, metadata.statistics.distinct_count,
           metadata.statistics.__isset.max_value || metadata.statistics.__isset.min_value,
           metadata.statistics.__isset.null_count,
           metadata.statistics.__isset.distinct_count);
     }
     // Default behavior
     return MakeStatistics<DType>(
         descr, metadata.statistics.min, metadata.statistics.max,
         metadata.num_values - metadata.statistics.null_count,
         metadata.statistics.null_count, metadata.statistics.distinct_count,
         metadata.statistics.__isset.max || metadata.statistics.__isset.min,
         metadata.statistics.__isset.null_count, metadata.statistics.__isset.distinct_count);
   }
   ```
   
   The problem is that `||` is used to decide whether min-max statistics exist, so the loaded statistics carry only a single `has_min_max` state, which is true when either value is set.
   
   For example, suppose a column chunk has these statistics:
   
   ```
   min: ""
   max: "..." <-- a 10000-byte string
   ```
   
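   The stored statistics are `has_min: true, min: "", has_max: false`, but the loaded statistics are `has_min_max: true, min="", max=""`. This is a bug: the empty `max` is a placeholder for a dropped value, yet it is reported as a real maximum.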
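   
   The round trip above can be sketched in isolation. This is a minimal illustration, not the Arrow code: `EncodedStats`, `LoadedStats`, `LoadWithOr`, and `RoundTripExample` are hypothetical stand-ins that only mirror the drop-on-write policy and the `||` on read.
   
   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <string>

   // Hypothetical stand-in for the writer-side statistics (not the Arrow class).
   struct EncodedStats {
     std::string min, max;
     bool has_min = false;
     bool has_max = false;

     // Same policy as the quoted ApplyStatSizeLimits: drop an oversized
     // value instead of truncating it.
     void ApplyStatSizeLimits(std::size_t length) {
       if (max.length() > length) { has_max = false; max.clear(); }
       if (min.length() > length) { has_min = false; min.clear(); }
     }
   };

   // Hypothetical stand-in for the reader-side statistics with a single flag.
   struct LoadedStats {
     std::string min, max;
     bool has_min_max = false;
   };

   // Mirrors the `||` used in MakeTypedColumnStats.
   LoadedStats LoadWithOr(const EncodedStats& s) {
     LoadedStats out;
     out.min = s.min;
     out.max = s.max;
     out.has_min_max = s.has_min || s.has_max;
     return out;
   }

   // Walks through the example: min = "", max = 10000-byte string, limit = 4096.
   LoadedStats RoundTripExample() {
     EncodedStats written;
     written.min = "";
     written.has_min = true;
     written.max = std::string(10000, 'x');
     written.has_max = true;
     written.ApplyStatSizeLimits(4096);
     assert(written.has_min && !written.has_max);  // only max was dropped
     return LoadWithOr(written);
   }
   ```
   
   In this sketch the value returned by `RoundTripExample()` has the flag set while `max` is empty, which is the state described above.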
   
   ## Possible solutions
   
   This happens because `HasMinMax` currently means "has min or max". There are two possible fixes:
   
   1. Change `MakeTypedColumnStats` to use `&&` rather than `||`.
   2. Add a new API, `HasMinAndMax`, and use it for pruning.
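   
   A rough sketch of why pruning needs "min and max" rather than "min or max". Everything here is hypothetical (`ColumnStats`, `CanSkipRange`, `CanSkipRangeUnsafe`, `PruningExample`), and values are compared as plain `std::string`s, which is a simplification of the real type-dependent ordering.
   
   ```cpp
   #include <cassert>
   #include <string>

   // Hypothetical reader-side statistics that track each bound separately.
   struct ColumnStats {
     std::string min, max;
     bool has_min = false;
     bool has_max = false;

     bool HasMinOrMax() const { return has_min || has_max; }   // what `||` gives
     bool HasMinAndMax() const { return has_min && has_max; }  // proposed check
   };

   // Can a chunk be skipped for the predicate lo <= col <= hi?
   // Only trusts the bounds when both are present.
   bool CanSkipRange(const ColumnStats& s, const std::string& lo,
                     const std::string& hi) {
     if (!s.HasMinAndMax()) return false;  // a bound is missing: must read
     return s.max < lo || s.min > hi;
   }

   // Same check guarded by "min or max": a missing max is read as "".
   bool CanSkipRangeUnsafe(const ColumnStats& s, const std::string& lo,
                           const std::string& hi) {
     if (!s.HasMinOrMax()) return false;
     return s.max < lo || s.min > hi;
   }

   // Statistics as loaded in the example: min = "", max dropped on write.
   void PruningExample() {
     ColumnStats s;
     s.min = "";
     s.has_min = true;  // has_max stays false, max stays ""
     assert(CanSkipRangeUnsafe(s, "a", "b"));  // wrongly skipped: "" < "a"
     assert(!CanSkipRange(s, "a", "b"));       // kept: max is unknown
   }
   ```
   
   With the "or" guard, the chunk whose real maximum was a 10000-byte string is skipped for the range `["a", "b"]`; with the "and" guard it is read.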
   
   ### Component(s)
   
   C++, Parquet

