mapleFU opened a new issue, #43382: URL: https://github.com/apache/arrow/issues/43382
### Describe the bug, including details regarding any error messages, version, and platform.

## The problem

Min-max statistics may be dropped during write, as in the code below:

```c++
EncodedStatistics chunk_statistics = GetChunkStatistics();
chunk_statistics.ApplyStatSizeLimits(
    properties_->max_statistics_size(descr_->path()));
chunk_statistics.set_is_signed(SortOrder::SIGNED == descr_->sort_order());
```

`ApplyStatSizeLimits` drops the min or max value (it does not truncate it) when its length is greater than `properties_->max_statistics_size(descr_->path())`, which defaults to 4096 bytes:

```c++
// From parquet-mr
// Don't write stats larger than the max size rather than truncating. The
// rationale is that some engines may use the minimum value in the page as
// the true minimum for aggregations and there is no way to mark that a
// value has been truncated and is a lower bound and not in the page.
void ApplyStatSizeLimits(size_t length) {
  if (max_.length() > length) {
    has_max = false;
    max_.clear();
  }
  if (min_.length() > length) {
    has_min = false;
    min_.clear();
  }
}
```

This code is correct.
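To make the write-side behavior concrete, here is a minimal standalone sketch. It does not use the Arrow headers: `SketchEncodedStats` and `WriteSideExample` are hypothetical names that only mirror the `ApplyStatSizeLimits` logic quoted above, and the 10000-byte value is chosen purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for the encoded statistics object; only the fields
// relevant to the size limit are modeled.
struct SketchEncodedStats {
  std::string min;
  std::string max;
  bool has_min = false;
  bool has_max = false;

  void SetMin(std::string v) {
    min = std::move(v);
    has_min = true;
  }
  void SetMax(std::string v) {
    max = std::move(v);
    has_max = true;
  }

  // Same logic as the ApplyStatSizeLimits quoted above: an oversized value
  // is dropped entirely, not truncated.
  void ApplyStatSizeLimits(std::size_t length) {
    if (max.length() > length) {
      has_max = false;
      max.clear();
    }
    if (min.length() > length) {
      has_min = false;
      min.clear();
    }
  }
};

// Builds stats with an empty min and a 10000-byte max, then applies the limit
// (4096 bytes unless the caller passes something else).
inline SketchEncodedStats WriteSideExample(std::size_t limit = 4096) {
  SketchEncodedStats stats;
  stats.SetMin("");
  stats.SetMax(std::string(10000, 'x'));
  // Both bounds are present before the limit is applied.
  assert(stats.has_min && stats.has_max);
  stats.ApplyStatSizeLimits(limit);
  return stats;
}
```

After `WriteSideExample()` returns, `has_min` is still true while `has_max` is false, so the written statistics carry only one of the two bounds.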
However, the code that consumes these statistics on the read side is:

```c++
template <typename DType>
static std::shared_ptr<Statistics> MakeTypedColumnStats(
    const format::ColumnMetaData& metadata, const ColumnDescriptor* descr) {
  // If ColumnOrder is defined, return max_value and min_value
  if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER) {
    return MakeStatistics<DType>(
        descr, metadata.statistics.min_value, metadata.statistics.max_value,
        metadata.num_values - metadata.statistics.null_count,
        metadata.statistics.null_count, metadata.statistics.distinct_count,
        metadata.statistics.__isset.max_value || metadata.statistics.__isset.min_value,
        metadata.statistics.__isset.null_count,
        metadata.statistics.__isset.distinct_count);
  }
  // Default behavior
  return MakeStatistics<DType>(
      descr, metadata.statistics.min, metadata.statistics.max,
      metadata.num_values - metadata.statistics.null_count,
      metadata.statistics.null_count, metadata.statistics.distinct_count,
      metadata.statistics.__isset.max || metadata.statistics.__isset.min,
      metadata.statistics.__isset.null_count,
      metadata.statistics.__isset.distinct_count);
}
```

The problem is that `||` is used to decide whether min-max statistics exist, and the resulting object has only a single `has_min_max_state`. For example, suppose the statistics are:

```
min: ""
max: "..."  <-- a 10000-byte string
```

The stored statistics are `has_min: true, min: "", has_max: false`, but the loaded statistics are `has_min_max: true, min="", max=""`. The dropped max is reported as an empty string, which is a bug.

## Solving

This happens because `HasMinMax` currently means "has min or max". Possible solutions:

1. Change `MakeTypedColumnStats` to use `&&` rather than `||`
2. Propose a new API, `HasMinAndMax`, and use it for pruning

### Component(s)

C++, Parquet