This is an automated email from the ASF dual-hosted git repository.

paleolimbot pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 8d44eea71a GH-46270: [C++][Parquet] Clarify GeoStatistics docstring (#46649)
8d44eea71a is described below

commit 8d44eea71a49b42764cc8f07af6d26dce3546489
Author: Dewey Dunnington <[email protected]>
AuthorDate: Mon Jun 2 11:09:13 2025 -0500

    GH-46270: [C++][Parquet] Clarify GeoStatistics docstring (#46649)
    
    ### Rationale for this change
    
    The distinction between "invalid" and "empty" is not clear in the current documentation!
    
    ### What changes are included in this PR?
    
    The docstring for GeoStatistics was improved.
    
    ### Are these changes tested?
    
    Just documentation!
    
    ### Are there any user-facing changes?
    
    No
    * GitHub Issue: #46270
    
    Authored-by: Dewey Dunnington <[email protected]>
    Signed-off-by: Dewey Dunnington <[email protected]>
---
 cpp/src/parquet/geospatial/statistics.h | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/cpp/src/parquet/geospatial/statistics.h b/cpp/src/parquet/geospatial/statistics.h
index 62f667e986..fb4c31af1b 100644
--- a/cpp/src/parquet/geospatial/statistics.h
+++ b/cpp/src/parquet/geospatial/statistics.h
@@ -63,18 +63,22 @@ class GeoStatisticsImpl;
 /// \brief Base type for computing geospatial column statistics while writing a file
 /// or representing them when reading a file
 ///
-/// Note that NaN values that were encountered within coordinates are omitted; however,
-/// NaN values that were obtained via decoding encoded statistics are propagated. This
-/// behaviour ensures C++ clients that are inspecting statistics via the column metadata
-/// can detect the case where a writer generated NaNs (even though this implementation
-/// does not generate them).
-///
-/// The handling of NaN values in coordinates is not well-defined among bounding
-/// implementations except for the WKB convention for POINT EMPTY, which is consistently
-/// represented as a point whose ordinates are all NaN. Any other geometry that contains
-/// NaNs cannot expect defined behaviour here or elsewhere; however, a row group that
-/// contains both NaN-containing and normal (completely finite) geometries should not be
-/// excluded from predicate pushdown.
+/// These statistics track the minimum and maximum value (omitting NaN values) of the
+/// four possible dimensions (X, Y, Z, and M) and the distinct set of geometry
+/// type/dimension combinations (e.g., point XY, linestring XYZM) present in the data.
+/// Any of these individual components may be "invalid": for example, when reading a
+/// Parquet file, information about individual components obtained from the column
+/// chunk metadata may have been missing or deemed unusable. Orthogonally,
+/// any of these individual components may be "empty": for example, when using
+/// GeoStatistics to accumulate bounds whilst writing, if all geometries in a column chunk
+/// are null, all ranges (X, Y, Z, and M) will be empty. If all geometries in a column
+/// chunk contain only XY coordinates (the most common case), the Z and M ranges will
+/// be empty but the X and Y ranges will contain finite bounds. Empty ranges are
+/// considered "valid" because they are known to represent exactly zero values (in
+/// contrast to an invalid range, whose contents is completely unknown). These concepts
+/// are all necessary for this object to accurately represent (1) accumulated or partially
+/// accumulated statistics during the writing process and (2) deserialized statistics read
+/// from the column chunk metadata during the reading process.
 ///
 /// EXPERIMENTAL
 class PARQUET_EXPORT GeoStatistics {
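
The "invalid" versus "empty" distinction described in the new docstring can be illustrated with a small standalone sketch. The `DimensionRange` struct and its members below are hypothetical names invented for illustration; they are not the Arrow `GeoStatistics` API, and the actual implementation may represent these states differently. The sketch assumes an empty range is stored as min = +infinity, max = -infinity.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical stand-in for the range of one dimension (X, Y, Z, or M).
// Not the Arrow API; names and representation are assumptions.
struct DimensionRange {
  // false: contents are unknown (e.g., metadata was missing or unusable)
  bool valid = true;
  // Assumed encoding of "no values seen yet": min > max
  double min = std::numeric_limits<double>::infinity();
  double max = -std::numeric_limits<double>::infinity();

  // Empty ranges are valid: they are known to represent exactly zero values.
  bool empty() const { return valid && min > max; }

  // Accumulate one ordinate while writing, omitting NaN values.
  void Update(double v) {
    if (std::isnan(v)) return;
    min = std::min(min, v);
    max = std::max(max, v);
  }

  // Conservative check a reader might use: could this chunk hold a value
  // in the closed interval [lo, hi]?
  bool MayIntersect(double lo, double hi) const {
    if (!valid) return true;    // unknown contents: cannot rule anything out
    if (empty()) return false;  // known to hold no values in this dimension
    return min <= hi && max >= lo;
  }
};
```

Under these assumptions, a column chunk containing only XY geometries would leave the Z and M ranges empty (and still valid), whereas a range rebuilt from unusable metadata would be marked invalid and could never be used to exclude data.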
