Hi all,

My apologies for the delay in getting around to this. Unfortunately, most
of the time open-source contributions can't be my top priority so they're
more of an "as I have time" situation (bummer). However, I did end up
getting a bit of breathing room and was able to implement a POC (at POC
quality, not PR quality) for writing geospatial statistics at the page
level. Feel free to look/comment/review or run the implementation here:
https://github.com/BlakeOrth/arrow-rs/pull/1

As the description notes, comparing things like file size between the POC
and the existing implementation is as simple as switching branches back to
`main` and running the exact same commands. Since size overhead was an
initial point of concern for this effort, I took the liberty of getting some
numbers for that up front. I've used the same test fixture that was used
for my previously noted query benchmarks to keep things consistent. This is
also the same file that's referenced in the PR linked above. I first
rewrote the test fixture using `arrow-rs/parquet` just to ensure
consistency in the comparison between parquet writer settings (page size,
compression level etc.).

re-written test fixture on `main`: 1138395788 bytes
re-written test fixture with POC page stats: 1138457540 bytes

This means the addition of page level stats for this file results in a file
that is 61752 bytes larger, an increase of 0.0054%.
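
In case anyone wants to sanity-check that figure, the arithmetic is just the following (a trivial sketch with the byte counts copied from the measurements above; nothing arrow-rs specific is involved):

```rust
fn main() {
    // Measured sizes of the rewritten test fixture, in bytes.
    let on_main: u64 = 1_138_395_788; // written on `main`
    let with_page_stats: u64 = 1_138_457_540; // written with POC page stats

    let overhead = with_page_stats - on_main;
    let pct = overhead as f64 / on_main as f64 * 100.0;
    println!("overhead: {overhead} bytes ({pct:.4}%)");
    // prints: overhead: 61752 bytes (0.0054%)
}
```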

I'm looking forward to hearing everyone's thoughts on this. I believe there
was also a request to have some additional benchmarks that are a bit easier
to run, which I can start looking into.

-Blake

On Fri, Mar 6, 2026 at 8:00 PM Dewey Dunnington <[email protected]>
wrote:

> Hi Blake et al.,
>
> Thank you for driving this and apologies for arriving late to this
> thread. I don't have much to add on top of what you've already written
> here...I am excited that we have gotten to a point with Parquet and
> geometry/geography that we're seeing interest in optimizing at this
> level. I don't think anybody who has worked with page statistics would
> be surprised that small range queries can be made much more performant
> when page statistics are available. Small range queries for maps
> (e.g., zooming around an interactive map!) are currently better served
> by other formats that can embed a row-level spatial index and I think
> this proposal will expand the utility of Parquet to make it
> competitive in those situations as well.
>
> In addition to the speed and potential reduction in IO for small range
> queries, I think this will be a useful change because it helps unify
> the spatial handling with other pruning code that operates on
> page-level statistics.
>
> I'm happy to implement this in C++ when the time comes or do a wider
> set of experiments if the community has specific concerns!
>
> Cheers,
>
> -dewey
>
> On Sun, Mar 1, 2026 at 6:04 AM Andrew Lamb <[email protected]> wrote:
> >
> > > Andrew, regarding your recommendation on how to drive this forward,
> > > which I believe I should have time to do: is the request to effectively
> > > modify a parquet library (I'd use arrow-rs) to actually write and read
> > > spatial statistics in the page index? I don't expect readers here to
> > > have reviewed the code I've already provided, but I'm trying to
> > > understand how your suggestion differs from the benchmarks I've already
> > > explored.
> >
> > Yes
> >
> > I suggest modifying an existing system with the proposed changes so that
> > 1. It is 100% clear what is proposed (we can review the code)
> > 2. We don't have to speculate about the potential costs / benefits (we
> >    can instead measure them directly with the proposal)
> >
> >
> > On Thu, Feb 26, 2026 at 4:33 PM Blake Orth <[email protected]> wrote:
> >
> > > Thanks for the engagement, everyone. I'm glad there's generally some
> > > interest in exploring this idea.
> > >
> > > Arnav, to address your questions I'll actually address the 2nd one
> > > first, as it adds some overall context to the discussion.
> > > 2. Dataset:
> > > As you noted, the test fixture was derived from Overture Maps Foundation
> > > GeoParquet data. It is a single file selected from the building
> > > footprints partition. This means it represents a collection of 2D
> > > polygons and their associated properties. The processing dropped the
> > > "GeoParquet" specific data (key-value metadata and dedicated bbox
> > > "covering columns") and wrote a new parquet file containing the Parquet
> > > geometry metadata. All the processing was done using GDAL with its
> > > default settings. Due to GDAL's overwhelming prevalence in processing
> > > spatial data, we believe this is a pretty good representation of what we
> > > expect to see in real world use cases. The output file maintains the row
> > > ordering of the input. This is somewhat important to note because
> > > Overture data is internally partitioned, placing co-located geometries
> > > into the same Row Group (I believe based on their level 3 GeoHash);
> > > however, they are not "sorted" in a traditional sense. GDAL defaults to
> > > blindly truncating Row Groups to 65536 rows. All this is to say that
> > > while the test fixture is generally "well formed" spatially, it doesn't
> > > represent a solution optimized for page-level pruning.
> > >
> > > 1. Storage overhead/compression ratio:
> > > I don't have specific measurements now, but I can provide exact numbers
> > > for this case if needed. I noted that the page statistics are
> > > "simulated" because I don't actually have a prototype implementation to
> > > write them to the file. This initial effort just collects the data that
> > > would exist in a page index (covering bbox) for each page and stores it
> > > for use during a scan. For discussion's sake, we can do some quick
> > > "napkin math" in the meantime. The current spatial statistics bbox is 4
> > > required doubles and 4 optional doubles. Unless I'm mistaken, this
> > > should yield 32 to 64 bytes per page in the metadata. This test fixture
> > > has 2D polygons, so it will use the 32 byte bbox. Approximately 1,800
> > > pages would result in 57,600 bytes. For a file that's about 1.1GB,
> > > adding the bbox to the page statistics would increase the file size by
> > > about 0.005%. The "geometry" column accounts for the bulk of the file's
> > > compressed size. Row Group statistics suggest that the geometry column's
> > > compressed size is generally between 8MB and 9MB. With 104 Row Groups in
> > > the file, it's probably safe to assume about 850MB of compressed
> > > geometry data. Again, if we want more exact numbers, let me know and I
> > > can provide them in a follow-up.
> > >
> > > I think Andrew's points are important here: the writer primarily drives
> > > the effectiveness of page-level statistics, both in terms of compression
> > > ratio and pruning potential. I don't feel like this is a
> > > geospatial-specific statement either, though it's probably more
> > > applicable to any column represented as Binary than to primitive types.
> > >
> > > Andrew, regarding your recommendation on how to drive this forward,
> > > which I believe I should have time to do: is the request to effectively
> > > modify a parquet library (I'd use arrow-rs) to actually write and read
> > > spatial statistics in the page index? I don't expect readers here to
> > > have reviewed the code I've already provided, but I'm trying to
> > > understand how your suggestion differs from the benchmarks I've already
> > > explored.
> > >
> > > -Blake
> > >
> > >
> > > On Thu, Feb 26, 2026 at 4:49 AM Andrew Lamb <[email protected]>
> > > wrote:
> > >
> > > > Personally I think having page level statistics is a good idea, and I
> > > > am not sure we need to do a lot more empirical evaluation before doing
> > > > a POC.
> > > >
> > > > I think the overhead of page-level statistics will depend almost
> > > > entirely on how you configure the parquet writer. For example, if you
> > > > configure the writer for pages with 10,000 GEO points, the overhead of
> > > > statistics will be much lower than if you configure the writer with
> > > > pages that have 100 points.
> > > >
> > > > The performance benefit of having such indexes will depend heavily on
> > > > how the data is distributed among the pages and what the query
> > > > predicate is, which will determine how effective the pruning is.
> > > >
> > > > I am not surprised there are reasonable use cases where page level
> > > > statistics make a substantial difference (which is what Blake appears
> > > > to have shown with the benchmark).
> > > >
> > > > My personal suggestion for anyone who is interested in driving this
> > > > forward is:
> > > > 1. Create a proof of concept (add the relevant statistics to the
> > > >    PageIndex)
> > > > 2. Demonstrate real world performance gains in a plausible benchmark
> > > >
> > > > Ideally the proof of concept would be enough to run the other
> > > > experiments suggested by Gang and Arnav.
> > > >
> > > > Andrew
> > > >
> > > >
> > > > On Thu, Feb 26, 2026 at 4:37 AM Gang Wu <[email protected]> wrote:
> > > >
> > > > > Thanks Blake for bringing this up!
> > > > >
> > > > > When we were adding the geospatial logical types, page-level geo
> > > > > stats were deliberately not added, to avoid storage bloat before
> > > > > real world use cases appeared.
> > > > >
> > > > > I agree with Arnav that we may need more concrete data to justify
> > > > > it.
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > On Thu, Feb 26, 2026 at 5:03 PM Arnav Balyan <
> [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi Blake,
> > > > > >
> > > > > > Thanks for sharing the benchmarks; the results look quite
> > > > > > compelling. Page-level pruning seems like a promising direction. I
> > > > > > had a couple of questions:
> > > > > >
> > > > > > 1. Storage overhead/compression ratio:
> > > > > > Do you have measurements on the storage overhead added by
> > > > > > page-level stats in this benchmark? In particular, for the 1800
> > > > > > pages in the geometry column:
> > > > > >
> > > > > >    - What was the approximate per-page metadata size?
> > > > > >    - Did you observe any impact on compression ratio/file size
> > > > > >      compared to the baseline?
> > > > > >
> > > > > > 2. Dataset:
> > > > > > Could you share more details on how the test fixture was derived?
> > > > > > It appears to be based on an Overture dataset; it would be helpful
> > > > > > to understand:
> > > > > >
> > > > > >    - What themes the data was drawn from (buildings, places,
> > > > > >      transportation, etc.)
> > > > > >    - Does this represent a specific region, and what mix of
> > > > > >      geometry types is present?
> > > > > >
> > > > > > Additionally, have you considered evaluating this on other real
> > > > > > world datasets like OpenStreetMap to understand how the performance
> > > > > > varies with different data/spatial characteristics?
> > > > > >
> > > > > > Thanks,
> > > > > > Arnav
> > > > > >
> > > > > >
> > > > > > On Tue, Feb 24, 2026 at 6:43 AM Blake Orth <[email protected]>
> wrote:
> > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > > I would like to start a discussion on allowing Page level
> > > > > > > statistics for the new GEO column types.
> > > > > > >
> > > > > > > If I understand correctly, the discussion during the
> > > > > > > formalization of GEO types initially included page-level
> > > > > > > statistics. However, the decision was made to only allow Row
> > > > > > > Group level statistics because there was no compelling evidence
> > > > > > > that Page statistics would meaningfully impact query performance
> > > > > > > enough to offset their potential impact on file size. Some
> > > > > > > discussions with other members of the GeoParquet community
> > > > > > > prompted me to build some benchmarks exploring the effect a
> > > > > > > specialized spatial index could have on query performance. The
> > > > > > > benchmarks explored the differences between three cases: a base
> > > > > > > Parquet file that allows Row Group level pruning, a case
> > > > > > > simulating Page level pruning using simple flat statistics (like
> > > > > > > the standard Parquet page stats structures), and finally a case
> > > > > > > simulating Page level pruning using a specialized spatial index.
> > > > > > > The simple flat statistics performed nearly identically to the
> > > > > > > spatial index, and allowing page-level pruning improved query
> > > > > > > performance by almost 2x over the base Row Group pruning.
> > > > > > > Considering these results, we felt that pursuing a specialized
> > > > > > > index specifically for GeoParquet is likely unnecessary. Allowing
> > > > > > > page-level statistics for GEO columns shows meaningful query
> > > > > > > performance gains.
> > > > > > >
> > > > > > > The source to reproduce the benchmarks can be found here, with
> > > > > > > some simple instructions in the README on how to obtain the test
> > > > > > > fixture and run the benchmarks:
> > > > > > > https://github.com/BlakeOrth/geodatafusion/tree/feature/benchmarks
> > > > > > >
> > > > > > > The benchmarks leverage a modified version of GeoDatafusion to
> > > > > > > compute a relatively selective geometry intersection query,
> > > > > > > filtering approximately 3,000 geometries from a test fixture
> > > > > > > containing over 10,000,000 rows. The file itself is approximately
> > > > > > > 1.1GB in size and has just over 1800 pages in its geometry
> > > > > > > column. In this case, allowing statistics on those pages should
> > > > > > > represent minimal file size overhead.
> > > > > > >
> > > > > > > If anyone has any additional requests for benchmarks, or for
> > > > > > > more information on the benchmarks provided, please let me know!
> > > > > > >
> > > > > > > Thanks,
> > > > > > > -Blake Orth
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>
