Thank you, Blake, for putting this together! I left some minor comments on the PR [1]...I think the PR nicely demonstrates that this change is largely self-contained and doesn't affect non-geometry/geography code paths. I believe this should also not affect the metadata sizes of files that don't contain geometry columns, but I didn't check.
Two things I will try to do in the next few weeks to help move this forward are (1) put together a similar POC in C++ and (2) use your PR to rewrite the geoarrow-data example files [2], which might give a wider corpus of input on which to assert that this really does help with the queries. Cheers, -dewey [1] https://github.com/BlakeOrth/arrow-rs/pull/1 [2] https://geoarrow.org/data On Wed, Apr 1, 2026 at 4:08 PM Blake Orth <[email protected]> wrote: > > Hi all, > > My apologies for the delay in getting around to this. Unfortunately, most > of the time open-source contributions can't be my top priority so they're > more of an "as I have time" situation (bummer). However, I did end up > getting a bit of breathing room and was able to implement a POC (at POC > quality, not PR quality) for writing geospatial statistics at the page > level. Feel free to look/comment/review or run the implementation here: > https://github.com/BlakeOrth/arrow-rs/pull/1 > > As the description notes, comparing things like file size between the POC > and the existing implementation is as simple as switching branches back to > `main` and running the exact same commands. Since size overhead was an > initial point of concern for this effort, I took the liberty to get some > numbers for that up front. I've used the same test fixture that was used > for my previously noted query benchmarks to keep things consistent. This is > also the same file that's referenced in the PR linked above. I first > rewrote the test fixture using `arrow-rs/parquet` just to ensure > consistency in the comparison between parquet writer settings (page size, > compression level, etc.). > > re-written test fixture on `main`: 1138395788 bytes > re-written test fixture with POC page stats: 1138457540 bytes > > This means the addition of page level stats for this file results in a file > that is 61752 bytes larger, an increase of 0.0054%. > > I'm looking forward to hearing everyone's thoughts on this.
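The overhead figure quoted above is easy to sanity-check: it is just the byte delta over the baseline size. A minimal standalone sketch of the arithmetic (plain Rust, not code from the PR):

```rust
// Sanity-check of the reported page-stats overhead: the two byte counts
// are the rewritten test fixture on `main` vs. the POC branch, as quoted
// in the message above.
fn overhead_percent(baseline: u64, with_stats: u64) -> f64 {
    (with_stats - baseline) as f64 / baseline as f64 * 100.0
}

fn main() {
    let baseline: u64 = 1_138_395_788;
    let with_stats: u64 = 1_138_457_540;
    // Delta is 61,752 bytes, i.e. roughly 0.0054% larger.
    assert_eq!(with_stats - baseline, 61_752);
    println!("{:.4}%", overhead_percent(baseline, with_stats));
}
```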
I believe there > was also a request to have some additional benchmarks that are a bit easier > to run, which I can start looking into. > > -Blake > > On Fri, Mar 6, 2026 at 8:00 PM Dewey Dunnington <[email protected]> > wrote: > > > Hi Blake et al., > > > > Thank you for driving this and apologies for arriving late to this > > thread. I don't have much to add on top of what you've already written > > here...I am excited that we have gotten to a point with Parquet and > > geometry/geography that we're seeing interest in optimizing at this > > level. I don't think anybody who has worked with page statistics would > > be surprised that small range queries can be made much more performant > > when page statistics are available. Small range queries for maps > > (e.g., zooming around an interactive map!) are currently better served > > by other formats that can embed a row-level spatial index and I think > > this proposal will expand the utility of Parquet to make it > > competitive in those situations as well. > > > > In addition to the speed and potential reduction in IO for small range > > queries, I think this will be a useful change because it helps unify > > the spatial handling with other pruning code that operates on > > page-level statistics. > > > > I'm happy to implement this in C++ when the time comes or do a wider > > set of experiments if the community has specific concerns! > > > > Cheers, > > > > -dewey > > > > On Sun, Mar 1, 2026 at 6:04 AM Andrew Lamb <[email protected]> wrote: > > > > > > > Andrew, regarding your recommendation on how to drive this forward, > > which > > > I > > > > believe I should have time to do. Is the request to effectively modify > > a > > > > parquet library (I'd use arrow-rs) to actually write and read spatial > > > > statistics in the page index?
I don't expect readers here to have > > reviewed > > > > the code I've already provided, but I'm trying to understand how your > > > > suggestion differs from the benchmarks I've already explored. > > > > > > Yes > > > > > > I suggest modifying an existing system with the proposed changes so that > > > 1. It is 100% clear what is proposed (we can review the code) > > > 2. We don't have to speculate about the potential costs / benefits (we > > can > > > instead measure them directly with the proposal) > > > > > > > > > > > > > > > On Thu, Feb 26, 2026 at 4:33 PM Blake Orth <[email protected]> wrote: > > > > > > > Thanks for the engagement, everyone. I'm glad there's generally some > > > > interest in exploring this idea. > > > > > > > > Arnav, to address your questions I'll actually address the 2nd one > > first as > > > > it adds some overall context to the discussion. > > > > 2. Dataset > > > > As you noted, the test fixture was derived from Overture Maps > > Foundation > > > > GeoParquet data. It is a single file selected from the building > > footprints > > > > partition. This means it represents a collection of 2D polygons and > > their > > > > associated properties. The processing dropped the "GeoParquet" specific > > > > data (key-value metadata and dedicated bbox "covering columns") and > > wrote a > > > > new parquet file containing the Parquet geometry metadata. All the > > > > processing was done using GDAL with its default settings. Due to GDAL's > > > > overwhelming prevalence in processing spatial data, we believe this is a > > > > pretty good representation of what we expect to see in real world use > > > > cases. The output file maintains the row ordering of the input. This is > > > > somewhat important to note because Overture data is internally > > partitioned, > > > > placing co-located geometries into the same Row Group (I believe based > > on > > > > their level 3 GeoHash); however, they are not "sorted" in a traditional > > > > sense.
GDAL defaults to blindly truncating Row Groups to 65536 rows. > > All > > > > this is to say that while the test fixture is generally "well formed" > > > > spatially, it doesn't represent a solution optimized for page-level > > > > pruning. > > > > > > > > 1. Storage overhead/compression ratio: > > > > I don't have specific measurements now, but I can provide exact > > numbers for > > > > this case if needed. I noted that the page statistics are "simulated" > > > > because I don't actually have a prototype implementation to write them > > to > > > > the file. This initial effort just collects the data that would exist > > in a > > > > page index (covering bbox) for each page and stores it for use during a > > > > scan. For discussion's sake, we can do some quick "napkin math" in the > > meantime. The current spatial statistics bbox is 4 required doubles and 4 > > > > optional doubles. Unless I'm mistaken, this should yield 32 to 64 > > bytes per > > > > page in the metadata. This test fixture has 2D polygons, so it will > > use the > > > > 32 byte bbox. Approximately 1,800 pages would result in 57,600 bytes. > > For a > > > > file that's about 1.1GB, adding the bbox to the page > > > > statistics would increase the file size by about 0.005%. The > > "geometry" > > > > column accounts for the bulk of the file's compressed size. Row Group > > > > statistics suggest that the geometry column's compressed size is > > generally > > > > between 8MB and 9MB. With 104 Row Groups in the file it's probably > > safe to > > > > assume about 850MB of compressed geometry data. Again, if we want more > > > > exact numbers, let me know and I can provide them in a follow-up. > > > > > > > > I think Andrew's points are important here: the writer primarily > > drives the > > > > effectiveness of page-level statistics, both in terms of compression > > ratio > > > > and pruning potential.
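The napkin math quoted above (a 2D bbox of 4 required doubles per page, 4 more optional doubles when Z/M ranges are present, about 1,800 pages in a ~1.1 GB file) can be written out explicitly. This is a standalone illustrative sketch of the arithmetic only, not any actual Parquet metadata layout:

```rust
// Per-page bbox size from the napkin math: 4 required f64s
// (xmin/xmax/ymin/ymax), plus 4 optional f64s for Z and M ranges.
const DOUBLE_BYTES: usize = std::mem::size_of::<f64>(); // 8 bytes

fn bbox_bytes(has_z_and_m: bool) -> usize {
    let doubles = if has_z_and_m { 8 } else { 4 };
    doubles * DOUBLE_BYTES
}

fn main() {
    assert_eq!(bbox_bytes(false), 32); // 2D polygons -> 32 bytes per page
    assert_eq!(bbox_bytes(true), 64); // with Z and M ranges -> 64 bytes

    let pages = 1_800_usize;
    let total_bytes = pages * bbox_bytes(false); // 57,600 bytes of bboxes
    let file_bytes = 1.1e9_f64; // ~1.1 GB test fixture

    // Works out to roughly 0.005% of the file size.
    println!(
        "{} bytes, {:.4}% of the file",
        total_bytes,
        total_bytes as f64 / file_bytes * 100.0
    );
}
```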
I don't feel like this is a geospatial-specific > > > > statement either, though it's probably more applicable to any column > > > > represented as Binary than to primitive types. > > > > > > > > Andrew, regarding your recommendation on how to drive this forward, > > which I > > > > believe I should have time to do. Is the request to effectively modify > > a > > > > parquet library (I'd use arrow-rs) to actually write and read spatial > > > > statistics in the page index? I don't expect readers here to have > > reviewed > > > > the code I've already provided, but I'm trying to understand how your > > > > suggestion differs from the benchmarks I've already explored. > > > > > > > > -Blake > > > > > > > > > > > > > > > > > > > > On Thu, Feb 26, 2026 at 4:49 AM Andrew Lamb <[email protected]> > > > > wrote: > > > > > > > > > Personally I think having page level statistics is a good idea and I > > am > > > > not > > > > > sure we need to do a lot more empirical evaluation before doing a > > POC. > > > > > > > > > > I think the overhead of page-level statistics will depend almost > > entirely > > > > > on how you configure the parquet writer. For example, if you > > configure > > > > the > > > > > writer for pages with 10,000 GEO points, the overhead of statistics > > > > will > > > > > be much lower than if you configure the writer with pages that have > > 100 > > > > > points. > > > > > > > > > > The performance benefit of having such indexes will depend heavily > > on how > > > > > the data is distributed among the pages and what the query predicate > > is, > > > > > which will determine how much pruning is effective. > > > > > > > > > > I am not surprised there are reasonable use cases where page level > > > > > statistics make a substantial difference (which is what Blake > > appears to > > > > > have shown with the benchmark). > > > > > > > > > > My personal suggestion for anyone who is interested in driving this > > > > forward > > > > > is: > > > > > 1.
Create a proof of concept (add the relevant statistics to the > > > > PageIndex) > > > > > 2. Demonstrate real world performance gains in a plausible benchmark > > > > > > > > > > Ideally the proof of concept would be enough to run the other > > experiments > > > > > suggested by Gang and Arnav. > > > > > > > > > > Andrew > > > > > > > > > > > > > > > > > > > > On Thu, Feb 26, 2026 at 4:37 AM Gang Wu <[email protected]> wrote: > > > > > > > > > > > Thanks Blake for bringing this up! > > > > > > > > > > > > When we were adding the geospatial logical types, page-level geo > > stats > > > > > were > > > > > > deliberately not > > > > > > added, to avoid storage bloat before real world use cases > > > > > appear. > > > > > > > > > > > > I agree with Arnav that we may need more concrete data to justify > > it. > > > > > > > > > > > > Best, > > > > > > Gang > > > > > > > > > > > > On Thu, Feb 26, 2026 at 5:03 PM Arnav Balyan < > > [email protected]> > > > > > > wrote: > > > > > > > > > > > > > Hi Blake, > > > > > > > > > > > > > > Thanks for sharing the benchmarks, the results look quite > > compelling. > > > > > > Page > > > > > > > level pruning seems like a promising direction; I had a couple of > > > > > > > questions: > > > > > > > > > > > > > > 1. Storage overhead/compression ratio: > > > > > > > Do you have measurements on the storage overhead by page level > > stats > > > > in > > > > > > > this benchmark? In particular, for the 1800 pages in the geometry > > > > > column: > > > > > > > > > > > > > > - What was the approximate per page metadata size > > > > > > > - Did you observe any impact on compression ratio/file size > > > > compared > > > > > > to > > > > > > > the baseline? > > > > > > > > > > > > > > 2. Dataset: > > > > > > > Could you share more details on how the test fixture was > > derived?
It > > > > > > > appears to be based on an Overture dataset; it would be helpful > to > > > > > > > understand: > > > > > > > > > > > > > > - What themes the data was drawn from (buildings, places, > > > > > > > transportation, etc) > > > > > > > - Does this represent a specific region, and the mix of > > geometry > > > > > types > > > > > > > present. > > > > > > > > > > > > > > Additionally, have you considered evaluating this on other real > > world > > > > > > > datasets like OpenStreetMap to understand how the performance > varies > > > > on > > > > > > > different data/spatial characteristics? > > > > > > > > > > > > > > Thanks, > > > > > > > Arnav > > > > > > > > > > > > > > > > > > > > > On Tue, Feb 24, 2026 at 6:43 AM Blake Orth <[email protected]> > > wrote: > > > > > > > > > > > > > > > Hello all, > > > > > > > > > > > > > > > > I would like to start a discussion on allowing Page level > > > > statistics > > > > > > for > > > > > > > > the new GEO column types. > > > > > > > > > > > > > > > > If I understand correctly, the discussion during the > > formalization > > > > of > > > > > > GEO > > > > > > > > types initially included page-level statistics. However, the > > > > decision > > > > > > was > > > > > > > > made to only allow Row Group level statistics because there > > was no > > > > > > > > compelling evidence that Page statistics would meaningfully > > impact > > > > > > query > > > > > > > > performance enough to offset their potential impact on file > > size.
> > > > > Some > > > > > > > > discussions with other members of the GeoParquet community > > prompted > > > > > me > > > > > > to > > > > > > > > build some benchmarks exploring the effect a specialized > > spatial > > > > > index > > > > > > > > could have on query performance. The benchmarks explored the > > > > > > > > differences between three cases: a base Parquet file that > > allows > > > > Row > > > > > > > Group > > > > > > > > level pruning, a case simulating Page level pruning using > > simple > > > > flat > > > > > > > > statistics (like the standard Parquet page stats structures), > > and > > > > > > > finally a > > > > > > > > case simulating Page level pruning using a specialized spatial > > > > index. > > > > > > The > > > > > > > > simple flat statistics performed nearly identically to the > > spatial > > > > > > index, > > > > > > > > and allowing page-level pruning improved query performance by > > > > almost > > > > > 2x > > > > > > > > over the base Row Group pruning. Considering these results we > > felt > > > > > that > > > > > > > > pursuing a specialized index specifically for GeoParquet is > > likely > > > > > > > > unnecessary. Allowing page-level statistics for GEO columns > > shows > > > > > > > > meaningful query performance gains. > > > > > > > > > > > > > > > > The source to reproduce the benchmarks can be found here, with > > some > > > > > > > simple > > > > > > > > instructions in the README on how to obtain the test fixture > > and > > > > run > > > > > > the > > > > > > > > benchmarks: > > > > > > > > > > https://github.com/BlakeOrth/geodatafusion/tree/feature/benchmarks > > > > > > > > > > > > > > > > The benchmarks leverage a modified version of GeoDatafusion to > > > > > compute > > > > > > a > > > > > > > > relatively selective geometry intersection query, filtering > > > > > > approximately > > > > > > > > 3,000 geometries from a test fixture containing over 10,000,000 > > > > rows.
> > > > > > The > > > > > > > > file itself is approximately 1.1GB in size and has just over > 1800 > > > > > pages > > > > > > > in > > > > > > > > its geometry column. In this case, allowing statistics on those > > > > pages > > > > > > > > should represent minimal file size overhead. > > > > > > > > > > > > > > > > If anyone has any additional requests for benchmarks or > > information > > > > > on > > > > > > > the > > > > > > > > benchmarks provided, please let me know! > > > > > > > > > > > > > > > > Thanks, > > > > > > > > -Blake Orth
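The page-level pruning simulated in the benchmarks above boils down to an axis-aligned bounding-box intersection test per page: any page whose bbox misses the query bbox can be skipped entirely. A minimal standalone sketch of that idea (all names are illustrative, not the actual GeoDataFusion or arrow-rs API):

```rust
// Illustrative page-pruning sketch: keep only the pages whose covering
// bbox intersects the query's bbox. Names are hypothetical, not taken
// from GeoDataFusion or the arrow-rs PageIndex API.
#[derive(Clone, Copy)]
struct BBox {
    xmin: f64,
    ymin: f64,
    xmax: f64,
    ymax: f64,
}

impl BBox {
    /// Standard axis-aligned rectangle intersection test.
    fn intersects(&self, other: &BBox) -> bool {
        self.xmin <= other.xmax
            && self.xmax >= other.xmin
            && self.ymin <= other.ymax
            && self.ymax >= other.ymin
    }
}

/// Return the indices of pages that might contain matching geometries;
/// all other pages never need to be read or decoded.
fn prune_pages(page_stats: &[BBox], query: &BBox) -> Vec<usize> {
    page_stats
        .iter()
        .enumerate()
        .filter(|(_, b)| b.intersects(query))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let pages = vec![
        BBox { xmin: 0.0, ymin: 0.0, xmax: 1.0, ymax: 1.0 },
        BBox { xmin: 5.0, ymin: 5.0, xmax: 6.0, ymax: 6.0 },
    ];
    let query = BBox { xmin: 0.5, ymin: 0.5, xmax: 0.8, ymax: 0.8 };
    // Only the first page can contain matches.
    assert_eq!(prune_pages(&pages, &query), vec![0]);
}
```

A bbox kept in page statistics is conservative: an intersecting bbox only means the page *may* contain matches, so a full intersection predicate still runs on the surviving pages, which is why flat bbox stats can perform close to a specialized spatial index for selective queries.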
