Thank you, Blake, for putting this together! I left some minor comments on the PR [1]...I think the PR nicely demonstrates that this change is largely self-contained and doesn't affect non-geometry/geography code paths. I believe this should also not affect the metadata sizes of files that don't contain geometry columns, but I didn't check.
Two things I will try to do in the next few weeks to help move this forward are (1) put together a similar POC in C++ and (2) use your PR to rewrite the geoarrow-data example files [2], which might give a wider corpus of input on which to assert that this really does help with the queries. Cheers, -dewey [1] https://github.com/BlakeOrth/arrow-rs/pull/1 [2] https://geoarrow.org/data On Wed, Apr 1, 2026 at 4:08 PM Blake Orth <[email protected]> wrote: > > Hi all, > > My apologies for the delay in getting around to this. Unfortunately, most > of the time open-source contributions can't be my top priority so they're > more of an "as I have time" situation (bummer). However, I did end up > getting a bit of breathing room and was able to implement a POC (at POC > quality, not PR quality) for writing geospatial statistics at the page > level. Feel free to look/comment/review or run the implementation here: > https://github.com/BlakeOrth/arrow-rs/pull/1 > > As the description notes, comparing things like file size between the POC > and the existing implementation is as simple as switching branches back to > `main` and running the exact same commands. Since size overhead was an > initial point of concern for this effort, I took the liberty to get some > numbers for that up front. I've used the same test fixture that was used > for my previously noted query benchmarks to keep things consistent. This is > also the same file that's referenced in the PR linked above. I first > rewrote the test fixture using `arrow-rs/parquet` just to ensure > consistency in the comparison between parquet writer settings (page size, > compression level, etc.). > > re-written test fixture on `main`: 1138395788 bytes > re-written test fixture with POC page stats: 1138457540 bytes > > This means the addition of page level stats for this file results in a file > that is 61752 bytes larger, an increase of 0.0054%. > > I'm looking forward to hearing everyone's thoughts on this.
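The overhead figure quoted above is easy to sanity-check: it is just the byte delta over the baseline size. A minimal standalone sketch of the arithmetic (plain Rust, not code from the PR):

```rust
// Sanity-check of the reported page-stats overhead: the two byte counts
// are the rewritten test fixture on `main` vs. the POC branch, as quoted
// in the message above.
fn overhead_percent(baseline: u64, with_stats: u64) -> f64 {
    (with_stats - baseline) as f64 / baseline as f64 * 100.0
}

fn main() {
    let baseline: u64 = 1_138_395_788;
    let with_stats: u64 = 1_138_457_540;
    // Delta is 61,752 bytes, i.e. roughly 0.0054% larger.
    assert_eq!(with_stats - baseline, 61_752);
    println!("{:.4}%", overhead_percent(baseline, with_stats));
}
```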
I believe there > was also a request to have some additional benchmarks that are a bit easier > to run, which I can start looking into. > > -Blake > > On Fri, Mar 6, 2026 at 8:00 PM Dewey Dunnington <[email protected]> > wrote: > > > Hi Blake et al., > > > > Thank you for driving this and apologies for arriving late to this > > thread. I don't have much to add on top of what you've already written > > here...I am excited that we have gotten to a point with Parquet and > > geometry/geography that we're seeing interest in optimizing at this > > level. I don't think anybody who has worked with page statistics would > > be surprised that small range queries can be made much more performant > > when page statistics are available. Small range queries for maps > > (e.g., zooming around an interactive map!) are currently better served > > by other formats that can embed a row-level spatial index and I think > > this proposal will expand the utility of Parquet to make it > > competitive in those situations as well. > > > > In addition to the speed and potential reduction in IO for small range > > queries, I think this will be a useful change because it helps unify > > the spatial handling with other pruning code that operates on > > page-level statistics. > > > > I'm happy to implement this in C++ when the time comes or do a wider > > set of experiments if the community has specific concerns! > > > > Cheers, > > > > -dewey > > > > On Sun, Mar 1, 2026 at 6:04 AM Andrew Lamb <[email protected]> wrote: > > > > > > > Andrew, regarding your recommendation on how to drive this forward, > > which > > > I > > > > believe I should have time to do. Is the request to effectively modify > > a > > > > parquet library (I'd use arrow-rs) to actually write and read spatial > > > > statistics in the page index?
I don't expect readers here to have > > reviewed > > > > the code I've already provided, but I'm trying to understand how your > > > > suggestion differs from the benchmarks I've already explored. > > > > > > Yes > > > > > > I suggest modifying an existing system with the proposed changes so that > > > 1. It is 100% clear what is proposed (we can review the code) > > > 2. We don't have to speculate about the potential costs / benefits (we > > can > > > instead measure them directly with the proposal) > > > > > > > > > > > > > > > On Thu, Feb 26, 2026 at 4:33 PM Blake Orth <[email protected]> wrote: > > > > > > > Thanks for the engagement, everyone. I'm glad there's generally some > > > > interest in exploring this idea. > > > > > > > > Arnav, to address your questions I'll actually address the 2nd one > > first as > > > > it adds some overall context to the discussion. > > > > 2. Dataset > > > > As you noted, the test fixture was derived from Overture Maps > > Foundation > > > > GeoParquet data. It is a single file selected from the building > > footprints > > > > partition. This means it represents a collection of 2D polygons and > > their > > > > associated properties. The processing dropped the "GeoParquet" specific > > > > data (key-value metadata and dedicated bbox "covering columns") and > > wrote a > > > > new parquet file containing the Parquet geometry metadata. All the > > > > processing was done using GDAL with its default settings. Due to GDAL's > > > > overwhelming prevalence in processing spatial data, we believe this is a > > > > pretty good representation of what we expect to see in real world use > > > > cases. The output file maintains the row ordering of the input. This is > > > > somewhat important to note because Overture data is internally > > partitioned, > > > > placing co-located geometries into the same Row Group (I believe based > > on > > > > their level 3 GeoHash); however, they are not "sorted" in a traditional > > > > sense.
GDAL defaults to blindly truncating Row Groups to 65536 rows. > > All > > > > this is to say that while the test fixture is generally "well formed" > > > > spatially, it doesn't represent a solution optimized for page-level > > > > pruning. > > > > > > > > 1. Storage overhead/compression ratio: > > > > I don't have specific measurements now, but I can provide exact > > numbers for > > > > this case if needed. I noted that the page statistics are "simulated" > > > > because I don't actually have a prototype implementation to write them > > to > > > > the file. This initial effort just collects the data that would exist > > in a > > > > page index (covering bbox) for each page and stores it for use during a > > > > scan. For discussion's sake, we can do some quick "napkin math" in the > > meantime. The current spatial statistics bbox is 4 required doubles and 4 > > > > optional doubles. Unless I'm mistaken, this should yield 32 to 64 > > bytes per > > > > page in the metadata. This test fixture has 2D polygons, so it will > > use the > > > > 32 byte bbox. Approximately 1,800 pages would result in 57,600 bytes. > > For a > > > > file that's about 1.1GB, adding the bbox to the page > > > > statistics would increase the file size by about 0.005%. The > > "geometry" > > > > column accounts for the bulk of the file's compressed size. Row Group > > > > statistics suggest that the geometry column's compressed size is > > generally > > > > between 8MB and 9MB. With 104 Row Groups in the file it's probably > > safe to > > > > assume about 850MB of compressed geometry data. Again, if we want more > > > > exact numbers, let me know and I can provide them in a follow-up. > > > > > > > > I think Andrew's points are important here: the writer primarily > > drives the > > > > effectiveness of page-level statistics, both in terms of compression > > ratio > > > > and pruning potential.
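The napkin math quoted above (a 2D bbox of 4 required doubles per page, 4 more optional doubles when Z/M ranges are present, about 1,800 pages in a ~1.1 GB file) can be written out explicitly. This is a standalone illustrative sketch of the arithmetic only, not any actual Parquet metadata layout:

```rust
// Per-page bbox size from the napkin math: 4 required f64s
// (xmin/xmax/ymin/ymax), plus 4 optional f64s for Z and M ranges.
const DOUBLE_BYTES: usize = std::mem::size_of::<f64>(); // 8 bytes

fn bbox_bytes(has_z_and_m: bool) -> usize {
    let doubles = if has_z_and_m { 8 } else { 4 };
    doubles * DOUBLE_BYTES
}

fn main() {
    assert_eq!(bbox_bytes(false), 32); // 2D polygons -> 32 bytes per page
    assert_eq!(bbox_bytes(true), 64); // with Z and M ranges -> 64 bytes

    let pages = 1_800_usize;
    let total_bytes = pages * bbox_bytes(false); // 57,600 bytes of bboxes
    let file_bytes = 1.1e9_f64; // ~1.1 GB test fixture

    // Works out to roughly 0.005% of the file size.
    println!(
        "{} bytes, {:.4}% of the file",
        total_bytes,
        total_bytes as f64 / file_bytes * 100.0
    );
}
```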
I don't feel like this is a geospatial-specific > > > > statement either, though it's probably more applicable to any column > > > > represented as Binary than to primitive types. > > > > > > > > Andrew, regarding your recommendation on how to drive this forward, > > which I > > > > believe I should have time to do. Is the request to effectively modify > > a > > > > parquet library (I'd use arrow-rs) to actually write and read spatial > > > > statistics in the page index? I don't expect readers here to have > > reviewed > > > > the code I've already provided, but I'm trying to understand how your > > > > suggestion differs from the benchmarks I've already explored. > > > > > > > > -Blake > > > > > > > > > > > > > > > > > > > > On Thu, Feb 26, 2026 at 4:49 AM Andrew Lamb <[email protected]> > > > > wrote: > > > > > > > > > Personally I think having page level statistics is a good idea and I > > am > > > > not > > > > > sure we need to do a lot more empirical evaluation before doing a > > POC. > > > > > > > > > > I think the overhead of page-level statistics will depend almost > > entirely > > > > > on how you configure the parquet writer. For example, if you > > configure > > > > the > > > > > writer for pages with 10,000 GEO points, the overhead of statistics > > > > will > > > > > be much lower than if you configure the writer with pages that have > > 100 > > > > > points. > > > > > > > > > > The performance benefit of having such indexes will depend heavily > > on how > > > > > the data is distributed among the pages and what the query predicate > > is, > > > > > which will determine how much pruning is effective. > > > > > > > > > > I am not surprised there are reasonable use cases where page level > > > > > statistics make a substantial difference (which is what Blake > > appears to > > > > > have shown with the benchmark). > > > > > > > > > > My personal suggestion for anyone who is interested in driving this > > > > forward > > > > > is: > > > > > 1.
Create a proof of concept (add the relevant statistics to the > > > > PageIndex) > > > > > 2. Demonstrate real world performance gains in a plausible benchmark > > > > > > > > > > Ideally the proof of concept would be enough to run the other > > experiments > > > > > suggested by Gang and Arnav. > > > > > > > > > > Andrew > > > > > > > > > > > > > > > > > > > > On Thu, Feb 26, 2026 at 4:37 AM Gang Wu <[email protected]> wrote: > > > > > > > > > > > Thanks Blake for bringing this up! > > > > > > > > > > > > When we were adding the geospatial logical types, page-level geo > > stats > > > > > were > > > > > > deliberately not > > > > > > added, to avoid storage bloat before real world use cases > > > > > appear. > > > > > > > > > > > > I agree with Arnav that we may need more concrete data to justify > > it. > > > > > > > > > > > > Best, > > > > > > Gang > > > > > > > > > > > > On Thu, Feb 26, 2026 at 5:03 PM Arnav Balyan < > > [email protected]> > > > > > > wrote: > > > > > > > > > > > > > Hi Blake, > > > > > > > > > > > > > > Thanks for sharing the benchmarks, the results look quite > > compelling. > > > > > > Page > > > > > > > level pruning seems like a promising direction; I had a couple of > > > > > > > questions: > > > > > > > > > > > > > > 1. Storage overhead/compression ratio: > > > > > > > Do you have measurements on the storage overhead by page level > > stats > > > > in > > > > > > > this benchmark? In particular, for the 1800 pages in the geometry > > > > > column: > > > > > > > > > > > > > > - What was the approximate per page metadata size > > > > > > > - Did you observe any impact on compression ratio/file size > > > > compared > > > > > > to > > > > > > > the baseline? > > > > > > > > > > > > > > 2. Dataset: > > > > > > > Could you share more details on how the test fixture was > > derived?
It > > > > > > > appears to be based on an Overture dataset; it would be helpful > to > > > > > > > understand: > > > > > > > > > > > > > > - What themes the data was drawn from (buildings, places, > > > > > > > transportation, etc) > > > > > > > - Does this represent a specific region, and the mix of > > geometry > > > > > types > > > > > > > present. > > > > > > > > > > > > > > Additionally, have you considered evaluating this on other real > > world > > > > > > > datasets like OpenStreetMap to understand how the performance > varies > > > > on > > > > > > > different data/spatial characteristics? > > > > > > > > > > > > > > Thanks, > > > > > > > Arnav > > > > > > > > > > > > > > > > > > > > > On Tue, Feb 24, 2026 at 6:43 AM Blake Orth <[email protected]> > > wrote: > > > > > > > > > > > > > > > Hello all, > > > > > > > > > > > > > > > > I would like to start a discussion on allowing Page level > > > > statistics > > > > > > for > > > > > > > > the new GEO column types. > > > > > > > > > > > > > > > > If I understand correctly, the discussion during the > > formalization > > > > of > > > > > > GEO > > > > > > > > types initially included page-level statistics. However, the > > > > decision > > > > > > was > > > > > > > > made to only allow Row Group level statistics because there > > was no > > > > > > > > compelling evidence that Page statistics would meaningfully > > impact > > > > > > query > > > > > > > > performance enough to offset their potential impact on file > > size.
> > > > > Some > > > > > > > > discussions with other members of the GeoParquet community > > prompted > > > > > me > > > > > > to > > > > > > > > build some benchmarks exploring the effect a specialized > > spatial > > > > > index > > > > > > > > could have on query performance. The benchmarks explored the > > > > > > > > differences between three cases: a base Parquet file that > > allows > > > > Row > > > > > > > Group > > > > > > > > level pruning, a case simulating Page level pruning using > > simple > > > > flat > > > > > > > > statistics (like the standard Parquet page stats structures), > > and > > > > > > > finally a > > > > > > > > case simulating Page level pruning using a specialized spatial > > > > index. > > > > > > The > > > > > > > > simple flat statistics performed nearly identically to the > > spatial > > > > > > index, > > > > > > > > and allowing page-level pruning improved query performance by > > > > almost > > > > > 2x > > > > > > > > over the base Row Group pruning. Considering these results we > > felt > > > > > that > > > > > > > > pursuing a specialized index specifically for GeoParquet is > > likely > > > > > > > > unnecessary. Allowing page-level statistics for GEO columns > > shows > > > > > > > > meaningful query performance gains. > > > > > > > > > > > > > > > > The source to reproduce the benchmarks can be found here, with > > some > > > > > > > simple > > > > > > > > instructions in the README on how to obtain the test fixture > > and > > > > run > > > > > > the > > > > > > > > benchmarks: > > > > > > > > > > https://github.com/BlakeOrth/geodatafusion/tree/feature/benchmarks > > > > > > > > > > > > > > > > The benchmarks leverage a modified version of GeoDatafusion to > > > > > compute > > > > > > a > > > > > > > > relatively selective geometry intersection query, filtering > > > > > > approximately > > > > > > > > 3,000 geometries from a test fixture containing over 10,000,000 > > > > rows.
> > > > > > The > > > > > > > > file itself is approximately 1.1GB in size and has just over > 1800 > > > > > pages > > > > > > > in > > > > > > > > its geometry column. In this case, allowing statistics on those > > > > pages > > > > > > > > should represent minimal file size overhead. > > > > > > > > > > > > > > > > If anyone has any additional requests for benchmarks or > > information > > > > > on > > > > > > > the > > > > > > > > benchmarks provided, please let me know! > > > > > > > > > > > > > > > > Thanks, > > > > > > > > -Blake Orth
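The page-level pruning simulated in the benchmarks above boils down to an axis-aligned bounding-box intersection test per page: any page whose bbox misses the query bbox can be skipped entirely. A minimal standalone sketch of that idea (all names are illustrative, not the actual GeoDataFusion or arrow-rs API):

```rust
// Illustrative page-pruning sketch: keep only the pages whose covering
// bbox intersects the query's bbox. Names are hypothetical, not taken
// from GeoDataFusion or the arrow-rs PageIndex API.
#[derive(Clone, Copy)]
struct BBox {
    xmin: f64,
    ymin: f64,
    xmax: f64,
    ymax: f64,
}

impl BBox {
    /// Standard axis-aligned rectangle intersection test.
    fn intersects(&self, other: &BBox) -> bool {
        self.xmin <= other.xmax
            && self.xmax >= other.xmin
            && self.ymin <= other.ymax
            && self.ymax >= other.ymin
    }
}

/// Return the indices of pages that might contain matching geometries;
/// all other pages never need to be read or decoded.
fn prune_pages(page_stats: &[BBox], query: &BBox) -> Vec<usize> {
    page_stats
        .iter()
        .enumerate()
        .filter(|(_, b)| b.intersects(query))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let pages = vec![
        BBox { xmin: 0.0, ymin: 0.0, xmax: 1.0, ymax: 1.0 },
        BBox { xmin: 5.0, ymin: 5.0, xmax: 6.0, ymax: 6.0 },
    ];
    let query = BBox { xmin: 0.5, ymin: 0.5, xmax: 0.8, ymax: 0.8 };
    // Only the first page can contain matches.
    assert_eq!(prune_pages(&pages, &query), vec![0]);
}
```

A bbox kept in page statistics is conservative: an intersecting bbox only means the page *may* contain matches, so a full intersection predicate still runs on the surviving pages, which is why flat bbox stats can perform close to a specialized spatial index for selective queries.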
