Hi Milan,

The short answer is that the current language of the spec does not forbid writing "OGC:CRS84" to the CRS field (which is "just a string" as far as Thrift is concerned). All existing readers that I know about (DuckDB, arrow-rs, Arrow C++, GDAL) will accept that string and interpret it unambiguously on read (for example, `geopandas.GeoDataFrame.from_arrow(pyarrow.parquet.read_table(...))` works). There is also an example file in parquet-testing that covers this case (an arbitrary string that is neither of the recommended options) [1]. I put together a small example script to demonstrate the read path for the tools I mentioned [2].
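To make the read-side interpretation concrete, here is a minimal sketch of how a reader might bucket the raw CRS string. It is an illustration only and not how DuckDB, arrow-rs, Arrow C++, or GDAL are implemented. The function name `classify_crs` and the bucket labels are hypothetical, and the exact-match, lowercase treatment of the `srid:` and `projjson:` prefixes is an assumption.

```python
import json
from typing import Optional, Tuple

# Default CRS named by the Parquet spec when the field is omitted.
DEFAULT_CRS = "OGC:CRS84"


def classify_crs(crs: Optional[str]) -> Tuple[str, str]:
    """Bucket the raw CRS string (hypothetical helper, labels are made up).

    Returns a (kind, value) pair where kind is one of:
    "default", "srid", "projjson_key", "inline_projjson", "other".
    """
    # An omitted or empty CRS falls back to the default.
    if crs is None or crs.strip() == "":
        return ("default", DEFAULT_CRS)

    # The two forms recommended in the CRS Customization section.
    prefix, sep, rest = crs.partition(":")
    if sep and prefix == "srid" and rest.isdigit():
        return ("srid", rest)
    if sep and prefix == "projjson" and rest:
        return ("projjson_key", rest)

    # An inline JSON object (what most Arrow-based writers emit today).
    stripped = crs.strip()
    if stripped.startswith("{"):
        try:
            json.loads(stripped)
            return ("inline_projjson", stripped)
        except json.JSONDecodeError:
            pass

    # Anything else, e.g. "OGC:CRS84" or "EPSG:4326", is passed through
    # for a CRS library to interpret.
    return ("other", stripped)
```

With this sketch, "OGC:CRS84" lands in the pass-through bucket rather than being rejected, which matches the "just a string" reading of the spec.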
Jia is correct that the GeoParquet community will require writing an inline PROJJSON string in the forthcoming 2.0 version of the specification [3]. This was a pragmatic decision that reflects the needs of existing GeoParquet users because:

- srid does not explicitly name the EPSG database, so any code written there does not have an unambiguous interpretation (even if it did, it would place ambiguous licensing and/or dependency requirements on consumers).
- projjson:some_field was not practical to implement on the write side for either of the implementations I was involved in (C++ and Rust). Implementations just don't expose the global key/value metadata when converting types, and doing so would have required breaking changes in the APIs. There are also ambiguities with respect to existing propagation of schema metadata (i.e., the projjson schema key is often propagated in unexpected ways into pyarrow and beyond, including being written into the key/value metadata of a resulting Parquet file).

As a result, most of the tools that can write GEOMETRY and GEOGRAPHY (Arrow C++, GDAL, arrow-rs) are currently writing inline strings (because inline strings are what is available in the representation passed to Arrow-based writers, and this was better than omitting CRS information). For all the implementations I was involved in, we also try to explicitly omit the CRS when we detect that the string we were passed is lon/lat (i.e., if they see "OGC:CRS84", they write an omitted CRS to minimize the need for consumers to be CRS aware).

I'll echo Jia's comment that none of us are keen to reopen a CRS discussion, but I also agree that the current language of the spec is vague and doesn't reflect the reality of the ecosystem as it has evolved.
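The write-side behaviour described above (omit the CRS when the input is lon/lat, otherwise write the inline string) can be sketched roughly as follows. This is a simplified illustration, not the Arrow C++ or arrow-rs code. The names `is_lonlat` and `crs_to_write` are hypothetical, and detecting lon/lat through a PROJJSON `id` of authority "OGC" and code "CRS84" is an assumption about one reasonable check.

```python
import json
from typing import Optional


def is_lonlat(crs: str) -> bool:
    """Best-effort check for OGC:CRS84 (hypothetical helper)."""
    text = crs.strip()
    # Shorthand form of the default CRS.
    if text.upper() == "OGC:CRS84":
        return True
    # Assumption: an inline PROJJSON object may identify itself via an
    # "id" member such as {"authority": "OGC", "code": "CRS84"}.
    if text.startswith("{"):
        try:
            ident = json.loads(text).get("id")
        except json.JSONDecodeError:
            return False
        return (
            isinstance(ident, dict)
            and ident.get("authority") == "OGC"
            and str(ident.get("code")) == "CRS84"
        )
    return False


def crs_to_write(crs: Optional[str]) -> Optional[str]:
    """Return the string to store in the CRS field, or None to omit it."""
    if crs is None or crs.strip() == "":
        return None
    # Lon/lat is the default, so omit it and spare consumers from
    # needing to be CRS aware.
    if is_lonlat(crs):
        return None
    # Otherwise pass the inline string through unchanged.
    return crs
```

Here `crs_to_write("OGC:CRS84")` returns `None` (CRS omitted), while any other string is written as-is.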
I'm happy to review any PRs to improve the language or implementations :)

Cheers,

-dewey

[1] https://github.com/apache/parquet-testing/tree/master/data/geospatial#geospatial-test-files
[2] https://gist.github.com/paleolimbot/7759e58bf1f98ecf8f2c459367bbdeda
[3] https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#crs-parquet-property

On Wed, Mar 25, 2026 at 12:49 AM Jia Yu <[email protected]> wrote:
>
> Hi Milan,
>
> The authority:identifier pattern was explicitly rejected in prior
> community discussions. The core concern is that it forces query
> engines to rely on external registries to resolve CRS definitions,
> which breaks the goal of self-contained data. More importantly, the
> most widely used authority, the EPSG database, comes with licensing
> terms that are not particularly open-source friendly:
> https://epsg.org/terms-of-use.html
>
> As a result, the community has leaned toward requiring data writers to
> use a fully self-contained CRS representation such as PROJJSON. In
> that model, a reference like OGC:CRS84 is understood to map directly
> to its corresponding PROJJSON definition, as outlined in the
> GeoParquet specification:
> https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#ogccrs84-details
>
> That said, this expectation is not clearly spelled out in the Parquet
> and Iceberg specifications, which leaves some ambiguity in practice.
>
> I don’t have a strong stance either way. In fact, I can see the case
> for allowing authority:identifier. But it’s worth noting that
> introducing it now would likely reopen a fairly contentious discussion
> in the community.
>
> Jia
>
> On Tue, Mar 24, 2026 at 10:09 AM Milan Stefanovic
> <[email protected]> wrote:
> >
> > Hi everyone,
> >
> > I’m looking for some clarification (and potentially a small spec update)
> > regarding the Geospatial Physical Types documentation -
> > https://parquet.apache.org/docs/file-format/types/geospatial/, specifically
> > the CRS Customization section.
> >
> > 1) The Confusion
> >
> > Currently, the spec states that custom CRS values should follow the
> > `type:identifier` format, where type is either `srid` or `projjson` -
> > (e.g., `srid:4326` or `projjson:property_name`). The spec also defines the
> > default CRS as `OGC:CRS84`.
> >
> > Depending on how the specification is read, the reader may consider as
> > valid CRS definition to be only strings of the form `srid:<some number>` or
> > `projjson:<property name>`, which implies that `OGC:CRS84` does not adhere
> > to the rules defined in the customization section. This creates confusion
> > for implementers: should the type string always be parsed as a strict
> > "custom" format which necessitates the srid: prefix?
> >
> > 2) The Suggestion
> >
> > I suggest we update the language to be explicit about allowed formats for
> > CRS, and my suggestion is that we break it down like this:
> > - Standard CRS: Any string from a known authority in a format of
> >   `<authority>:<identifier>` (e.g., `EPSG:4326`, `OGC:CRS84`, `ESRI:102100`)
> >   is accepted.
> > - Custom CRS: in the format of `type:identifier`
> >   - `srid:1234`: The definition resides in a local/database spatial
> >     reference table.
> >   - `projjson:key`: The definition is stored in Parquet file/table
> >     metadata.
> >
> > This would validate `OGC:CRS84` as a first-class string while providing a
> > clear "escape hatch" for custom definitions.
> >
> > What are your thoughts ?
> >
> > Kind regards,
> > Milan
