> path_in_schema being optional. Making it optional doesn't avoid a breaking
> change - it is a breaking change. Every existing reader uses path_in_schema
> to reconstruct the column→schema mapping. The moment writers stop emitting
> it, old readers break. Same ecosystem coordination cost as a new footer
> format. If we're paying that cost, we should get more than one fewer field
> for it.

Well, there are breaking changes and then there are breaking changes. ;-)
I think the thrash from removing a single field is much more manageable
than a second embedded version of the metadata. And I never said it wasn't
a breaking change, but it's no more so than adding a new encoding...at the
end of the day old readers won't be able to read either.

> Moving forward. There is a tradeoff between complexity and scope. I see
> three paths:
>
> A. Ship the full FlatBuffer footer as proposed. Move forward with PR #544
> as-is, logically compatible with the Thrift footer - schema, column chunk
> metadata, statistics, page indexes, encryption, all of it. One transition,
> one spec. Risk: the scope keeps generating debate and we
> stay stuck.
>
> B. Ship a minimal FlatBuffer core, add modules later. Strip the FlatBuffer
> footer to schema + column chunk placement (file offset, compressed size,
> uncompressed size) - the minimum a reader needs to plan I/O. Statistics,
> size statistics, page indexes, encryption become separate
> optional FlatBuffer modules that live before the footer and are referenced
> by pointer from the core. Ratify the core now, add modules as independent
> work streams. This unblocks the part everyone agrees on and lets us iterate
> on the contentious pieces without re-litigating the core.
>
> C. Improve statistics and page indexes within the current format. Hold off
> on the FlatBuffer footer. Focus on smarter writer defaults, tooling like
> parquet-linter, and Will's jump table for O(1) access to existing files. No
> format break, but we accept the structural limitations of Thrift.
>
> My preference is A or B, whichever lands faster.

I think I'd prefer B. If we're going through the effort, rather than just
putting flatbuffer lipstick on the current metadata pig, I was hoping we'd
explore a complete rethink of the metadata with an eye to making it more
modular to support more experimentation.
I like the idea of breaking it up into typed sections, with file navigation
info living in one place, schema another, indexes yet another, things to be
dreamt up etc. I think there's sufficient interest and momentum to avoid the
stagnation Raphael fears.

Anyway, I think we've come to a consensus that it's worth pursuing the skip
index irrespective of the flatbuffer work. Perhaps Will and I can
collaborate and merge our PoCs and make a more formal proposal. I think the
flatbuffer discussion should likely continue on the PR [1].

Thanks all for the lively discussion!

Ed

[1] https://github.com/apache/parquet-format/pull/544
