> path_in_schema being optional. Making it optional doesn't avoid a breaking
> change - it is a breaking change. Every existing reader uses path_in_schema
> to reconstruct the column→schema mapping. The moment writers stop emitting
> it, old readers break. Same ecosystem coordination cost as a new footer
> format. If we're paying that cost, we should get more than one fewer field
> for it.

Well, there are breaking changes and then there are breaking changes. ;-) I
think the thrash from removing a single field is much more manageable than a
second embedded version of the metadata. And I never said it wasn't a breaking
change, but it's no more of one than adding a new encoding. At the end of the
day, old readers won't be able to read files written either way.
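
For anyone following along, here is a minimal sketch of why the field is
redundant for a reader that knows to ignore it. It assumes the schema is stored
as in the Thrift footer: a flattened, depth-first list of elements, each with a
name and a child count, with the root first. The SchemaElement class and
leaf_paths function are illustrative stand-ins, not any library's API. Old
readers that depend on path_in_schema would of course still break.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SchemaElement:
    # Illustrative stand-in for a flattened schema entry.
    name: str
    num_children: int = 0  # 0 means a leaf (primitive) column


def leaf_paths(schema: List[SchemaElement]) -> List[List[str]]:
    """Rebuild each leaf column's path by walking the flattened schema
    depth-first. The root element (index 0) is not part of any path."""
    paths: List[List[str]] = []
    pos = 1  # next element to visit; index 0 is the root

    def visit(prefix: List[str]) -> None:
        nonlocal pos
        elem = schema[pos]
        pos += 1
        path = prefix + [elem.name]
        if elem.num_children == 0:
            paths.append(path)  # leaf: one column chunk per row group
        else:
            for _ in range(elem.num_children):
                visit(path)

    for _ in range(schema[0].num_children):
        visit([])
    return paths


schema = [
    SchemaElement("schema", 2),   # root
    SchemaElement("id"),
    SchemaElement("address", 2),  # group with two leaves
    SchemaElement("city"),
    SchemaElement("zip"),
]
print(leaf_paths(schema))
```

The i-th path in the result corresponds to the i-th column chunk in a row
group, which is the same mapping the explicit field spells out.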
 
> 
> Moving forward. There is a tradeoff between complexity and scope. I see
> three paths:
> 
> A. Ship the full FlatBuffer footer as proposed. Move forward with PR #544
> as-is, logically compatible with the Thrift footer - schema, column chunk
> metadata, statistics, page indexes, encryption, all of it. One transition,
> one spec. Risk: the scope keeps generating debate and we
> stay stuck.
> 
> B. Ship a minimal FlatBuffer core, add modules later. Strip the FlatBuffer
> footer to schema + column chunk placement (file offset, compressed size,
> uncompressed size) - the minimum a reader needs to plan I/O. Statistics,
> size statistics, page indexes, encryption become separate
> optional FlatBuffer modules that live before the footer and are referenced
> by pointer from the core. Ratify the core now, add modules as independent
> work streams. This unblocks the part everyone agrees on and lets us iterate
> on the contentious pieces without re-litigating the core.
> 
> C. Improve statistics and page indexes within the current format. Hold off
> on the FlatBuffer footer. Focus on smarter writer defaults, tooling like
> parquet-linter, and Will's jump table for O(1) access to existing files. No
> format break, but we accept the structural limitations of Thrift.
> 
> My preference is A or B, whichever lands faster.

I think I'd prefer B. If we're going through the effort, rather than just
putting FlatBuffer lipstick on the current metadata pig, I was hoping we'd
explore a complete rethink of the metadata with an eye to making it more
modular and friendlier to experimentation. I like the idea of breaking it up
into typed sections, with file navigation info living in one place, schema in
another, indexes in yet another, and room for things yet to be dreamt up. I
think there's sufficient interest and momentum to avoid the stagnation Raphael
fears.
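
To make the typed-sections idea a little more concrete, here is a rough sketch
of a footer that is just a directory of (kind, offset, length) entries. All of
the names here (SectionRef, plan_reads, the section kinds, the byte offsets)
are hypothetical and come from neither PR #544 nor any existing format; the
point is only that a reader fetches the sections it understands and skips the
rest.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SectionRef:
    # Hypothetical directory entry pointing at one typed metadata section.
    kind: str    # e.g. "navigation", "schema", "page_index"
    offset: int  # byte offset of the section within the file
    length: int  # section size in bytes


def plan_reads(directory: List[SectionRef],
               wanted: List[str]) -> List[Tuple[int, int]]:
    """Return (offset, length) ranges for the wanted section kinds.
    Kinds missing from the directory are silently skipped, and kinds the
    reader did not ask for are never touched."""
    by_kind = {ref.kind: ref for ref in directory}
    return [(by_kind[k].offset, by_kind[k].length)
            for k in wanted if k in by_kind]


directory = [
    SectionRef("navigation", 0, 128),
    SectionRef("schema", 128, 512),
    SectionRef("page_index", 640, 4096),
    SectionRef("experimental_sketch", 4736, 64),  # unknown to older readers
]

# A reader that only needs to plan I/O reads two small sections.
print(plan_reads(directory, ["navigation", "schema"]))
```

A new section kind can then be added without touching readers that never ask
for it, which is the experimentation property I'm after.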

Anyway, I think we've come to a consensus that it's worth pursuing the skip
index irrespective of the FlatBuffer work. Perhaps Will and I can merge our
PoCs and put together a more formal proposal.

I think the FlatBuffer discussion should continue on the PR [1].

Thanks all for the lively discussion!

Ed

[1] https://github.com/apache/parquet-format/pull/544
