alamb commented on PR #22000: URL: https://github.com/apache/datafusion/pull/22000#issuecomment-4374073364
> DataFusion has the machinery for fine-grained parquet sampling (ParquetAccessPlan with Skip / Scan / Selection(RowSelection)) but no public way to ask for a sample without constructing the access plan by hand and stuffing it into PartitionedFile.extensions, and no SQL surface at all. That works for one-off code but is awkward for:

My personal rationale was that all the different SQL systems did sampling differently, so any particular choice for sampling is probably fine, but I wasn't at all sure there would be enough commonality across implementations to put it into DataFusion core.

Also, what is the problem with constructing the access plan by hand? For this type of low-level access pattern (particular sampling methods, etc.), it seems like low-level construction is just the escape valve that is needed (super fine-grained control).

I am very wary of complicating the built-in Parquet reader any more. It is already very complicated, with lots of behaviors (and new ones getting added all the time, for example the sortedness ones from @zhuqi-lucas and @xudong963).

So adding APIs to make it easier to extend / modify plans makes sense to me, but hard-coding more sampling into the core is much less clear to me.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
