Re: [PR] feat: TABLESAMPLE SYSTEM end-to-end + row-group / row sampling on ParquetSource [datafusion]

via GitHub Mon, 04 May 2026 13:19:35 -0700


adriangb commented on PR #22000:
URL: https://github.com/apache/datafusion/pull/22000#issuecomment-4374194293


   > > > I am very wary of complicating the built in Parquet reader any more -- 
it is already very complicated with lots of behaviors (and new ones getting 
added all the time, for example the sortedness ones from @zhuqi-lucas and 
@xudong963 )
   > > 
   > > 
   > > I agree it is a complex piece of software but I think we can continue to 
add the right abstractions and simplifications (like you recently did with the 
morselization work 😄 ). Ultimately the file reader is going to be a key piece of 
a data toolkit like DataFusion so it's unsurprising (to me) that it holds a lot 
of the complexity.
   > 
   > yeah -- maybe I am oversensitive as I feel like as soon as we are able to 
refactor away some of the complexity then it gets all complicated again 😆
   
   No, you are right: it is a big risk that this code turns into feature 
spaghetti. It's just not one I think we can necessarily avoid. We should be 
*cautious* about introducing complexity and push back (like you have here) but 
if this is the right place to put it and we can factor it into a shape that 
only adds complexity, not multiplies or exponentiates it, then maybe we just 
need to deal with it over time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

