AntoinePrv commented on PR #21828:
URL: https://github.com/apache/datafusion/pull/21828#issuecomment-4341758693

   > ```sql
   > SELECT * FROM `single_file.parquet` LIMIT 5 OFFSET 600000000
   > ```
   > 
   > I think DataFusion would probably be "correct" to return any 5 rows (as 
the query doesn't specify any `ORDER BY`)
   > 
   > However, that is probably not what the user wanted / expected. The user 
probably wants rows starting at logical offset 600000000 of the file.
   
   From a user perspective, I very much expect to be returned the rows in the 
logical order of the file (alternatively, is there a clause to express it as an 
`ORDER BY`?). I understand this may not be what someone coming from databases 
expects. But coming from data science, one may consider Parquet a "smart CSV" 
and wish to inspect the "last 1000 rows they just generated" (*i.e.* the bottom 
of the file). While data scientists may not use DataFusion directly, they may 
well write queries on systems built with it.
   
   > Likewise, what rows should be returned from this query (where there are 
multiple files)?
   > 
   > ```sql
   > SELECT * FROM directory_with_multiple_files LIMIT 5 OFFSET 600000000
   > ```
   > 
   > That is not clear to me - is there some sort of implied global order of 
rows within the file that users expect?
   
   I do not have as strong an expectation here. In the best case, I would say 
the logical order is determined by something like the lexicographic order of 
the file paths, but that does feel a bit more arbitrary. Still, I would expect 
that there is some fixed order (possibly hidden from me) and that changing the 
offset changes the rows I am viewing in that hidden order. That is:
   - The same query run twice returns the same rows.
   - I am able to observe all the rows by changing the offset.
   
   ------
   
   Perhaps I can say a bit more about my use case. Basically, I want to 
paginate rows from a dataset to show them to a user.
   For the single-file case, this is like a file viewer, so I would prefer to 
have a way to express logical order (either explicitly or implicitly).
   For the multiple-file scenario, I believe the order is not as important, 
but I do need a way to get "chunks" of rows that cover the whole dataset 
without duplication, while retrieving a single "chunk" with maximum efficiency 
(e.g. most of the time a chunk is contiguous rows from a single file).
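
To make the "chunks" idea concrete, here is a minimal sketch of how a global offset/limit window could be mapped onto per-file row ranges. It is not how DataFusion works today. It assumes a fixed order given by sorting file paths lexicographically and assumes the row count of each file is known up front (for example from Parquet footers). The function name `plan_page` and the `row_counts` mapping are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def plan_page(row_counts, offset, limit):
    """Map a global (offset, limit) window onto per-file row ranges.

    row_counts: dict of file path -> number of rows (hypothetical metadata).
    Returns a list of (path, first_row, n_rows) in the assumed global order.
    """
    paths = sorted(row_counts)  # assumed order: lexicographic file paths
    # ends[i] = total number of rows in files 0..i (cumulative end offsets)
    ends = list(accumulate(row_counts[p] for p in paths))
    plan = []
    i = bisect_right(ends, offset)  # first file that ends past the offset
    remaining = limit
    while remaining > 0 and i < len(paths):
        file_start = ends[i] - row_counts[paths[i]]
        # Only the first file is entered mid-way; later files start at row 0.
        first_row = max(offset - file_start, 0)
        n = min(remaining, row_counts[paths[i]] - first_row)
        if n > 0:  # skip empty files
            plan.append((paths[i], first_row, n))
            remaining -= n
        i += 1
    return plan
```

Under these assumptions the same `(offset, limit)` always yields the same plan, stepping the offset by the page size visits every row exactly once, and most pages touch a single file.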
   
   I think "efficiently paginate rows" can be considered a reasonable feature 
beyond my own use case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

