AntoinePrv commented on PR #21828: URL: https://github.com/apache/datafusion/pull/21828#issuecomment-4341758693
> ```sql
> SELECT * FROM `single_file.parquet` LIMIT 5 OFFSET 600000000
> ```
>
> I think DataFusion would probably be "correct" to return any 5 rows (as the query doesn't specify any `ORDER BY`)
>
> However, that is probably not what the user wanted / expected. The user probably wants rows starting at logical offset 600000000 of the file.

From a user perspective, I very much expect to be returned the rows in the logical order of the file (alternatively, is there a clause to express it as an `ORDER BY`?). I understand this may not be what someone coming from databases would expect. But coming more from data science, one may consider Parquet a "smart CSV" and wish to investigate the "last 1000 rows they just generated" (*i.e.* the bottom of the file). While data scientists may not use DataFusion directly, they may absolutely write queries on systems built with it.

> Likewise, what rows should be returned from this query (where there are multiple files)?
>
> ```sql
> SELECT * FROM directory_with_multiple_files LIMIT 5 OFFSET 600000000
> ```
>
> That is not clear to me - is there some sort of implied global order of rows within the file that users expect?

I do not have as strong an expectation here. Best case, I would say the logical order is determined by something like the lexicographic order of the file paths, but that does feel a bit more arbitrary. Still, I would expect that there is some arbitrary order (possibly hidden from me) and that changing the offset changes the rows I am viewing in that hidden order. That is:

- The same query run twice returns the same rows.
- I am able to observe all the rows by changing the offset.

------

Perhaps I can say a bit more on my use case. Basically, I want to paginate rows from a dataset to show to a user. For the single-file case, this is like a file viewer, so I do prefer to have a way to express logical order (either explicitly or implicitly).
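As an aside, the determinism I am after can be sketched with an explicit ordering key. This is a minimal illustration using SQLite as a stand-in engine (not the DataFusion API); the table and column names are made up for the example:

```python
import sqlite3

# Stand-in dataset: 100 rows inserted in "file order". SQLite assigns each
# row a sequential rowid, which plays the role of the file's logical offset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (value INTEGER)")
conn.executemany("INSERT INTO t (value) VALUES (?)", [(i,) for i in range(100)])

# Without ORDER BY, LIMIT/OFFSET may return any 5 rows. With an explicit
# ordering key, the page at a given offset is deterministic.
page = conn.execute(
    "SELECT value FROM t ORDER BY rowid LIMIT 5 OFFSET 60"
).fetchall()
print(page)  # [(60,), (61,), (62,), (63,), (64,)]
```

The point is only that *some* stable key (explicit or implicit) is what makes "rows starting at logical offset N" well-defined.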
For the multiple-file scenario, I believe the order is not as important, but I do need a way to get "chunks" of rows that cover the whole dataset without duplication, while retrieving a single "chunk" with maximum efficiency (e.g. most of the time they are contiguous rows from a single file). I think "efficiently paginate rows" can be considered a reasonable feature beyond my own use case.
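To make the "chunks cover the whole dataset without duplication" property concrete, here is a sketch of paging through an entire table with a fixed ordering key, again using SQLite as a stand-in engine with illustrative table and column names:

```python
import sqlite3

# Stand-in dataset of 17 rows, so the last page is deliberately partial.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (value INTEGER)")
conn.executemany("INSERT INTO t (value) VALUES (?)", [(i,) for i in range(17)])

# Page through with a fixed ORDER BY key. Because the key is stable, the
# pages partition the dataset: every row appears exactly once.
page_size, offset, seen = 5, 0, []
while True:
    rows = conn.execute(
        "SELECT value FROM t ORDER BY rowid LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()
    if not rows:
        break
    seen.extend(v for (v,) in rows)
    offset += page_size

assert seen == list(range(17))  # full coverage, no duplicates
```

Whether the hidden order is file-logical or lexicographic-by-path matters less to me than that it satisfies this partition property.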
