alamb commented on PR #21828:
URL: https://github.com/apache/datafusion/pull/21828#issuecomment-4337743212

   I think we should take a step back and reconsider what we are trying to do -- perhaps by defining and documenting somewhere what the intended semantics for `OFFSET` are.
   
   For example, for a query like 
   ```sql
   SELECT * FROM `single_file.parquet` LIMIT 5 OFFSET 600000000
   ```
   I think DataFusion would probably be "correct" to return any 5 rows (as the query doesn't specify any `ORDER BY`).
   
   
   However, that is probably not what the user wanted / expected. The user probably wants the rows starting at logical offset 600000000 within the file.
   
   Likewise, what rows should be returned from this query (where there are multiple files)?
   ```sql
   SELECT * FROM directory_with_multiple_files LIMIT 5 OFFSET 600000000
   ```
   That is not clear to me -- is there some sort of implied global order of rows across the files that users expect?
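   
   One plausible definition (an assumption on my part, not something DataFusion guarantees today) would be to fix a total order on the files, for example lexicographically by path, and treat their concatenation as a single row stream. A minimal sketch of the resulting offset arithmetic, using hypothetical names:
   ```rust
   /// Sketch only: `FileMeta` and `resolve_global_offset` are hypothetical
   /// names, not DataFusion APIs.
   struct FileMeta {
       path: String,
       num_rows: u64, // exact row count from the parquet footer
   }
   
   /// Map a global OFFSET onto (file index, offset within that file), assuming
   /// the files are totally ordered (here: lexicographically by path).
   fn resolve_global_offset(mut files: Vec<FileMeta>, offset: u64) -> Option<(usize, u64)> {
       files.sort_by(|a, b| a.path.cmp(&b.path)); // the assumed "implied global order"
       let mut remaining = offset;
       for (idx, f) in files.iter().enumerate() {
           if remaining < f.num_rows {
               return Some((idx, remaining));
           }
           remaining -= f.num_rows; // this whole file falls before the offset
       }
       None // the offset lies past the last row of the last file
   }
   
   fn main() {
       let files = vec![
           FileMeta { path: "b.parquet".into(), num_rows: 100 },
           FileMeta { path: "a.parquet".into(), num_rows: 100 },
       ];
       // Offset 150 lands in the second file of the sorted order ("b.parquet").
       assert_eq!(resolve_global_offset(files, 150), Some((1, 50)));
   }
   ```
   Even that definition only pins down the result if the scan actually reads the files in that order (and within a single partition).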
   
   
   
   Another question: **Do we expect DataFusion to always produce the same values for a given LIMIT / OFFSET combination?** (Note that today it can return different values for the same `LIMIT` when there is more than one partition and no `ORDER BY` clause.)
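   
   To make that nondeterminism concrete, here is a small example against DataFusion's public `SessionContext` / `MemTable` API (names checked against recent releases, though version drift is possible; the exact output depends on scheduling, which is the point):
   ```rust
   use std::sync::Arc;
   
   use datafusion::arrow::array::{ArrayRef, Int32Array};
   use datafusion::arrow::datatypes::{DataType, Field, Schema};
   use datafusion::arrow::record_batch::RecordBatch;
   use datafusion::datasource::MemTable;
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
       let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
       let batch = |vals: Vec<i32>| {
           let col: ArrayRef = Arc::new(Int32Array::from(vals));
           RecordBatch::try_new(schema.clone(), vec![col]).unwrap()
       };
       // Two partitions: with no ORDER BY, either partition may feed the LIMIT first.
       let table = MemTable::try_new(
           schema.clone(),
           vec![vec![batch(vec![1, 2, 3])], vec![batch(vec![4, 5, 6])]],
       )?;
   
       let ctx = SessionContext::new();
       ctx.register_table("t", Arc::new(table))?;
   
       // Any 2 of the 6 rows is a "correct" answer here...
       ctx.sql("SELECT x FROM t LIMIT 2 OFFSET 1").await?.show().await?;
       // ...whereas ORDER BY pins the result down to exactly 2 and 3.
       ctx.sql("SELECT x FROM t ORDER BY x LIMIT 2 OFFSET 1").await?.show().await?;
       Ok(())
   }
   ```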
   
   
   > > Thanks @zhuqi-lucas -- I took a look and this looks pretty nice. I think 
we can potentially simplify the APIs a bit though and I left some suggestions 
on how to do so
   > > let me know what you think
   > 
   > Thanks for the review @alamb! I agree the current API is confusing. Let me 
explain the design intent:
   > 
   > There are two cases for offset + parquet:
   > 
   > **Case 1: No filter** → offset is fully handled (raw row counts are 
exact). Safe to eliminate `GlobalLimitExec`.
   
   I think this makes sense, and it is the case @AntoinePrv describes in https://github.com/apache/datafusion/issues/19654
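   
   For reference, the Case 1 arithmetic seems simple enough to state precisely. A minimal sketch, where `prune_by_offset` is a hypothetical helper and not necessarily this PR's actual code:
   ```rust
   /// Given exact per-row-group row counts (no filter, so raw counts are exact),
   /// return (index of first row group to read, rows still to skip inside it).
   fn prune_by_offset(row_group_counts: &[u64], offset: u64) -> (usize, u64) {
       let mut remaining = offset;
       for (idx, &count) in row_group_counts.iter().enumerate() {
           if remaining < count {
               return (idx, remaining); // start here; skip `remaining` rows within
           }
           remaining -= count; // the whole row group falls before the offset
       }
       (row_group_counts.len(), 0) // offset is past EOF: nothing to read
   }
   
   fn main() {
       // Three row groups of 1M rows each; OFFSET 2_500_000 skips the first two
       // entirely and half of the third, leaving no work for GlobalLimitExec.
       assert_eq!(prune_by_offset(&[1_000_000; 3], 2_500_000), (2, 500_000));
   }
   ```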
   
   
   > **Case 2: Filter + fully matched RGs** → `prune_by_statistics` marks RGs 
where ALL rows satisfy the filter (`is_fully_matched`). For these RGs, row 
counts are still exact — safe to skip for offset. But non-fully-matched RGs 
need `GlobalLimitExec` to handle remaining offset.
   
   I am confused about how this is intended to work. For example, in your `AFTER (with WHERE)` example, what if the skipped row groups wouldn't have been skipped if the offset had been applied as the file was read? Say the last row group is fully matched and gets skipped, but its rows would never have made it into the results anyway -- should it still be counted toward the offset?
   
   
   Also, I think the `GlobalLimitExec` needs to be updated:
   
   ```
                   AFTER (with WHERE)
                   ==================
   
     GlobalLimitExec(skip=59M, fetch=5)    <---- shouldn't the skip be reduced by the
       │                                         number of rows that are skipped with
       ▼                                         fully matched RGs? How does that work
     DataSourceExec(limit=59M+5, offset=59M)     with multiple files 🤔
       │
       ▼ prune_by_offset
     Skip fully-matched RGs (WHERE stats   <---- how do you know what order these are?
     prove all rows match → row count exact)     Is it only the first ones?
       │
       ▼
     Read remaining RGs (with WHERE filter)
       │
       ▼
     GlobalLimitExec handles remaining skip
   ```
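   
   If I understand the intent, the answer to the ordering question would have to be that only a *prefix* of fully-matched row groups can be skipped: the first non-fully-matched row group contributes an unknown number of post-filter rows, so everything at and after it must be left to `GlobalLimitExec`. A sketch of that interpretation (my reading, with hypothetical names, not necessarily what this PR implements):
   ```rust
   /// Hypothetical types and names; my reading of Case 2, not the PR's code.
   struct RowGroup {
       num_rows: u64,
       fully_matched: bool, // WHERE statistics prove every row passes the filter
   }
   
   /// Skip leading fully-matched row groups that fall entirely before the offset.
   /// Returns (index of first row group to read, offset left for GlobalLimitExec).
   fn prune_prefix_by_offset(row_groups: &[RowGroup], offset: u64) -> (usize, u64) {
       let mut remaining = offset;
       for (idx, rg) in row_groups.iter().enumerate() {
           // Stop at the first RG whose post-filter row count is not exact, or
           // that straddles the offset boundary: from here on, the remaining
           // skip has to be applied to actual output rows by GlobalLimitExec.
           if !rg.fully_matched || remaining < rg.num_rows {
               return (idx, remaining);
           }
           remaining -= rg.num_rows; // exact count, entirely before the offset
       }
       (row_groups.len(), remaining)
   }
   
   fn main() {
       let rgs = [
           RowGroup { num_rows: 100, fully_matched: true },
           RowGroup { num_rows: 100, fully_matched: false }, // unknown post-filter count
           RowGroup { num_rows: 100, fully_matched: true },  // cannot be skipped either
       ];
       // Only the first (fully matched) RG is provably before the offset.
       assert_eq!(prune_prefix_by_offset(&rgs, 250), (1, 150));
   }
   ```
   If that reading is right, the `GlobalLimitExec` skip would indeed need to be reduced by exactly the pruned row count, and with multiple files the reductions would have to be accumulated in the same order the scan visits the files.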
   
   
   Given the potential complexity of handling / optimizing the general case, I wonder if there is some way to start simple (maybe with a PR that only implements Case 1, where there are no predicates and there is a clearer definition of what the expected results should be).

