alamb commented on PR #21828:
URL: https://github.com/apache/datafusion/pull/21828#issuecomment-4337743212
I think we should take a step back and maybe reconsider what we are trying
to do -- maybe by defining and documenting somewhere what the intended
semantics for `OFFSET` are.
For example, for a query like
```sql
SELECT * FROM `single_file.parquet` LIMIT 5 OFFSET 600000000
```
I think DataFusion would probably be "correct" to return any 5 rows (as the
query doesn't specify any `ORDER BY`)
However, that is probably not what the user wanted / expected. The user
probably wants rows starting at logical offset 600000000 of the file.
Likewise, what rows should be returned from this query (where there are
multiple files)?
```sql
SELECT * FROM directory_with_multiple_files LIMIT 5 OFFSET 600000000
```
That is not clear to me - is there some sort of implied global order of rows
within the file that users expect?
Another question: **Do we expect DataFusion to always produce the same
values for a given LIMIT / OFFSET combination?** (Note that today it will
potentially return different values for the same `LIMIT` when there are more
than one partition and no `ORDER BY` clause)
> > Thanks @zhuqi-lucas -- I took a look and this looks pretty nice. I think
we can potentially simplify the APIs a bit though and I left some suggestions
on how to do so
> > let me know what you think
>
> Thanks for the review @alamb! I agree the current API is confusing. Let me
explain the design intent:
>
> There are two cases for offset + parquet:
>
> **Case 1: No filter** → offset is fully handled (raw row counts are
exact). Safe to eliminate `GlobalLimitExec`.
I think this makes sense and it is the case that @AntoinePrv gives in
https://github.com/apache/datafusion/issues/19654
> **Case 2: Filter + fully matched RGs** → `prune_by_statistics` marks RGs
where ALL rows satisfy the filter (`is_fully_matched`). For these RGs, row
counts are still exact — safe to skip for offset. But non-fully-matched RGs
need `GlobalLimitExec` to handle remaining offset.
I am confused about how this is intended to work. For example in your
example `AFTER (with WHERE)`, what if the skipped row groups wouldn't have
been skipped of the offset had been applied as the file was read? For example,
what if the last row group was entirely skipped but wouldn't have made it into
the results -- should it be included in the offset calculation?
Also, I think the GlobalLimitExec needs to be updated
```
AFTER (with WHERE)
==================
GlobalLimitExec(skip=59M, fetch=5) <---- shouldn't the skip be reduced
by the number of rows
│ that are skipped with full
matching RGs? How does that work with multiple files 🤔
▼
DataSourceExec(limit=59M+5, offset=59M)
│
▼ prune_by_offset
Skip fully-matched RGs (WHERE stats <---- how do you know what order
these are? Is it only the first ones?
prove all rows match → row count exact)
│
▼
Read remaining RGs (with WHERE filter)
│
▼
GlobalLimitExec handles remaining skip
```
Given the potential complexity of handling / optimizing the general case, I
wonder if there is some way to start simple (maybe a PR that only implements
case 1, when there are no predicates, and there is a clearer definition of what
the expected results should be
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]