alamb commented on PR #21828:
URL: https://github.com/apache/datafusion/pull/21828#issuecomment-4347138643

   Thank you @zhuqi-lucas -- I am not sure this change actually solves 
@AntoinePrv's use case. I think the conservative checks will not trigger on a 
large file.
   
   For example, I tested using a 14 GB clickbench parquet file:
   
   ```shell
   cd benchmarks
   ./bench.sh  data clickbench_1
   cd data
   ```
   
   And then ran datafusion-cli from this branch:
   ```sql
   select * from 'hits.parquet' OFFSET 99000000 LIMIT 5;
   ```
   
   It took about 4 seconds on my laptop (to return 5 rows), which I think means this 
branch is not triggered:
   ```shell
   andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/benchmarks/data$ 
~/Downloads/datafusion-cli-feat_offset-pushdown
   DataFusion CLI v53.1.0
   > select * from 'hits.parquet' OFFSET 99000000 LIMIT 5;
   
+---------------------+------------+-------------------------------------------------------------------------------------+-----------+------------+-----------+-----------+-------------+----------+--------------------+--------------+-----+-----------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+-----------+-------------------+-----------------+---------------+-------------+-----------------+------------------+-----------------+------------+------------+-------------+----------+----------+----------------+----------------+--------------+------------------+----------+-------------+------------------+------------------------+-------------+----------------+----------------+--------------+-------------+-------------+-------------------+--------------------+----------------+-----------------+----------
 
-----------+---------------------+---------------------+---------------------+-------------+-------------+--------+------------+-------------+---------------------+-------------+-----------+--------------+---------+-------------+---------------+----------+----------+----------------+-----+-----+--------+-----------+-----------+------------+------------+------------+---------------+-----------------+----------------+---------------+--------------+-----------+------------+-----------+---------------+---------------------+-------------------+-------------+-----------------------+------------------+------------+--------------+---------------+-----------------+---------------------+--------------------+--------------+------------------+-----------+-----------+-------------+------------+---------+---------+----------+----------------------+---------------------+------+
   | WatchID             | JavaEnable | Title                                   
                                            | GoodEvent | EventTime  | 
EventDate | CounterID | ClientIP    | RegionID | UserID             | 
CounterClass | OS  | UserAgent | URL                                            
                                                                          | 
Referer                                                                         
                             | IsRefresh | RefererCategoryID | RefererRegionID 
| URLCategoryID | URLRegionID | ResolutionWidth | ResolutionHeight | 
ResolutionDepth | FlashMajor | FlashMinor | FlashMinor2 | NetMajor | NetMinor | 
UserAgentMajor | UserAgentMinor | CookieEnable | JavascriptEnable | IsMobile | 
MobilePhone | MobilePhoneModel | Params                 | IPNetworkID | 
TraficSourceID | SearchEngineID | SearchPhrase | AdvEngineID | IsArtifical | 
WindowClientWidth | WindowClientHeight | ClientTimeZone | ClientEventTime | 
SilverlightVersion1 | SilverlightVersion2 | SilverlightVersion3 | SilverlightVersion4 | 
PageCharset | CodeVersion | IsLink | IsDownload | IsNotBounce | FUniqID         
    | OriginalURL | HID       | IsOldCounter | IsEvent | IsParameter | 
DontCountHits | WithHash | HitColor | LocalEventTime | Age | Sex | Income | 
Interests | Robotness | RemoteIP   | WindowName | OpenerName | HistoryLength | 
BrowserLanguage | BrowserCountry | SocialNetwork | SocialAction | HTTPError | 
SendTiming | DNSTiming | ConnectTiming | ResponseStartTiming | 
ResponseEndTiming | FetchTiming | SocialSourceNetworkID | SocialSourcePage | 
ParamPrice | ParamOrderID | ParamCurrency | ParamCurrencyID | 
OpenstatServiceName | OpenstatCampaignID | OpenstatAdID | OpenstatSourceID | 
UTMSource | UTMMedium | UTMCampaign | UTMContent | UTMTerm | FromTag | HasGCLID 
| RefererHash          | URLHash             | CLID |
   
+---------------------+------------+-------------------------------------------------------------------------------------+-----------+------------+-----------+-----------+-------------+----------+--------------------+--------------+-----+-----------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+-----------+-------------------+-----------------+---------------+-------------+-----------------+------------------+-----------------+------------+------------+-------------+----------+----------+----------------+----------------+--------------+------------------+----------+-------------+------------------+------------------------+-------------+----------------+----------------+--------------+-------------+-------------+-------------------+--------------------+----------------+-----------------+----------
 
-----------+---------------------+---------------------+---------------------+-------------+-------------+--------+------------+-------------+---------------------+-------------+-----------+--------------+---------+-------------+---------------+----------+----------+----------------+-----+-----+--------+-----------+-----------+------------+------------+------------+---------------+-----------------+----------------+---------------+--------------+-----------+------------+-----------+---------------+---------------------+-------------------+-------------+-----------------------+------------------+------------+--------------+---------------+-----------------+---------------------+--------------------+--------------+------------------+-----------+-----------+-------------+------------+---------+---------+----------+----------------------+---------------------+------+
   ....
   
   5 row(s) fetched.
   Elapsed 3.192 seconds.
   ```
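
For context, here is a minimal sketch of what skipping work for a large `OFFSET` could look like at the row-group level. It is not the implementation in this PR. It assumes there is no filter, that rows are returned in file order, and that per-row-group row counts are available from the Parquet footer metadata. The function name `plan_offset_scan` and the row counts in the usage example are hypothetical.

```python
from typing import List, Tuple


def plan_offset_scan(
    row_group_rows: List[int], offset: int, limit: int
) -> List[Tuple[int, int, int]]:
    """Decide which row groups must be decoded for OFFSET/LIMIT.

    Hypothetical helper for illustration only. Returns a list of
    (row_group_index, rows_to_skip, rows_to_take) tuples. Row groups
    that fall entirely before the offset are never listed, so a reader
    could skip them using metadata alone.
    """
    plan = []
    remaining_offset = offset
    remaining_limit = limit
    for idx, num_rows in enumerate(row_group_rows):
        if remaining_limit == 0:
            break  # enough rows planned, later row groups are not needed
        if remaining_offset >= num_rows:
            # The whole row group lies before the offset: skip it.
            remaining_offset -= num_rows
            continue
        skip = remaining_offset
        take = min(num_rows - skip, remaining_limit)
        plan.append((idx, skip, take))
        remaining_offset = 0
        remaining_limit -= take
    return plan


# Three hypothetical row groups of 100 rows each, OFFSET 198 LIMIT 5:
# the first row group is skipped entirely, the rows come from the last two.
print(plan_offset_scan([100, 100, 100], 198, 5))  # [(1, 98, 2), (2, 0, 3)]
```

With a plan like this, a query such as the one above would only need to decode the row groups that overlap the requested window, rather than reading every row before the offset.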
   
   
   So I think we either need to:
   1. Claim DataFusion is doing the correct (though unexpected) thing and 
close the ticket
   2. Take a step back and see if we can implement this use case (paginating 
results) in a more performant way (e.g. implement `ORDER BY row_number()` first 
and then implement this as an optimization)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

