zhuqi-lucas commented on PR #21182:
URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4186222723

   > Actually codex and I found a wrong results bug with this implementation
   > 
   > Reproducer
   > 
   > ```sql
   > SET datafusion.optimizer.enable_sort_pushdown = true;
   >   SET datafusion.execution.target_partitions = 1;
   > 
   >   CREATE TABLE null_src_a(id INT) AS VALUES
   >     (1),
   >     (NULL);
   > 
   >   COPY (
   >     SELECT * FROM null_src_a ORDER BY id ASC NULLS LAST
   >   )
   >   TO 'nulls_repro/b_null_tail.parquet';
   > 
   >   CREATE TABLE null_src_b(id INT) AS VALUES
   >     (2),
   >     (3);
   > 
   >   COPY (
   >     SELECT * FROM null_src_b ORDER BY id ASC NULLS LAST
   >   )
   >   TO 'nulls_repro/a_nonnull.parquet';
   > 
   >   CREATE EXTERNAL TABLE t(id INT)
   >   STORED AS PARQUET
   >   LOCATION 'nulls_repro/'
   >   WITH ORDER (id ASC NULLS LAST);
   > 
   >   EXPLAIN SELECT * FROM t ORDER BY id ASC NULLS LAST;
   >   SELECT * FROM t ORDER BY id ASC NULLS LAST;
   > ```
   > 
   > The correct result (on main) is
   > 
   > ```sql
   >   SELECT * FROM t ORDER BY id ASC NULLS LAST;
   > +------+
   > | id   |
   > +------+
   > | 1    |
   > | 2    |
   > | 3    |
   > | NULL |
   > +------+
   > 4 row(s) fetched.
   > ```
   > 
   > This branch results in
   > 
   > ```sql
   >   SELECT * FROM t ORDER BY id ASC NULLS LAST;
   > 
   > +------+
   > | id   |
   > +------+
   > | 1    |
   > | NULL |
   > | 2    |
   > | 3    |
   > +------+
   > ```
   > 
   > The issue I think is that this branch incorrectly removes the sort
   > 
   > plan on main
   > 
   > ```
   > +---------------+-------------------------------+
   > | plan_type     | plan                          |
   > +---------------+-------------------------------+
   > | physical_plan | ┌───────────────────────────┐ |
   > |               | │          SortExec         │ |
   > |               | │    --------------------   │ |
   > |               | │    id@0 ASC NULLS LAST    │ |
   > |               | └─────────────┬─────────────┘ |
   > |               | ┌─────────────┴─────────────┐ |
   > |               | │       DataSourceExec      │ |
   > |               | │    --------------------   │ |
   > |               | │          files: 2         │ |
   > |               | │      format: parquet      │ |
   > |               | └───────────────────────────┘ |
   > |               |                               |
   > +---------------+-------------------------------+
   > ```
   > 
   > plan on branch
   > 
   > ```
   > +---------------+-------------------------------+
   > | plan_type     | plan                          |
   > +---------------+-------------------------------+
   > | physical_plan | ┌───────────────────────────┐ |
   > |               | │       DataSourceExec      │ |
   > |               | │    --------------------   │ |
   > |               | │          files: 2         │ |
   > |               | │      format: parquet      │ |
   > |               | └───────────────────────────┘ |
   > |               |                               |
   > +---------------+-------------------------------+
   > ```
   
   
   Thanks @alamb for the review and for finding this! Great catch.
   
   The issue is that `MinMaxStatistics` ignores NULLs: it only looks at min/max
values. So b_null_tail (min=1, max=1) appears non-overlapping with
a_nonnull (min=2, max=3), but under NULLS LAST the NULL in b_null_tail must come
after all non-null values, including those in a_nonnull.
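
To make the failure mode concrete, here is a minimal standalone sketch. It does not use DataFusion's actual types: `FileSummary`, `min_max_non_overlapping`, and `concatenation_is_sorted` are hypothetical names invented for illustration. It shows that the two files from the reproducer pass a min/max non-overlap check, yet reading them back to back does not produce an `ASC NULLS LAST` ordering.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for one Parquet file: rows are already sorted
/// ASC NULLS LAST *within* the file. Not a DataFusion type.
#[derive(Debug, Clone)]
struct FileSummary {
    rows: Vec<Option<i32>>,
}

impl FileSummary {
    /// Min over non-null values only, as min/max statistics report it.
    fn min(&self) -> Option<i32> {
        self.rows.iter().flatten().copied().min()
    }

    /// Max over non-null values only.
    fn max(&self) -> Option<i32> {
        self.rows.iter().flatten().copied().max()
    }
}

/// ASC NULLS LAST comparison: every non-null value sorts before NULL.
fn cmp_nulls_last(a: &Option<i32>, b: &Option<i32>) -> Ordering {
    match (a, b) {
        (Some(x), Some(y)) => x.cmp(y),
        (Some(_), None) => Ordering::Less,
        (None, Some(_)) => Ordering::Greater,
        (None, None) => Ordering::Equal,
    }
}

/// Orders files by min, then checks that each file's max is below the
/// next file's min. NULLs play no part in this check.
fn min_max_non_overlapping(files: &mut [FileSummary]) -> bool {
    files.sort_by_key(|f| f.min());
    files.windows(2).all(|w| w[0].max() < w[1].min())
}

/// Concatenates the files in their current order and checks whether the
/// result is globally sorted ASC NULLS LAST.
fn concatenation_is_sorted(files: &[FileSummary]) -> bool {
    let all: Vec<Option<i32>> = files.iter().flat_map(|f| f.rows.clone()).collect();
    all.windows(2)
        .all(|w| cmp_nulls_last(&w[0], &w[1]) != Ordering::Greater)
}

/// The two files from the reproducer: a_nonnull = [2, 3], b_null_tail = [1, NULL].
fn repro_files() -> Vec<FileSummary> {
    vec![
        FileSummary { rows: vec![Some(2), Some(3)] },
        FileSummary { rows: vec![Some(1), None] },
    ]
}
```

With `repro_files()`, `min_max_non_overlapping` returns true, while `concatenation_is_sorted` returns false because the concatenated rows are 1, NULL, 2, 3.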
   
   Fixed in the latest push by adding a null-count check in
`try_sort_file_groups_by_statistics`: if any file has NULLs in the sort columns
(based on `Precision::Exact` null counts from statistics), we do not upgrade to
Exact. The `SortExec` is preserved so that NULL ordering is handled correctly.
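
The shape of that check can be sketched roughly as follows. This is not the code from the PR: the local `Precision` enum only mirrors the variants of DataFusion's `Precision<T>` so the snippet compiles on its own, and `OrderingClaim` and `ordering_claim` are hypothetical names. One assumption to note: this sketch also refuses the upgrade when a null count is `Inexact` or `Absent`, which is the conservative reading and may differ from what the PR does.

```rust
/// Local stand-in mirroring the variants of DataFusion's `Precision<T>`,
/// redefined here so the sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical result type: whether the scan may claim that its output
/// already satisfies the requested ordering exactly.
#[derive(Debug, PartialEq)]
enum OrderingClaim {
    /// Ordering is guaranteed, so a SortExec above the scan can be removed.
    Exact,
    /// Ordering is only a hint, so the SortExec must stay.
    Inexact,
}

/// Decides the claim from the min/max overlap result and the per-file null
/// counts of the sort column.
///
/// Assumption (not confirmed by the PR): only a null count of exactly zero,
/// known precisely, counts as "no NULLs". Unknown or inexact counts are
/// treated like "may contain NULLs".
fn ordering_claim(
    non_overlapping: bool,
    sort_col_null_counts: &[Precision<usize>],
) -> OrderingClaim {
    let provably_null_free = sort_col_null_counts
        .iter()
        .all(|count| matches!(count, Precision::Exact(0)));

    if non_overlapping && provably_null_free {
        OrderingClaim::Exact
    } else {
        OrderingClaim::Inexact
    }
}
```

For the reproducer, the null counts would be `Exact(1)` for b_null_tail and `Exact(0)` for a_nonnull, so the claim stays `Inexact` and the sort is kept.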
   
   I also added unit tests and an slt (sqllogictest) test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

