a8356555 opened a new issue, #9022:
URL: https://github.com/apache/iceberg/issues/9022

   ### Apache Iceberg version
   
   1.4.0
   
   ### Query engine
   
   Flink
   
   ### Please describe the bug 🐞
   
   FlinkSQL Read From Iceberg using ORDER BY clause caused some data loss.
   Flink version: 1.6.2
   
   1. I have written a table to iceberg using flink
   ```sql
   CREATE CATALOG `mycatalog` 
   WITH (
       'type' = 'iceberg',
       'catalog-impl' = 'org.apache.iceberg.aws.glue.GlueCatalog',
       'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',  
       'warehouse' = 's3://mywarehouse'
   );
   
   
   CREATE TABLE IF NOT EXISTS glue01.mydb.mytable
   (
       `id` INT,
       `price` INT,
       `op` STRING,
       `date` TIMESTAMP,
       `t` TIMESTAMP
   ) 
   PARTITIONED BY (`date`,`op`);
   
   
   INSERT INTO glue01.mydb.mytable
   VALUES 
       (1, 100, 'c', TIMESTAMP '2023-01-01', TIMESTAMP '2023-01-01 00:00:00'), 
       (2, 50, 'c', TIMESTAMP '2023-01-01', TIMESTAMP '2023-01-01 00:00:01'), 
       (3, 25, 'c', TIMESTAMP '2023-01-01', TIMESTAMP '2023-01-01 00:00:02');
   
   
   INSERT INTO glue01.mydb.mytable
   VALUES 
       (1, 123, 'u', TIMESTAMP '2023-01-02', TIMESTAMP '2023-01-02 00:00:00'), 
       (2, 321, 'u', TIMESTAMP '2023-01-02', TIMESTAMP '2023-01-02 00:00:10');
   
   
   INSERT INTO glue01.mydb.mytable
   VALUES 
       (1, 100, 'r', TIMESTAMP '2023-01-03', TIMESTAMP '2023-01-03 00:00:00'),  
       (2, 50, 'r', TIMESTAMP '2023-01-03', TIMESTAMP '2023-01-03 00:00:00'), 
       (3, 25, 'r', TIMESTAMP '2023-01-03', TIMESTAMP '2023-01-03 00:00:00');
   
   
   INSERT INTO glue01.mydb.mytable
   VALUES 
       (3, 25, 'r', TIMESTAMP '2023-01-04', TIMESTAMP '2023-01-04 00:00:00'),
       (3, 25, 'd', TIMESTAMP '2023-01-04', TIMESTAMP '2023-01-04 00:10:00');
   
   
   INSERT INTO glue01.mydb.mytable
   VALUES 
       (4, 100, 'c', TIMESTAMP '2023-01-05', TIMESTAMP '2023-01-05 00:00:00'),
       (4, 123, 'u', TIMESTAMP '2023-01-05', TIMESTAMP '2023-01-05 00:10:00');
   ```
   
   2. Read Data.
   ```sql
   CREATE TEMPORARY TABLE `mytable` (
       `id` INT,
       `price` INT,
       `op` STRING,
       `t` TIMESTAMP(3),
       WATERMARK FOR `t` AS `t`
   ) WITH (
       'connector' = 'iceberg',
       'warehouse' = 's3://mywarehouse',
       'catalog-impl' = 'org.apache.iceberg.aws.glue.GlueCatalog',
       'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',  
       'catalog-name' = 'mycatalog',
       'catalog-database' = 'mydb',
       'catalog-table' = 'mytable'
   )
   
   
   SELECT 
       `id`,
       `price`,
       `op`,
       `t`
   from `mytable` 
   /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/
   ```
   the results are not ordered
   ```
   # 
+----+-------------+-------------+--------------------------------+-------------------------+
   # | op |          id |       price |                             op |        
               t |
   # 
+----+-------------+-------------+--------------------------------+-------------------------+
   # | +I |           3 |          25 |                              r | 
2023-01-04 00:00:00.000 |
   # | +I |           3 |          25 |                              d | 
2023-01-04 00:10:00.000 |
   # | +I |           4 |         123 |                              u | 
2023-01-05 00:10:00.000 |
   # | +I |           4 |         100 |                              c | 
2023-01-05 00:00:00.000 |
   # | +I |           1 |         123 |                              u | 
2023-01-02 00:00:00.000 |
   # | +I |           2 |         321 |                              u | 
2023-01-02 00:00:10.000 |
   # | +I |           1 |         100 |                              r | 
2023-01-03 00:00:00.000 |
   # | +I |           2 |          50 |                              r | 
2023-01-03 00:00:00.000 |
   # | +I |           3 |          25 |                              r | 
2023-01-03 00:00:00.000 |
   # | +I |           1 |         100 |                              c | 
2023-01-01 00:00:00.000 |
   # | +I |           2 |          50 |                              c | 
2023-01-01 00:00:01.000 |
   # | +I |           3 |          25 |                              c | 
2023-01-01 00:00:02.000 |
   ```
   I want the results ordered by `t`
   ```sql
   select 
       `id`,
       `price`,
       `op`,
       `t`
   from `_table_b2f233fadf3d47478fac1f13620a08ca` 
   /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/
   order by `t`
   ```
   but some data lost
   ```
   # 
+----+-------------+-------------+--------------------------------+-------------------------+
   # | op |          id |       price |                             op |        
               t |
   # 
+----+-------------+-------------+--------------------------------+-------------------------+
   # | +I |           3 |          25 |                              r | 
2023-01-04 00:00:00.000 |
   # | +I |           3 |          25 |                              d | 
2023-01-04 00:10:00.000 |
   # | +I |           4 |         123 |                              u | 
2023-01-05 00:10:00.000 |
   ```
   How could I retrieve data in order without a data loss? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to