puchengy opened a new issue, #46:
URL: https://github.com/apache/iceberg-python/issues/46

   ### Apache Iceberg version
   
   None
   
   ### Please describe the bug 🐞
   
   In v1, the data file `spec_id` is optional. Spark is able to recognize the 
`spec_id`, but pyiceberg returns `spec_id=None` for the same files. Any idea why?
   
   spark
   ```
   spark-sql> select * from pyang.test_ray_iceberg_read.files;
   content      file_path       file_format     spec_id partition       
record_count    file_size_in_bytes      column_sizes    value_counts    
null_value_counts       nan_value_counts        lower_bounds    upper_bounds    
key_metadata    split_offsets   equality_ids    sort_order_id   readable_metrics
   0    
s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-02/userid_bucket_16=4/00000-2-72876d76-7f6a-4b82-812e-5390351917ef-00001.parquet
     PARQUET 1       {"dt":"2022-01-02","userid_bucket_16":4}        1       
871     {1:36,2:37,3:46}        {1:1,2:1,3:1}   {1:0,2:0,3:0}   {}      
{1:,2:2,3:2022-01-02}   {1:,2:2,3:2022-01-02}   NULL    [4]     NULL    0       
{"col":{"column_size":37,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2","upper_bound":"2"},"dt":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2022-01-02","upper_bound":"2022-01-02"},"userid":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":2,"upper_bound":2}}
   0    
s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-01/00000-1-f2b3a0c1-a3e3-482a-bf24-9831626c5be7-00001.parquet
        PARQUET 0       {"dt":"2022-01-01","userid_bucket_16":null}     1       
870     {1:36,2:36,3:46}        {1:1,2:1,3:1}   {1:0,2:0,3:0}   {}      
{1:,2:1,3:2022-01-01}   {1:,2:1,3:2022-01-01}   NULL    [4]     NULL    0       
{"col":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"1","upper_bound":"1"},"dt":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2022-01-01","upper_bound":"2022-01-01"},"userid":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":1,"upper_bound":1}}
   Time taken: 0.494 seconds, Fetched 2 row(s)
   ```
   
   pyiceberg (0.4.0)
   ```
   >>> tasks2[0]
   
FileScanTask(file=DataFile[file_path='s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-02/userid_bucket_16=4/00000-2-72876d76-7f6a-4b82-812e-5390351917ef-00001.parquet',
 file_format=FileFormat.PARQUET, partition=Record[dt='2022-01-02', 
userid_bucket_16=4], record_count=1, file_size_in_bytes=871, column_sizes={1: 
36, 2: 37, 3: 46}, value_counts={1: 1, 2: 1, 3: 1}, null_value_counts={1: 0, 2: 
0, 3: 0}, nan_value_counts={}, lower_bounds={1: b'\x02\x00\x00\x00', 2: b'2', 
3: b'2022-01-02'}, upper_bounds={1: b'\x02\x00\x00\x00', 2: b'2', 3: 
b'2022-01-02'}, key_metadata=None, split_offsets=[4], sort_order_id=0, 
content=DataFileContent.DATA, equality_ids=None, spec_id=None], 
delete_files=set(), start=0, length=871)
   >>> tasks2[1]
   
FileScanTask(file=DataFile[file_path='s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-01/00000-1-f2b3a0c1-a3e3-482a-bf24-9831626c5be7-00001.parquet',
 file_format=FileFormat.PARQUET, partition=Record[dt='2022-01-01'], 
record_count=1, file_size_in_bytes=870, column_sizes={1: 36, 2: 36, 3: 46}, 
value_counts={1: 1, 2: 1, 3: 1}, null_value_counts={1: 0, 2: 0, 3: 0}, 
nan_value_counts={}, lower_bounds={1: b'\x01\x00\x00\x00', 2: b'1', 3: 
b'2022-01-01'}, upper_bounds={1: b'\x01\x00\x00\x00', 2: b'1', 3: 
b'2022-01-01'}, key_metadata=None, split_offsets=[4], sort_order_id=0, 
content=DataFileContent.DATA, equality_ids=None, spec_id=None], 
delete_files=set(), start=0, length=870)
   ```
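
To check this across every task instead of reading each repr, a small helper along these lines could be used. This is only a sketch: `paths_missing_spec_id` is a hypothetical name, and the stand-in objects below merely mimic the `file.file_path` and `file.spec_id` attributes visible in the output above (with shortened, made-up paths). With a real table, the assumption is that the tasks would come from something like `table.scan().plan_files()`.

```python
from types import SimpleNamespace


def paths_missing_spec_id(tasks):
    """Return the file paths of scan tasks whose data file has spec_id set to None."""
    return [t.file.file_path for t in tasks if t.file.spec_id is None]


# Stand-ins shaped like the FileScanTask objects shown above (hypothetical paths).
tasks = [
    SimpleNamespace(file=SimpleNamespace(file_path="dt=2022-01-02/a.parquet", spec_id=None)),
    SimpleNamespace(file=SimpleNamespace(file_path="dt=2022-01-01/b.parquet", spec_id=0)),
]

print(paths_missing_spec_id(tasks))
```

For the table in this report, every planned file would be expected to show up in that list, since both tasks print `spec_id=None`.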


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

