adriangb commented on code in PR #21965:
URL: https://github.com/apache/datafusion/pull/21965#discussion_r3171778155


##########
datafusion/catalog-listing/src/options.rs:
##########
@@ -280,15 +283,13 @@ impl ListingOptions {
     ) -> datafusion_common::Result<SchemaRef> {
         let store = state.runtime_env().object_store(table_path)?;
 
-        let files: Vec<_> = table_path
+        let all_files: Vec<_> = table_path
             .list_all_files(state, store.as_ref(), &self.file_extension)
             .await?
-            // Empty files cannot affect schema but may throw when trying to 
read for it
-            .try_filter(|object_meta| future::ready(object_meta.size > 0))

Review Comment:
   This just means we carry around memory for the ObjectMeta of zero sized 
files until a couple lines later. I think this is not a big problem.
   
   The alternative is that we error even where there are 0 byte files present. 
I think that's an interesting discussion: e.g. a completely empty `data.csv`. 
Or hive partitioned directories with no data. I think all of these should still 
require an explicit schema or error, but there are tests that check the 
opposite behavior:
   
     - test_csv_empty_file — registers tests/data/empty_0_byte.csv (0 bytes, no 
header, no data) and runs SELECT * FROM empty.                                  
                                                                                
                                         
     - test_csv_multiple_empty_files — folder of 0-byte CSVs. Same situation.   
                                                                                
                                                                                
                                                                                
                                   
     - it_can_read_empty_ndjson — 0-byte JSON file. Same.                       
                                                                                
                                                                                
                                                                                
                                   
     - test_read_empty_parquet — 0-byte parquet file. Same.                     
                                                                                
                                                                                
                                                                                
                                   
     - test_read_partitioned_empty_parquet — partition dir with a 0-byte 
parquet.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to