lolloz98 opened a new issue, #46796:
URL: https://github.com/apache/arrow/issues/46796

   ### Describe the enhancement requested
   
   Let's add the option to partition vertically: e.g.
   file1: col1, col2
   file2: col3
   provided that:
   - file1 and file2 have no column names in common
   - file1 and file2 have the same number of row groups
   - row groups with the same index have the same number of rows
   - (whatever assumption I am forgetting)
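   
   A minimal sketch of how these preconditions could be checked. The function 
name check_vertically_mergeable is hypothetical. It only relies on the 
attributes num_row_groups, row_group(i).num_rows and schema.names, which I 
believe pyarrow's FileMetaData (e.g. from pyarrow.parquet.read_metadata) 
exposes, so it is written against that shape rather than a specific library:

```python
def check_vertically_mergeable(left, right):
    """Return a list of problems; an empty list means the two files line up.

    `left` and `right` are assumed to look like pyarrow.parquet.FileMetaData:
    they expose `schema.names`, `num_row_groups` and `row_group(i).num_rows`.
    """
    problems = []

    # Condition 1: no column name may appear in both files.
    shared = set(left.schema.names) & set(right.schema.names)
    if shared:
        problems.append(f"duplicate column names: {sorted(shared)}")

    # Condition 2: both files must have the same number of row groups.
    if left.num_row_groups != right.num_row_groups:
        problems.append(
            f"row group count differs: {left.num_row_groups} "
            f"vs {right.num_row_groups}"
        )
        return problems  # per-row-group comparison is meaningless here

    # Condition 3: row groups with the same index must have the same row count.
    for i in range(left.num_row_groups):
        l_rows = left.row_group(i).num_rows
        r_rows = right.row_group(i).num_rows
        if l_rows != r_rows:
            problems.append(f"row group {i}: {l_rows} rows vs {r_rows} rows")

    return problems
```

   With pyarrow installed, the inputs would presumably come from 
pq.read_metadata("file1.parquet") and pq.read_metadata("file2.parquet").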
   
   It should be possible to merge just the metadata of the two files into a new 
metadata-only parquet file, so that no data needs to be copied or rewritten.
   
   What I was thinking was to simply add to FileMetaData a method called 
MergeHorizontally which takes another FileMetaData, checks the conditions 
above, and modifies the original FileMetaData object as needed so that it can 
then be dumped to a new file. Special care will be needed to use set_file_path 
correctly on each FileMetaData. How to handle the key-value metadata would be 
left to the user (and I think it would be great for users to be able to merge 
the two file-level key-value metadata as they need).
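   
   To illustrate the idea, here is a toy model of the merge, not the 
parquet-cpp implementation. The classes ColumnChunk and RowGroup and the 
function merge_column_wise are made up for this sketch. They only mirror the 
fact that, in the Parquet format, each column chunk can carry a file_path 
pointing at the file that holds its data:

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    name: str
    file_path: str  # file that physically holds this chunk's data


@dataclass
class RowGroup:
    num_rows: int
    columns: list  # list of ColumnChunk


def merge_column_wise(left, right):
    """Combine two lists of RowGroup so each row group lists all columns."""
    if len(left) != len(right):
        raise ValueError("files have a different number of row groups")

    merged = []
    for i, (l, r) in enumerate(zip(left, right)):
        if l.num_rows != r.num_rows:
            raise ValueError(f"row group {i} has mismatched row counts")
        shared = {c.name for c in l.columns} & {c.name for c in r.columns}
        if shared:
            raise ValueError(f"duplicate column names: {sorted(shared)}")
        # Only metadata is combined; each chunk keeps pointing at its own file.
        merged.append(RowGroup(l.num_rows, l.columns + r.columns))
    return merged


file1 = [RowGroup(100, [ColumnChunk("col1", "file1.parquet"),
                        ColumnChunk("col2", "file1.parquet")])]
file2 = [RowGroup(100, [ColumnChunk("col3", "file2.parquet")])]
merged = merge_column_wise(file1, file2)
```

   In this model the merged row group lists col1, col2 and col3, with col3 
still pointing at file2.parquet, which is the role set_file_path would play in 
the real FileMetaData.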
   
   tl;dr
   
   I have a case in which we cannot write all the columns together, because 
the data for some of the columns is not available at write time.
   
   As an example: we have the data for column1 and column2 and we write them. 
Then we get the data for column3 (or we compute it separately) and dump that in 
a new file.
   
   After extensive research, I found that there are ways to handle this: using 
polars lazy frames, using duckdb positional join... However, none of these 
methods is ideal: we don't have the granular control that parquet provides, and 
we end up with a worse query plan when doing something a bit more complex (in 
the worst case, we cannot use the information that the row group sizes are the 
same across the two files, and end up needing to read the entire files).
   
   I see that there is support for splitting a file horizontally.
   
   One can use Hive partitioning, or directly write multiple files split by 
rows.
   
   For the second approach, I see that the FileMetaData class already supports 
AppendRowGroups.
   
   In a similar way we could add a MergeHorizontally method. By doing so, I 
think there is a good chance that existing readers would be able to use the 
newly created metadata file to read the partitioned parquet as one.
   As I mentioned, the great benefit is that we can take full advantage of 
row groups having the same number of rows. Moreover, we can more easily handle 
a massive number of columns.
   The good thing is that, from my understanding, this change would already be 
in compliance with the specifications.
   
   ### Component(s)
   
   C#, Python, Parquet


