lolloz98 opened a new issue, #46796: URL: https://github.com/apache/arrow/issues/46796
### Describe the enhancement requested

Let's add the option to partition vertically, e.g.:

- file1: col1, col2
- file2: col3

provided that:

- file1 and file2 column names are disjoint
- file1 and file2 have the same number of row groups
- row groups with matching indexes have matching numbers of rows
- (whatever assumption I am forgetting)

It should be possible to merge only the metadata of the two files into a new metadata-only Parquet file, so that we do not need to perform copies or rewrites. What I was thinking is to add to `FileMetaData` a method called `MergeHorizontally` which takes another `FileMetaData`, checks the conditions above, and modifies the original `FileMetaData` object so that it can then be dumped to a new file -- special care will be needed to use `set_file_path` correctly on each `FileMetaData`. Handling the key-value metadata will be left to the user (and I think it would be great for the user to be able to merge the two file-level key-value metadata as they need).

tl;dr: I have a case where we cannot write all the columns together, because we do not have the data for some of the columns yet. As an example: we have the data for column1 and column2 and we write them. Then we get the data for column3 (or we compute it separately) and dump it into a new file.

After extensive research, there are ways to handle this: Polars lazy frames, DuckDB positional joins... However, none of these methods is ideal: we lose the granular control that Parquet provides, and we end up with worse query plans when doing something a bit more complex (in the worst case, we cannot use the fact that the row group sizes are the same across the two files, and end up needing to read the entire files).

I see that there is support for splitting a file horizontally: one can use Hive partitioning, or directly write multiple files split by rows. For the second approach, the `FileMetaData` class already provides `AppendRowGroups`. In a similar way we could add a `MergeHorizontally` method (see the sketch at the end of this issue). By doing so, I think there is a good chance that existing readers would already be able to use the newly created metadata file to read the partitioned Parquet files as one.

As I mentioned, the great benefit is that we can take full advantage of the row groups having the same number of rows. Moreover, we can more easily handle a massive number of columns. The good thing is that, from my understanding, this change would already be in compliance with the Parquet specification.

### Component(s)

C#, Python, Parquet
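
To make the proposal concrete, here is a minimal pyarrow sketch. The precondition checks and the row-wise `set_file_path` / `append_row_groups` / `write_metadata_file` calls use the existing `FileMetaData` API; the file names and the `merge_horizontally` call at the end are purely illustrative of the proposed API and do not exist today.

```python
import pyarrow.parquet as pq

# Two vertically partitioned files, assumed written by the same job:
# cols_ab.parquet holds col1 and col2, cols_c.parquet holds col3.
md_ab = pq.read_metadata("cols_ab.parquet")
md_c = pq.read_metadata("cols_c.parquet")

# Preconditions the proposed merge would have to verify, expressed with the
# existing FileMetaData accessors:
names_ab = set(md_ab.schema.to_arrow_schema().names)
names_c = set(md_c.schema.to_arrow_schema().names)
assert names_ab.isdisjoint(names_c)                 # no overlapping column names
assert md_ab.num_row_groups == md_c.num_row_groups  # same number of row groups
for i in range(md_ab.num_row_groups):
    # row groups with the same index must hold the same number of rows
    assert md_ab.row_group(i).num_rows == md_c.row_group(i).num_rows

# For comparison, the existing horizontal (row-wise) merge already works today:
# set_file_path points each metadata object at its data file, append_row_groups
# stacks the row groups, and write_metadata_file produces a metadata-only file.
#
#   md_part0.set_file_path("part-0.parquet")
#   md_part1.set_file_path("part-1.parquet")
#   md_part0.append_row_groups(md_part1)
#   md_part0.write_metadata_file("_metadata")

# Hypothetical column-wise counterpart proposed in this issue (name and
# signature are illustrative only, not part of pyarrow today):
#
#   md_ab.set_file_path("cols_ab.parquet")
#   md_c.set_file_path("cols_c.parquet")
#   md_ab.merge_horizontally(md_c)          # splice col3's column chunks into each row group
#   md_ab.write_metadata_file("_metadata")  # readers would see one logical table
```

With such a metadata-only file, a reader could plan reads across both data files and exploit the aligned row-group boundaries, which the positional-join workarounds described above cannot easily do.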