Fokko commented on code in PR #6775:
URL: https://github.com/apache/iceberg/pull/6775#discussion_r1129941505


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -484,12 +490,82 @@ def expression_to_pyarrow(expr: BooleanExpression) -> pc.Expression:
     return boolean_expression_visit(expr, _ConvertToArrowExpression())
 
 
-def _file_to_table(
+@lru_cache
+def _get_file_format(file_format: FileFormat, **kwargs: Dict[str, Any]) -> ds.FileFormat:
+    if file_format == FileFormat.PARQUET:
+        return ds.ParquetFileFormat(**kwargs)
+    else:
+        raise ValueError(f"Unsupported file format: {file_format}")
+
+
+def _construct_fragment(fs: FileSystem, data_file: DataFile, file_format_kwargs: Dict[str, Any] = EMPTY_DICT) -> ds.Fragment:
+    _, path = PyArrowFileIO.parse_location(data_file.file_path)
+    return _get_file_format(data_file.file_format, **file_format_kwargs).make_fragment(path, fs)
+
+
+def _read_deletes(fs: FileSystem, data_file: DataFile) -> Dict[str, pa.ChunkedArray]:
+    delete_fragment = _construct_fragment(
+        fs, data_file, file_format_kwargs={"dictionary_columns": ("file_path",), "pre_buffer": True, "buffer_size": ONE_MEGABYTE}

Review Comment:
   Good question. I noticed that my deletes Parquet file is only 81 kB on disk. Looking at it in memory:
   ```
   >>> pq.read_table(source='00098-7715-aedfa79b-f3b0-423d-8753-0224852e5aa2-00001-deletes.parquet').nbytes
   7358065
   >>> pq.read_table(source='00098-7715-aedfa79b-f3b0-423d-8753-0224852e5aa2-00001-deletes.parquet').to_pandas().info()
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 51455 entries, 0 to 51454
   Data columns (total 2 columns):
    #   Column     Non-Null Count  Dtype 
   ---  ------     --------------  ----- 
    0   file_path  51455 non-null  object
    1   pos        51455 non-null  int64 
   dtypes: int64(1), object(1)
   memory usage: 804.1+ KB
   ```
   With `read_dictionary`:
   ```
   >>> pq.read_table(source='00098-7715-aedfa79b-f3b0-423d-8753-0224852e5aa2-00001-deletes.parquet', read_dictionary=['file_path']).nbytes
   617595
   >>> pq.read_table(source='00098-7715-aedfa79b-f3b0-423d-8753-0224852e5aa2-00001-deletes.parquet', read_dictionary=['file_path']).to_pandas().info()
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 51455 entries, 0 to 51454
   Data columns (total 2 columns):
    #   Column     Non-Null Count  Dtype   
   ---  ------     --------------  -----   
    0   file_path  51455 non-null  category
    1   pos        51455 non-null  int64   
   dtypes: category(1), int64(1)
   memory usage: 452.5 KB
   ``` 
   It seems to be quite efficient. Also, reading the `file_path` column into a `DictionaryArray` gives us a smaller footprint as well. But I'm also comfortable bumping this to 8 megabytes. Keep in mind that this is unrelated to the data files; I agree that we should keep those at a minimum of 8 megabytes.
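   
   For reference, a minimal sketch of the fragment-based read this patch sets up, assuming a hypothetical local delete file (the path below is a placeholder). PyArrow routes `dictionary_columns` into the Parquet read options and `pre_buffer`/`buffer_size` into the fragment scan options, so `file_path` comes back dictionary-encoded:
   ```
   import pyarrow.dataset as ds
   from pyarrow.fs import LocalFileSystem
   
   # Hypothetical path; substitute any positional-delete Parquet file.
   path = "/tmp/deletes.parquet"
   
   # Mirrors what _construct_fragment does in the patch: dictionary-encode
   # file_path and buffer reads at 1 MB (ONE_MEGABYTE in the patch).
   file_format = ds.ParquetFileFormat(
       dictionary_columns=("file_path",),
       pre_buffer=True,
       buffer_size=1024 * 1024,
   )
   fragment = file_format.make_fragment(path, LocalFileSystem())
   table = fragment.to_table()
   print(table.column("file_path").type)  # dictionary<values=string, indices=int32, ordered=0>
   print(table.nbytes)                    # dictionary-encoded in-memory footprint
   ```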


