Re: [PR] Sanitized special character column name before writing to parquet [iceberg-python]

via GitHub Thu, 11 Apr 2024 02:22:19 -0700


HonahX commented on code in PR #590:
URL: https://github.com/apache/iceberg-python/pull/590#discussion_r1560606567



##########
tests/integration/test_writes/test_writes.py:
##########
@@ -270,6 +270,25 @@ def get_current_snapshot_id(identifier: str) -> int:
     assert tbl.current_snapshot().snapshot_id == get_current_snapshot_id(identifier)  # type: ignore
 
 
+@pytest.mark.integration
+def test_python_writes_special_character_column_with_spark_reads(spark: SparkSession, session_catalog: Catalog) -> None:
+    identifier = "default.python_writes_special_character_column_with_spark_reads"
+    column_name_with_special_character = "letter/abc"
+    TEST_DATA_WITH_SPECIAL_CHARACTER_COLUMN = {
+        column_name_with_special_character: ['a', None, 'z'],
+    }
+    pa_schema = pa.schema([
+        (column_name_with_special_character, pa.string()),
+    ])
+    arrow_table_with_special_character_column = pa.Table.from_pydict(TEST_DATA_WITH_SPECIAL_CHARACTER_COLUMN, schema=pa_schema)
+    tbl = _create_table(session_catalog, identifier, {"format-version": "1"}, schema=pa_schema)

Review Comment:
   Shall we test both format version 1 and 2 tables in this test? We can use
   ```python
   @pytest.mark.parametrize("format_version", [1, 2])
   ```
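
A minimal sketch of what each parametrized case might feed into the table-creation call. The `case_for` helper and the `_v{format_version}` identifier suffix are hypothetical, added only for illustration; the string property value mirrors the `{"format-version": "1"}` used in the diff above.

```python
from typing import Dict, Tuple

# Identifier taken from the test in the diff above.
BASE_IDENTIFIER = "default.python_writes_special_character_column_with_spark_reads"


def case_for(format_version: int) -> Tuple[str, Dict[str, str]]:
    """Build a per-version identifier and table properties (hypothetical helper)."""
    # A distinct identifier per version keeps the two cases from sharing a table.
    identifier = f"{BASE_IDENTIFIER}_v{format_version}"
    # The diff passes the format version as a string, so convert the int here.
    properties = {"format-version": str(format_version)}
    return identifier, properties


# The same values that @pytest.mark.parametrize("format_version", [1, 2]) would supply.
cases = [case_for(version) for version in (1, 2)]
for identifier, properties in cases:
    print(identifier, properties)
```

Inside the real test, these two values would stand in for the hard-coded `identifier` and `{"format-version": "1"}` arguments.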



##########
pyiceberg/io/pyarrow.py:
##########
@@ -1772,12 +1772,13 @@ def write_file(io: FileIO, table_metadata: TableMetadata, tasks: Iterator[WriteT
     )
 
     def write_parquet(task: WriteTask) -> DataFile:
+        df = pa.Table.from_batches(task.record_batches)
+        df = df.rename_columns(schema.column_names)

Review Comment:
   Shall we extend the integration test to cover the nested schema case? For example:
   ```python
   pa.field('name', pa.string()),
   pa.field('address', pa.struct([
       pa.field('street', pa.string()),
       pa.field('city', pa.string()),
       pa.field('zip', pa.int32()),
   ])),
   ```
   
   Update: I got
   ```
   pyarrow.lib.ArrowInvalid: tried to rename a table of 4 columns but only 7 names were provided
   ```
   when trying the following dataset:
   ```python
   TEST_DATA_WITH_SPECIAL_CHARACTER_COLUMN = {
       column_name_with_special_character: ['a', None, 'z'],
       'id': [1, 2, 3],
       'name': ['AB', 'CD', 'EF'],
       'address': [
           {'street': '123', 'city': 'SFO', 'zip': 12345},
           {'street': '456', 'city': 'SW', 'zip': 67890},
           {'street': '789', 'city': 'Random', 'zip': 10112},
       ],
   }
   pa_schema = pa.schema([
       pa.field(column_name_with_special_character, pa.string()),
       pa.field('id', pa.int32()),
       pa.field('name', pa.string()),
       pa.field('address', pa.struct([
           pa.field('street', pa.string()),
           pa.field('city', pa.string()),
           pa.field('zip', pa.int32()),
       ])),
   ])
   ```
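
The 4-versus-7 mismatch in the error is consistent with the name list including nested fields alongside top-level ones. The sketch below is plain Python with no pyarrow or pyiceberg calls, and it assumes (not verified here) that `schema.column_names` lists nested fields by dotted path. `SCHEMA_LAYOUT` and `flattened_names` are hypothetical names used only to reproduce the counts.

```python
from typing import Any, Dict, List

# Layout mirroring the dataset above; a dict value stands for a struct field.
SCHEMA_LAYOUT: Dict[str, Any] = {
    "letter/abc": "string",
    "id": "int32",
    "name": "string",
    "address": {"street": "string", "city": "string", "zip": "int32"},
}


def flattened_names(layout: Dict[str, Any], prefix: str = "") -> List[str]:
    """Return every field name, including nested ones, as dotted paths."""
    names: List[str] = []
    for name, field_type in layout.items():
        full_name = f"{prefix}{name}"
        names.append(full_name)
        # Recurse into struct fields so their children are listed too.
        if isinstance(field_type, dict):
            names.extend(flattened_names(field_type, prefix=f"{full_name}."))
    return names


top_level = list(SCHEMA_LAYOUT)  # what a top-level column rename operates on
all_names = flattened_names(SCHEMA_LAYOUT)  # assumed shape of the flattened name list

print(len(top_level), len(all_names))
```

Under that assumption, the table has 4 top-level columns while the flattened list carries 7 names, so a top-level rename needs only the top-level names, with nested fields handled separately.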



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

