Re: [PR] fix: sanitize invalid Avro field names in manifest file [iceberg-python]

via GitHub Thu, 31 Jul 2025 13:20:07 -0700


kevinjqliu commented on code in PR #2245:
URL: https://github.com/apache/iceberg-python/pull/2245#discussion_r2246261560



##########
pyiceberg/utils/schema_conversion.py:
##########
@@ -524,12 +532,19 @@ def field(self, field: NestedField, field_result: AvroType) -> AvroType:
         if isinstance(field_result, dict) and field_result.get("type") == "record":
             field_result["name"] = f"r{field.field_id}"
 
+        orig_field_name = field.name
+        is_valid_field_name = _valid_avro_name(orig_field_name)
+        field_name = orig_field_name if is_valid_field_name else make_compatible_name(orig_field_name)
+
         result = {
-            "name": field.name,
-            "field-id": field.field_id,
+            "name": field_name,
+            FIELD_ID_PROP: field.field_id,
             "type": field_result if field.required else ["null", field_result],
         }
 
+        if not is_valid_field_name:
+            result[ICEBERG_FIELD_NAME_PROP] = orig_field_name

Review Comment:
   Thanks for adding this. Here's the relevant code from Java:
   
https://github.com/apache/iceberg/blob/1bd8d5e2de56d05180030b856ce2c50c66ef1f13/core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java#L117-L120
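
For readers following along, here is a minimal standalone sketch of what this hunk appears to do: emit a sanitized Avro name and keep the original name in an extra property when the two differ. It does not import pyiceberg. The helper names (`valid_avro_name`, `sanitize_name`, `to_avro_field`) are hypothetical stand-ins, the ASCII-only regexes are a simplification, and the `"iceberg-field-name"` key is an assumption based on the Java code linked above. Only `"field-id"` is confirmed by the diff.

```python
import re

FIELD_ID_PROP = "field-id"  # value shown in the diff
ICEBERG_FIELD_NAME_PROP = "iceberg-field-name"  # assumed value, not shown in the diff


def valid_avro_name(name: str) -> bool:
    """Avro names start with a letter or underscore, then letters, digits, underscores."""
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name) is not None


def sanitize_name(name: str) -> str:
    """Escape a leading digit or any invalid character (ASCII-only simplification)."""

    def escape(match: re.Match) -> str:
        ch = match.group()
        if ch.isdigit():
            return "_" + ch
        return "_x" + hex(ord(ch))[2:].upper()

    return re.sub(r"^[0-9]|[^A-Za-z0-9_]", escape, name)


def to_avro_field(name: str, field_id: int, avro_type, required: bool) -> dict:
    """Build the Avro field dict, recording the original name only when it was changed."""
    is_valid = valid_avro_name(name)
    result = {
        "name": name if is_valid else sanitize_name(name),
        FIELD_ID_PROP: field_id,
        "type": avro_type if required else ["null", avro_type],
    }
    if not is_valid:
        result[ICEBERG_FIELD_NAME_PROP] = name
    return result


print(to_avro_field("invalid.field", 2, "int", True))
```

Under these assumptions, a reader that knows about the extra property can recover `invalid.field`, while a generic Avro reader only sees the valid name `invalid_x2Efield`.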



##########
tests/integration/test_avro_compatibility.py:
##########
@@ -0,0 +1,335 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import tempfile
+
+import pytest
+from fastavro import reader
+
+import pyiceberg.avro.file as avro
+from pyiceberg.io.pyarrow import PyArrowFileIO
+from pyiceberg.schema import Schema
+from pyiceberg.typedef import Record
+from pyiceberg.types import IntegerType, NestedField, StringType
+from pyiceberg.utils.schema_conversion import AvroSchemaConversion
+
+
+class AvroTestRecord(Record):
+    """Test record class for Avro compatibility testing."""
+
+    @property
+    def valid_field(self) -> str:
+        return self._data[0]
+
+    @property
+    def invalid_field(self) -> int:
+        return self._data[1]
+
+    @property
+    def field_with_dot(self) -> str:
+        return self._data[2]
+
+    @property
+    def field_with_hash(self) -> int:
+        return self._data[3]
+
+    @property
+    def field_starting_with_digit(self) -> str:
+        return self._data[4]
+
+
+@pytest.mark.integration
+def test_avro_compatibility() -> None:
+    """Test that Avro files with sanitized names can be read by other tools."""
+
+    schema = Schema(
+        NestedField(field_id=1, name="valid_field", field_type=StringType(), required=True),
+        NestedField(field_id=2, name="invalid.field", field_type=IntegerType(), required=True),
+        NestedField(field_id=3, name="field_with_dot", field_type=StringType(), required=True),
+        NestedField(field_id=4, name="field_with_hash", field_type=IntegerType(), required=True),

Review Comment:
   nit: are these supposed to contain `.` and `#`? As written, `field_with_dot` and `field_with_hash` are already valid Avro names.
   
   field_id=2 already covers the `.` case.
   



##########
pyiceberg/schema.py:
##########
@@ -1391,7 +1409,9 @@ def _sanitize_name(name: str) -> str:
 
 
 def _sanitize_char(character: str) -> str:
-    return "_" + character if character.isdigit() else "_x" + hex(ord(character))[2:].upper()
+    if character.isdigit():
+        return "_" + character
+    return "_x" + hex(ord(character))[2:].upper()

Review Comment:
   Reviewer note: this is the same implementation, just refactored.
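
A quick way to convince yourself of that is to run both forms side by side. This is a standalone sketch: the names `sanitize_char_old` and `sanitize_char_new` are hypothetical copies of the removed and added lines in this hunk, not pyiceberg functions.

```python
def sanitize_char_old(character: str) -> str:
    # The removed one-liner; the conditional expression binds looser than "+"
    return "_" + character if character.isdigit() else "_x" + hex(ord(character))[2:].upper()


def sanitize_char_new(character: str) -> str:
    # The added form with an early return
    if character.isdigit():
        return "_" + character
    return "_x" + hex(ord(character))[2:].upper()


for ch in [".", "#", "9", " ", "-"]:
    assert sanitize_char_old(ch) == sanitize_char_new(ch)
    print(repr(ch), "->", sanitize_char_new(ch))
```

Digits get an underscore prefix (`9` becomes `_9`), and everything else becomes `_x` followed by the uppercase hex code point (`.` becomes `_x2E`, `#` becomes `_x23`).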



##########
pyiceberg/utils/schema_conversion.py:
##########
@@ -524,12 +532,19 @@ def field(self, field: NestedField, field_result: AvroType) -> AvroType:
         if isinstance(field_result, dict) and field_result.get("type") == "record":
             field_result["name"] = f"r{field.field_id}"
 
+        orig_field_name = field.name
+        is_valid_field_name = _valid_avro_name(orig_field_name)
+        field_name = orig_field_name if is_valid_field_name else make_compatible_name(orig_field_name)

Review Comment:
   nit: `make_compatible_name` already performs the `_valid_avro_name` check itself:
   
https://github.com/apache/iceberg-python/blob/904c0b77c33768716e8672b38bae73bfcf565fbf/pyiceberg/schema.py#L1357-L1361
   
   We can simplify this part:
   ```suggestion
           field_name = make_compatible_name(orig_field_name)
   ```
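
To illustrate why the guard is redundant, here is a standalone sketch. It assumes `make_compatible_name` returns valid names unchanged, as the linked lines suggest. The helpers below are simplified, hypothetical stand-ins (ASCII-only), not the pyiceberg implementations.

```python
import re


def valid_avro_name(name: str) -> bool:
    # Stand-in for the validity check (ASCII-only simplification)
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name) is not None


def make_compatible_name(name: str) -> str:
    # Assumed shape: sanitize only when the name is not already valid
    if valid_avro_name(name):
        return name

    def escape(match: re.Match) -> str:
        ch = match.group()
        return "_" + ch if ch.isdigit() else "_x" + hex(ord(ch))[2:].upper()

    return re.sub(r"^[0-9]|[^A-Za-z0-9_]", escape, name)


def with_guard(name: str) -> str:
    # Shape of the line in the PR
    return name if valid_avro_name(name) else make_compatible_name(name)


def without_guard(name: str) -> str:
    # Shape of the suggested line
    return make_compatible_name(name)


for n in ["valid_field", "invalid.field", "field#hash", "9lives"]:
    assert with_guard(n) == without_guard(n)
```

One thing to keep in mind: the later `if not is_valid_field_name:` check in this hunk still needs some signal that the name changed, so either keep the flag or compare `field_name != orig_field_name`.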
   



##########
tests/integration/test_avro_compatibility.py:
##########


Review Comment:
   Thanks for adding these tests! One nit: these aren't "integration tests" per se. Integration tests in pyiceberg rely on the Docker infra (for Spark, the Iceberg REST catalog, Hive, etc.).
   
   Integration tests are more difficult to run because they require setup.
   
   I don't think any of these tests rely on that infra. Can we move them to the regular tests, maybe in `test_avro_sanitization.py`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
