wgtmac commented on code in PR #190:
URL: https://github.com/apache/iceberg-cpp/pull/190#discussion_r2299792892
##########
src/iceberg/avro/avro_schema_util.cc:
##########
@@ -65,6 +65,31 @@ ::avro::CustomAttributes GetAttributesWithFieldId(int32_t
field_id) {
} // namespace
+std::string SanitizeFieldName(std::string_view field_name) {
Review Comment:
The implementation is obviously wrong. Please double check the Java impl.
##########
src/iceberg/avro/avro_schema_util.cc:
##########
@@ -188,8 +213,19 @@ Status ToAvroNodeVisitor::Visit(const StructType& type,
::avro::NodePtr* node) {
::avro::NodePtr field_node;
ICEBERG_RETURN_UNEXPECTED(Visit(sub_field, &field_node));
- // TODO(gangwu): sanitize field name
- (*node)->addName(std::string(sub_field.name()));
+ // Sanitize field name to ensure it follows Avro field name requirements
+ std::string sanitized_name = SanitizeFieldName(sub_field.name());
Review Comment:
In most cases, field names do not need sanitization. This may introduce a
large regression.
##########
src/iceberg/avro/avro_schema_util_internal.h:
##########
@@ -163,4 +163,9 @@ Result<::avro::NodePtr> MakeAvroNodeWithFieldIds(const
::avro::NodePtr& original
Result<::avro::NodePtr> MakeAvroNodeWithFieldIds(const ::avro::NodePtr&
original_node,
const NameMapping& mapping);
+/// \brief Sanitize a field name to make it compatible with Avro field name
requirements.
Review Comment:
Could you please also add a `bool ValidateFieldName(std::string_view)`?
##########
src/iceberg/avro/avro_schema_util.cc:
##########
@@ -188,8 +213,19 @@ Status ToAvroNodeVisitor::Visit(const StructType& type,
::avro::NodePtr* node) {
::avro::NodePtr field_node;
ICEBERG_RETURN_UNEXPECTED(Visit(sub_field, &field_node));
- // TODO(gangwu): sanitize field name
- (*node)->addName(std::string(sub_field.name()));
+ // Sanitize field name to ensure it follows Avro field name requirements
+ std::string sanitized_name = SanitizeFieldName(sub_field.name());
+
+ // Store original name as a custom attribute if it was modified
+ if (sanitized_name != sub_field.name()) {
Review Comment:
ditto, we can avoid string comparison by validating the original field name
at the beginning.
##########
src/iceberg/avro/avro_schema_util.cc:
##########
@@ -839,7 +922,15 @@ Result<::avro::NodePtr> CreateRecordNodeWithFieldIds(const
::avro::NodePtr& orig
// Recursively apply field IDs to nested fields
ICEBERG_ASSIGN_OR_RAISE(auto new_nested_node,
MakeAvroNodeWithFieldIds(field_node,
*nested_field));
- new_record_node->addName(field_name);
+ std::string sanitized_name = SanitizeFieldName(field_name);
Review Comment:
It worths a common function.
##########
src/iceberg/avro/avro_schema_util_internal.h:
##########
@@ -163,4 +163,9 @@ Result<::avro::NodePtr> MakeAvroNodeWithFieldIds(const
::avro::NodePtr& original
Result<::avro::NodePtr> MakeAvroNodeWithFieldIds(const ::avro::NodePtr&
original_node,
const NameMapping& mapping);
+/// \brief Sanitize a field name to make it compatible with Avro field name
requirements.
Review Comment:
In addition, it would be good to explicitly state what are the `Avro field
name requirements` in the comment.
##########
test/avro_schema_test.cc:
##########
@@ -47,8 +47,56 @@ void CheckFieldIdAt(const ::avro::NodePtr& node, size_t
index, int32_t field_id,
ASSERT_EQ(attrs.getAttribute(key),
std::make_optional(std::to_string(field_id)));
}
+// Helper function to check if a custom attribute exists for a field name
preservation
+void CheckIcebergFieldName(const ::avro::NodePtr& node, size_t index,
+ const std::string& original_name) {
+ ASSERT_LT(index, node->customAttributes());
+ const auto& attrs = node->customAttributesAt(index);
+ ASSERT_EQ(attrs.getAttribute("iceberg-field-name"),
std::make_optional(original_name));
+}
+
} // namespace
+TEST(SanitizeFieldNameTest, ValidFieldNames) {
+ // Valid field names should remain unchanged
+ EXPECT_EQ(SanitizeFieldName("valid_field"), "valid_field");
+ EXPECT_EQ(SanitizeFieldName("field123"), "field123");
+ EXPECT_EQ(SanitizeFieldName("_private"), "_private");
+ EXPECT_EQ(SanitizeFieldName("CamelCase"), "CamelCase");
+ EXPECT_EQ(SanitizeFieldName("field_with_underscores"),
"field_with_underscores");
+}
+
+TEST(SanitizeFieldNameTest, InvalidFieldNames) {
+ // Field names starting with numbers should be prefixed with underscore
+ EXPECT_EQ(SanitizeFieldName("123field"), "_123field");
+ EXPECT_EQ(SanitizeFieldName("0value"), "_0value");
+
+ // Field names with special characters should have them replaced with
underscores
+ EXPECT_EQ(SanitizeFieldName("field-name"), "field_name");
+ EXPECT_EQ(SanitizeFieldName("field.name"), "field_name");
+ EXPECT_EQ(SanitizeFieldName("field name"), "field_name");
+ EXPECT_EQ(SanitizeFieldName("field@name"), "field_name");
+ EXPECT_EQ(SanitizeFieldName("field#name"), "field_name");
+
+ // Complex field names with multiple issues
+ EXPECT_EQ(SanitizeFieldName("1field-with.special@chars"),
"_1field_with_special_chars");
+ EXPECT_EQ(SanitizeFieldName("user-email"), "user_email");
+ EXPECT_EQ(SanitizeFieldName("价格"),
+ "_______"); // Non-ASCII characters become underscores
+}
+
+TEST(SanitizeFieldNameTest, EdgeCases) {
+ // Empty field name
+ EXPECT_EQ(SanitizeFieldName(""), "_empty");
+
+ // Field name with only special characters
+ EXPECT_EQ(SanitizeFieldName("@#$"), "____");
Review Comment:
Is this PR mostly generated by AI? This is obviously wrong. I'd suggest to
read the Java implementation in detail and be responsible of every line of code
to avoid wasting time of reviewers.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]