adriangb commented on code in PR #21807:
URL: https://github.com/apache/datafusion/pull/21807#discussion_r3138208467


##########
datafusion/physical-expr/src/expressions/dynamic_filters.rs:
##########
@@ -88,6 +92,148 @@ struct Inner {
     is_complete: bool,
 }
 
+/// An atomic snapshot of a [`DynamicFilterPhysicalExpr`] used to reconstruct 
the expression during
+/// serialization / deserialization.
+pub struct DynamicFilterSnapshot {

Review Comment:
   I don't think this is necessary or makes things any simpler



##########
datafusion/proto/src/physical_plan/to_proto.rs:
##########
@@ -281,14 +277,12 @@ pub fn serialize_physical_expr_with_converter(
             )),
         };
         return Ok(protobuf::PhysicalExprNode {
-            expr_id: None,

Review Comment:
   Keeping `expr_id` will also reduce the diff.



##########
datafusion/proto/src/physical_plan/mod.rs:
##########
@@ -3839,85 +3836,16 @@ impl PhysicalProtoConverterExtension for 
DefaultPhysicalProtoConverter {
     }
 }
 
-/// Internal serializer that adds expr_id to expressions.
-/// Created fresh for each serialization operation.
-struct DeduplicatingSerializer {

Review Comment:
   As per comment above I think we should keep `expr_id` on `PhysicalExprNode`. 
But I think the drop in complexity and the fact that this is added to the 
public `PhysicalExpr` trait justifies dropping this specialized serializer and 
instead having the default one call `PhysicalExpr::expr_id()` and put that into 
the proto.



##########
datafusion/proto/proto/datafusion.proto:
##########
@@ -875,13 +875,8 @@ message PhysicalExprNode {
   // Was date_time_interval_expr
   reserved 17;
 
-  // Unique identifier for this expression to do deduplication during 
deserialization.
-  // When serializing, this is set to a unique identifier for each combination 
of
-  // expression, process and serialization run.
-  // When deserializing, if this ID has been seen before, the cached Arc is 
returned
-  // instead of creating a new one, enabling reconstruction of referential 
integrity
-  // across serde roundtrips.
-  optional uint64 expr_id = 30;
+  // Was expr_id
+  reserved 30;

Review Comment:
   Why move this? For reasons previously discussed it might be helpful to have 
an expression id on arbitrary expressions, e.g. to deduplicate large lists or 
strings. Leaving it here also minimizes changes / breakage.



##########
datafusion/physical-expr/src/expressions/dynamic_filters.rs:
##########
@@ -88,6 +92,148 @@ struct Inner {
     is_complete: bool,
 }
 
+/// An atomic snapshot of a [`DynamicFilterPhysicalExpr`] used to reconstruct 
the expression during
+/// serialization / deserialization.
+pub struct DynamicFilterSnapshot {

Review Comment:
   I see now that this is for atomicity. That's a good bug catch! But that sort 
of thing should be called out in the docstring, otherwise readers need to dig 
into implementations to understand why this module exists.
   
   It's an unfortunate situation that all expressions / execution nodes have to 
make their internal details public to all callers just so that proto 
serialization can serialize them. I think we (DataFusion) should at least 
consider some hybrid where we split the proto crate into `proto-models` and 
`proto` so that anything can depend on `proto-models` (behind a feature flag?) 
and thus expressions can define their own serialization into `prost` structs. 
cc @alamb 
   
   All of this said: maybe it would be easier to just make `Inner` public and 
have an `inner() -> Inner` method that clones it? I would suggest if we do this 
we add warnings in docstrings to not assume this will always be public/stable, 
it's only for proto, etc.



##########
datafusion/physical-expr/src/expressions/dynamic_filters.rs:
##########
@@ -346,6 +493,11 @@ impl DynamicFilterPhysicalExpr {
 
         write!(f, " ]")
     }
+
+    /// Generate a new expression id for this filter.
+    pub fn new_expression_id() -> u64 {
+        random::<u64>()
+    }

Review Comment:
   Why is this a public function?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to