xiedeyantu commented on code in PR #22370: URL: https://github.com/apache/datafusion/pull/22370#discussion_r3281321403
########## datafusion/optimizer/src/expand_join_or_predicate.rs: ########## @@ -0,0 +1,174 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! [`ExpandJoinOrPredicate`] rewrites inner joins with OR filters into a UNION ALL +//! of mutually exclusive hashjoin-capable inner joins. + +use crate::optimizer::ApplyOrder; +use crate::{OptimizerConfig, OptimizerRule}; +use std::sync::Arc; + +use datafusion_common::tree_node::Transformed; +use datafusion_common::Result; +use datafusion_expr::logical_plan::{Join, LogicalPlan, Projection, Union}; +use datafusion_expr::utils::{can_hash, find_valid_equijoin_key_pair, split_binary_owned, split_conjunction_owned}; +use datafusion_expr::{Expr, ExprSchemable, JoinType, Operator}; + +#[derive(Default, Debug)] +pub struct ExpandJoinOrPredicate; + +impl ExpandJoinOrPredicate { Review Comment: @2010YOUY01 Thank you so much for your detailed explanation of this logic! I think it's a very good idea, and for inner joins, this implementation is optimal, completing the entire logic directly at the physical execution layer. Regarding my current proposal, I support this implementation. However, since I'm not entirely clear on the execution-level logic, implementing it using DisjointHashJoinExec might take some time. Actually, there's 2 PRs(https://github.com/apache/calcite/pull/4300, https://github.com/apache/calcite/pull/4315) I submitted to Calcite, This rule file (https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/rules/JoinExpandOrToUnionRule.java) will be more intuitive. where I implemented inner/left/right/full/anti joins (I didn't implement semi-join because its semantics are not easily split into multiple mutually exclusive joins). Their methods for splitting multiple join branches differ (refer to the comments in the connection code above). This isn't easily implemented using DisjointHashJoinExec; for example, left joins would be split into inner and anti joins. I limited the PR to inner joins because I wanted to implement the first step first, as the performance improvement was significant when testing joins of two tables (1000+ rows). I can't construct SQL to test other scenarios yet, as they all involve anti joins. If everyone accepts this soluti on, I will expand it to support more join types later. This is why I implemented it as a separate rule. There's another reason, which I also mentioned above: when the table only has 10 rows of data, the overhead becomes apparent, slowing down this optimization. Therefore, parameters or statistical information can help decide whether to rewrite it. Regarding your first question, I think we can achieve scan reuse at either the logical or physical layer; we don't need to worry too much about rewriting the logic. Regarding the second question, I think it might be difficult for us to do, perhaps because my understanding of DataFusion's execution layer is still limited. Regarding the third question, similar to the first, the complexity of the plan is not necessarily directly related to the actual execution logic or performance. This is just my personal opinion; please correct me if I'm wrong. Thank you very much for participating in the discussion. If we're only considering inner joins (without expanding to other join types), I personally like the first solution @Dandandan mentioned. @2010YOUY01 If you've already implemented it, then I'll close this PR. If you're interested in further expansions of this PR, I can implement these capabilities through multiple PRs. Looking forward to your reply! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
