[
https://issues.apache.org/jira/browse/PIG-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Koji Noguchi updated PIG-3492:
------------------------------
Attachment: pig-3492-v0.12_01.patch
I see three Jiras that added LogicalRelationalOperator.fixDuplicateUids.
* PIG-3020 (LOJoin) "'Duplicate uid in schema' error when joining two relations
derived from the same load statement"
* PIG-3144 (LOGenerate) "Erroneous map entry alias resolution leading to
'Duplicate schema alias' errors"
* PIG-3292 (LOCross) "Logical plan invalid state: duplicate uid in schema
during self-join to get cross product"
I'm skipping PIG-3292 since Daniel reviewed it with the comment
"Interplay with ColumnPruner is fine here since nested plan will include
entire required plan branch"
PIG-3020 (LOJoin) actually talks about two separate problems.
(i-1) PigParser failing with 'Duplicate schema alias: age'. Only happened in
0.11.
This was actually about ImplicitSplitInserter's new uid not propagating to
the top foreach.
I believe this issue was fixed later by PIG-3310 ("ImplicitSplitInserter does
not generate new uids for nested schema fields, leading to miscomputations"
fixed only in 0.12). Confirmed by running a simple test without
LogicalRelationalOperator.fixDuplicateUids.
(i-2) 'describe' showing incorrect schema due to duplicate UID. Happened on
0.10 and 0.11.
This was due to 'describe' being called without
LogicalPlanOptimizer.optimize() which includes some important rules like
ImplicitSplitInserter and DuplicateForEachColumnRewrite.
(ii) The PIG-3144 (LOGenerate) issue seems to have started after a completely
unrelated Jira,
PIG-2710 "Implement Naive CUBE operator", in 0.11.
{noformat}
src/org/apache/pig/parser/LogicalPlanBuilder.java
+    private void expandAndResetVisitor(SourceLocation loc,
+            LogicalRelationalOperator lrop) throws ParserValidationException {
+        try {
+            (new ProjectStarExpander(lrop.getPlan())).visit();
+            (new ProjStarInUdfExpander(lrop.getPlan())).visit();
+            new SchemaResetter(lrop.getPlan(), true).visit();
+        } catch (FrontendException e) {
+            throw new ParserValidationException(intStream, loc, e);
+        }
+    }
     String buildForeachOp(SourceLocation loc, LOForEach op, String alias,
             String inputAlias, LogicalPlan innerPlan)
             throws ParserValidationException {
         op.setInnerPlan( innerPlan );
         alias = buildOp( loc, op, alias, inputAlias, null );
-        (new ProjectStarExpander(op.getPlan())).visit(op);
-        (new ProjStarInUdfExpander(op.getPlan())).visit(op);
-        new SchemaResetter(op.getPlan(), true).visit(op);
+        expandAndResetVisitor(loc, op);
         return alias;
     }
{noformat}
So we began traversing the entire plan (visit()) each time an operator is
built, instead of visiting only the operator being built (visit(op)).
As a result, the 'alias' gets updated before
LogicalPlanOptimizer.optimize() -> DuplicateForEachColumnRewrite runs, which
causes the "Duplicate schema alias" error. Rolling back this change seems to
restore the pre-0.11 behavior.
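To illustrate the difference in traversal scope between the two calls, here is a minimal, self-contained sketch. It does not use any Pig classes: ToyOp, ToyPlan and ToyResetter are hypothetical stand-ins for an operator, a plan and a SchemaResetter-like visitor, and the sketch only models how many operators get touched, not what the real visitors do to schemas or aliases.

```java
import java.util.ArrayList;
import java.util.List;

final class VisitScopeSketch {

    /** Hypothetical stand-in for a logical operator; counts how often it was reset. */
    static final class ToyOp {
        int resetCount = 0;
    }

    /** Hypothetical stand-in for a plan: an ordered list of operators. */
    static final class ToyPlan {
        final List<ToyOp> ops = new ArrayList<>();
    }

    /** Hypothetical stand-in for a resetting visitor with both entry points. */
    static final class ToyResetter {
        private final ToyPlan plan;

        ToyResetter(ToyPlan plan) {
            this.plan = plan;
        }

        /** Whole-plan traversal: touches every operator currently in the plan. */
        void visit() {
            for (ToyOp op : plan.ops) {
                visit(op);
            }
        }

        /** Single-operator visit: touches only the operator passed in. */
        void visit(ToyOp op) {
            op.resetCount++;
        }
    }

    /**
     * Builds n operators one at a time, running the resetter after each build,
     * and returns the total number of resets performed across all operators.
     */
    static int totalResets(int n, boolean wholePlan) {
        ToyPlan plan = new ToyPlan();
        for (int i = 0; i < n; i++) {
            ToyOp op = new ToyOp();
            plan.ops.add(op);
            ToyResetter resetter = new ToyResetter(plan);
            if (wholePlan) {
                resetter.visit();   // re-touches operators built earlier
            } else {
                resetter.visit(op); // touches only the operator just built
            }
        }
        int total = 0;
        for (ToyOp op : plan.ops) {
            total += op.resetCount;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("visit(op): " + totalResets(3, false));
        System.out.println("visit():   " + totalResets(3, true));
    }
}
```

In this toy model, building three operators resets three schemas with visit(op) but six with visit(), because operators built earlier are revisited on every later build. That revisiting is the kind of side effect suspected above.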
Uploading an initial patch. The goal is to take out the
LogicalRelationalOperator.fixDuplicateUids calls added by both PIG-3020 (LOJoin)
and PIG-3144 (LOGenerate).
(i-1) For release-0.12: No-op. For release-0.11: Backport PIG-3310.
(i-2) We can either fix it by forcing compilePp() before describe or moving
ImplicitSplitInserter/DuplicateForEachColumnRewrite to PigServer.compile().
There is a comment that says
{noformat}
./src/org/apache/pig/PigServer.java
    private void compile(LogicalPlan lp) throws FrontendException {
        ....

        // TODO: move optimizer here from HExecuteEngine.
        // TODO: input/output validation visitor

{noformat}
For now, I'm taking an easy approach of calling compilePp() for describe.
(ii) I'm rolling back a small section of PIG-2710 in
src/org/apache/pig/parser/LogicalPlanBuilder.java that was hopefully meant only
to shorten the code, with the change in behavior being unintended.
For now, the patch applies only to release 0.12, since it seems the location of
LogicalPlanOptimizer.optimize() may change in the near future (PIG-3508).
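As background for the symptom in the issue description quoted below, here is a toy sketch of why a uid that is changed in one place but not propagated can make a pruner drop a column that is still in use. This is not Pig's ColumnPruner: the prune() helper, the uid values, and the idea that pruning is a simple "keep what downstream requires" filter are all simplifying assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UidPropagationSketch {

    /** Toy pruning rule: keep only produced columns whose uid is required downstream. */
    static List<Long> prune(List<Long> producedUids, Set<Long> requiredUids) {
        List<Long> kept = new ArrayList<>();
        for (Long uid : producedUids) {
            if (requiredUids.contains(uid)) {
                kept.add(uid);
            }
        }
        return kept;
    }

    /**
     * A producer emits columns with uids [1, 2]; a downstream consumer uses uid 2.
     * A "fix" then renumbers uid 2 to uid 3 on the producer side. If propagate is
     * true, the consumer's reference is updated as well; otherwise it is left stale.
     * Returns the columns that survive pruning.
     */
    static List<Long> demo(boolean propagate) {
        List<Long> produced = new ArrayList<>(Arrays.asList(1L, 2L));
        Set<Long> required = new HashSet<>(Arrays.asList(2L));

        long oldUid = 2L;
        long newUid = 3L;
        produced.set(produced.indexOf(oldUid), newUid); // producer-side renumbering

        if (propagate) {
            required.remove(oldUid); // consumer-side update (the missing step)
            required.add(newUid);
        }
        return prune(produced, required);
    }

    public static void main(String[] args) {
        System.out.println("not propagated: " + demo(false));
        System.out.println("propagated:     " + demo(true));
    }
}
```

Without propagation the consumer still asks for the old uid, nothing the producer emits matches it, and the used column is pruned. With propagation the renumbered column is kept.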
> ColumnPrune dropping used column due to
> LogicalRelationalOperator.fixDuplicateUids changes not propagating
> ----------------------------------------------------------------------------------------------------------
>
> Key: PIG-3492
> URL: https://issues.apache.org/jira/browse/PIG-3492
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11.1, 0.12.1, 0.13.0
> Reporter: Koji Noguchi
> Attachments: pig-3492-v0.12_01.patch
>
>
> I don't have a testcase I can upload at the moment, but here's my observation.
> SplitFilter -> schemaResetter -> LOGenerate.getSchema ->
> LogicalRelationalOperator.fixDuplicateUids() creating a new UID but that UID
> is not propagated to the entire plan (since SplitFilter.reportChanges only
> returns subplan).
> As a result, I am seeing ColumnPruning cutting off those used columns.