[ 
https://issues.apache.org/jira/browse/GEODE-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinwoo Hwang updated GEODE-10508:
---------------------------------
    Description: 
h2. Summary

Fix nondeterminism warnings generated by ANTLR during compilation of the OQL 
(Object Query Language) grammar file. The warnings indicate ambiguity in the 
parser that prevents it from deterministically choosing between alternative 
production rules.
h2. Problem Statement
h3. Current Behavior

During the {{generateGrammarSource}} task, ANTLR produces the following 
nondeterminism warnings:
{noformat}
> Task :geode-core:generateGrammarSource
/home/runner/work/geode/geode/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g:574:40:
 
warning:nondeterminism between alts 1 and 2 of block upon
    k==1:"sum","avg","min","max","count"
    k==2:TOK_LPAREN

/home/runner/work/geode/geode/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g:578:13:
 
warning:nondeterminism between alts 1 and 2 of block upon
    k==1:"sum","avg","min","max","count"
    k==2:TOK_LPAREN

/home/runner/work/geode/geode/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g:961:26:
 
warning:nondeterminism between alts 1 and 2 of block upon
    k==1:"distinct"
    k==2:TOK_LPAREN,TOK_PLUS,TOK_MINUS,TOK_DOLLAR,...

/home/runner/work/geode/geode/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g:979:17:
 
warning:nondeterminism between alts 1 and 2 of block upon
    k==1:"distinct"
    k==2:TOK_LPAREN,TOK_PLUS,TOK_MINUS,TOK_DOLLAR,...
{noformat}
h3. Root Cause Analysis

*Lines 574 & 578 (projection rule):*

The parser cannot distinguish between {{aggregateExpr}} and {{expr}} 
alternatives when it encounters aggregate function keywords ({{{}sum{}}}, 
{{{}avg{}}}, {{{}min{}}}, {{{}max{}}}, {{{}count{}}}) because:
 * These keywords can start an aggregate expression like {{sum(field)}}
 * The same keywords can also be used as identifiers in regular expressions
 * Without lookahead, the parser cannot determine which production rule to apply

*Lines 961 & 979 (aggregateExpr rule):*

The optional {{distinct}} keyword creates ambiguity:
 * The parser cannot decide whether to match the optional {{distinct}} keyword 
or skip it and proceed directly to the expression
 * This is because both alternatives are valid, and the default behavior 
doesn't specify preference

h2. Proposed Solution
h3. 1. Add Syntactic Predicates (Lines 574 & 578)

Use ANTLR syntactic predicates to perform lookahead and resolve ambiguity in 
the {{projection}} rule:
{code}
Unable to find source-code formatter for language: antlr. Available languages 
are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, 
groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, 
php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml// 
Before:
( tok1:aggregateExpr{node = #tok1;} | tok2:expr{node = #tok2;})

// After:
( (("sum"|"avg"|"min"|"max"|"count") TOK_LPAREN)=> tok1:aggregateExpr{node = 
#tok1;} | tok2:expr{node = #tok2;})
{code}
The predicate {{(("sum"|"avg"|"min"|"max"|"count") TOK_LPAREN)=>}} tells the 
parser:
 * Look ahead to check if an aggregate keyword is followed by a left parenthesis
 * If true, choose the {{aggregateExpr}} alternative
 * Otherwise, choose the {{expr}} alternative

h3. 2. Add Greedy Option (Lines 961 & 979)

Use the {{greedy}} option for optional {{distinct}} keywords in the 
{{aggregateExpr}} rule:
{code}
Unable to find source-code formatter for language: antlr. Available languages 
are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, 
groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, 
php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml// 
Before:
TOK_LPAREN ("distinct"! {distinctOnly = true;} ) ? tokExpr1:expr TOK_RPAREN

// After:
TOK_LPAREN (options {greedy=true;}: "distinct"! {distinctOnly = true;} ) ? 
tokExpr1:expr TOK_RPAREN
{code}
The {{greedy}} option instructs the parser to match the {{distinct}} keyword 
whenever it appears, removing ambiguity.
h3. 3. Fix Test Compatibility Issue

The grammar changes will alter token numbering in the generated lexer. Update 
{{AbstractCompiledValueTestJUnitTest.java}} to use constants instead of 
hardcoded token values:
{code:java}
// Before:
new CompiledJunction(new CompiledValue[] {compiledValue1, compiledValue2}, 89)

// After:
import org.apache.geode.cache.query.internal.parse.OQLLexerTokenTypes;
...
new CompiledJunction(new CompiledValue[] {compiledValue1, compiledValue2}, 
OQLLexerTokenTypes.LITERAL_or)
{code}
*Rationale:* Adding syntactic predicates changes token numbering (e.g., 
{{LITERAL_or}} changes from 89 to 94). Using the constant ensures tests remain 
correct regardless of future grammar modifications.
h2. Implementation Plan
h3. Phase 1: Grammar Modifications
 # Update 
{{geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g}}
 # Add syntactic predicates to the {{projection}} rule (2 locations)
 # Add {{greedy}} option to {{aggregateExpr}} rule (2 locations)
 # Add inline comments explaining the reasoning for each change

h3. Phase 2: Test Updates
 # Update {{AbstractCompiledValueTestJUnitTest.java}}
 # Add import for {{OQLLexerTokenTypes}}
 # Replace hardcoded value {{89}} with {{OQLLexerTokenTypes.LITERAL_or}}
 # Add comments explaining why the constant is used

h3. Phase 3: Validation
 # Run {{./gradlew :geode-core:generateGrammarSource}} to verify no warnings
 # Run {{./gradlew :geode-core:test --tests 
"org.apache.geode.cache.query.internal.parse.*"}} to verify parser tests pass
 # Run {{./gradlew :geode-core:test --tests 
"org.apache.geode.cache.query.internal.Abstract*"}} to verify updated test 
passes
 # Run full {{geode-core}} test suite to ensure no regressions

h2. Expected Outcomes
h3. Success Criteria

(/) Zero nondeterminism warnings from ANTLR grammar generation
(/) All existing query parsing tests pass
(/) {{AbstractCompiledValueTestJUnitTest}} passes with updated token usage
(/) No behavior changes to OQL query parsing or execution
(/) Token numbering is stable and predictable
h3. Impact Assessment

*Risk Level:* Low
 * Changes only affect parser generation, not runtime behavior
 * Syntactic predicates and greedy options are standard ANTLR features
 * All test coverage validates correct parsing behavior

*Performance Impact:* None
 * Grammar changes only affect compile-time parser generation
 * No runtime performance impact

*Compatibility:* Maintained
 * No changes to OQL syntax or semantics
 * Fully backward compatible with existing queries

h2. Files to be Modified
 # 
{{geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g}} 
(Grammar file)
 # 
{{geode-core/src/test/java/org/apache/geode/cache/query/internal/AbstractCompiledValueTestJUnitTest.java}}
 (Test file)

h2. References
 * [ANTLR 2 Documentation - Syntactic 
Predicates|https://www.antlr2.org/doc/predicates.html]
 * [ANTLR 2 Documentation - Greedy 
Subrules|https://www.antlr2.org/doc/options.html]
 * Apache Geode OQL Documentation

h2. Testing Strategy
h3. Unit Tests
 * Verify all query parser tests pass
 * Verify aggregate function parsing works correctly
 * Verify distinct keyword handling in aggregate functions

h3. Integration Tests
 * Run OQL query tests with aggregate functions
 * Test queries with projection containing aggregate expressions
 * Test queries mixing aggregate and non-aggregate projections

h3. Regression Tests
 * Run full geode-core test suite
 * Verify no changes to generated parser behavior
 * Confirm token numbering stability

> Remediation of ANTLR Nondeterminism Warnings in OQL Grammar
> -----------------------------------------------------------
>
>                 Key: GEODE-10508
>                 URL: https://issues.apache.org/jira/browse/GEODE-10508
>             Project: Geode
>          Issue Type: Improvement
>            Reporter: Jinwoo Hwang
>            Priority: Major
>
> h2. Summary
> Fix nondeterminism warnings generated by ANTLR during compilation of the OQL 
> (Object Query Language) grammar file. The warnings indicate ambiguity in the 
> parser that prevents it from deterministically choosing between alternative 
> production rules.
> h2. Problem Statement
> h3. Current Behavior
> During the {{generateGrammarSource}} task, ANTLR produces the following 
> nondeterminism warnings:
> {noformat}
> > Task :geode-core:generateGrammarSource
> /home/runner/work/geode/geode/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g:574:40:
>  
> warning:nondeterminism between alts 1 and 2 of block upon
>     k==1:"sum","avg","min","max","count"
>     k==2:TOK_LPAREN
> /home/runner/work/geode/geode/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g:578:13:
>  
> warning:nondeterminism between alts 1 and 2 of block upon
>     k==1:"sum","avg","min","max","count"
>     k==2:TOK_LPAREN
> /home/runner/work/geode/geode/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g:961:26:
>  
> warning:nondeterminism between alts 1 and 2 of block upon
>     k==1:"distinct"
>     k==2:TOK_LPAREN,TOK_PLUS,TOK_MINUS,TOK_DOLLAR,...
> /home/runner/work/geode/geode/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g:979:17:
>  
> warning:nondeterminism between alts 1 and 2 of block upon
>     k==1:"distinct"
>     k==2:TOK_LPAREN,TOK_PLUS,TOK_MINUS,TOK_DOLLAR,...
> {noformat}
> h3. Root Cause Analysis
> *Lines 574 & 578 (projection rule):*
> The parser cannot distinguish between {{aggregateExpr}} and {{expr}} 
> alternatives when it encounters aggregate function keywords ({{{}sum{}}}, 
> {{{}avg{}}}, {{{}min{}}}, {{{}max{}}}, {{{}count{}}}) because:
>  * These keywords can start an aggregate expression like {{sum(field)}}
>  * The same keywords can also be used as identifiers in regular expressions
>  * Without lookahead, the parser cannot determine which production rule to 
> apply
> *Lines 961 & 979 (aggregateExpr rule):*
> The optional {{distinct}} keyword creates ambiguity:
>  * The parser cannot decide whether to match the optional {{distinct}} 
> keyword or skip it and proceed directly to the expression
>  * This is because both alternatives are valid, and the default behavior 
> doesn't specify preference
> h2. Proposed Solution
> h3. 1. Add Syntactic Predicates (Lines 574 & 578)
> Use ANTLR syntactic predicates to perform lookahead and resolve ambiguity in 
> the {{projection}} rule:
> {code}
> Unable to find source-code formatter for language: antlr. Available languages 
> are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, 
> groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, 
> perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, 
> yaml// Before:
> ( tok1:aggregateExpr{node = #tok1;} | tok2:expr{node = #tok2;})
> // After:
> ( (("sum"|"avg"|"min"|"max"|"count") TOK_LPAREN)=> tok1:aggregateExpr{node = 
> #tok1;} | tok2:expr{node = #tok2;})
> {code}
> The predicate {{(("sum"|"avg"|"min"|"max"|"count") TOK_LPAREN)=>}} tells the 
> parser:
>  * Look ahead to check if an aggregate keyword is followed by a left 
> parenthesis
>  * If true, choose the {{aggregateExpr}} alternative
>  * Otherwise, choose the {{expr}} alternative
> h3. 2. Add Greedy Option (Lines 961 & 979)
> Use the {{greedy}} option for optional {{distinct}} keywords in the 
> {{aggregateExpr}} rule:
> {code}
> Unable to find source-code formatter for language: antlr. Available languages 
> are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, 
> groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, 
> perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, 
> yaml// Before:
> TOK_LPAREN ("distinct"! {distinctOnly = true;} ) ? tokExpr1:expr TOK_RPAREN
> // After:
> TOK_LPAREN (options {greedy=true;}: "distinct"! {distinctOnly = true;} ) ? 
> tokExpr1:expr TOK_RPAREN
> {code}
> The {{greedy}} option instructs the parser to match the {{distinct}} keyword 
> whenever it appears, removing ambiguity.
> h3. 3. Fix Test Compatibility Issue
> The grammar changes will alter token numbering in the generated lexer. Update 
> {{AbstractCompiledValueTestJUnitTest.java}} to use constants instead of 
> hardcoded token values:
> {code:java}
> // Before:
> new CompiledJunction(new CompiledValue[] {compiledValue1, compiledValue2}, 89)
> // After:
> import org.apache.geode.cache.query.internal.parse.OQLLexerTokenTypes;
> ...
> new CompiledJunction(new CompiledValue[] {compiledValue1, compiledValue2}, 
> OQLLexerTokenTypes.LITERAL_or)
> {code}
> *Rationale:* Adding syntactic predicates changes token numbering (e.g., 
> {{LITERAL_or}} changes from 89 to 94). Using the constant ensures tests 
> remain correct regardless of future grammar modifications.
> h2. Implementation Plan
> h3. Phase 1: Grammar Modifications
>  # Update 
> {{geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g}}
>  # Add syntactic predicates to the {{projection}} rule (2 locations)
>  # Add {{greedy}} option to {{aggregateExpr}} rule (2 locations)
>  # Add inline comments explaining the reasoning for each change
> h3. Phase 2: Test Updates
>  # Update {{AbstractCompiledValueTestJUnitTest.java}}
>  # Add import for {{OQLLexerTokenTypes}}
>  # Replace hardcoded value {{89}} with {{OQLLexerTokenTypes.LITERAL_or}}
>  # Add comments explaining why the constant is used
> h3. Phase 3: Validation
>  # Run {{./gradlew :geode-core:generateGrammarSource}} to verify no warnings
>  # Run {{./gradlew :geode-core:test --tests 
> "org.apache.geode.cache.query.internal.parse.*"}} to verify parser tests pass
>  # Run {{./gradlew :geode-core:test --tests 
> "org.apache.geode.cache.query.internal.Abstract*"}} to verify updated test 
> passes
>  # Run full {{geode-core}} test suite to ensure no regressions
> h2. Expected Outcomes
> h3. Success Criteria
> (/) Zero nondeterminism warnings from ANTLR grammar generation
> (/) All existing query parsing tests pass
> (/) {{AbstractCompiledValueTestJUnitTest}} passes with updated token usage
> (/) No behavior changes to OQL query parsing or execution
> (/) Token numbering is stable and predictable
> h3. Impact Assessment
> *Risk Level:* Low
>  * Changes only affect parser generation, not runtime behavior
>  * Syntactic predicates and greedy options are standard ANTLR features
>  * All test coverage validates correct parsing behavior
> *Performance Impact:* None
>  * Grammar changes only affect compile-time parser generation
>  * No runtime performance impact
> *Compatibility:* Maintained
>  * No changes to OQL syntax or semantics
>  * Fully backward compatible with existing queries
> h2. Files to be Modified
>  # 
> {{geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g}}
>  (Grammar file)
>  # 
> {{geode-core/src/test/java/org/apache/geode/cache/query/internal/AbstractCompiledValueTestJUnitTest.java}}
>  (Test file)
> h2. References
>  * [ANTLR 2 Documentation - Syntactic 
> Predicates|https://www.antlr2.org/doc/predicates.html]
>  * [ANTLR 2 Documentation - Greedy 
> Subrules|https://www.antlr2.org/doc/options.html]
>  * Apache Geode OQL Documentation
> h2. Testing Strategy
> h3. Unit Tests
>  * Verify all query parser tests pass
>  * Verify aggregate function parsing works correctly
>  * Verify distinct keyword handling in aggregate functions
> h3. Integration Tests
>  * Run OQL query tests with aggregate functions
>  * Test queries with projection containing aggregate expressions
>  * Test queries mixing aggregate and non-aggregate projections
> h3. Regression Tests
>  * Run full geode-core test suite
>  * Verify no changes to generated parser behavior
>  * Confirm token numbering stability



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to