[jira] [Commented] (LUCENE-9575) Add PatternTypingFilter

Michael McCandless (Jira) Sun, 24 Jan 2021 05:06:05 -0800


    [ https://issues.apache.org/jira/browse/LUCENE-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270868#comment-17270868 ]


Michael McCandless commented on LUCENE-9575:
--------------------------------------------

Hmm, {{gradle precommit}} is upset with the style violations here:
{noformat}
* What went wrong:
Execution failed for task ':lucene:analysis:common:spotlessJavaCheck'.
> The following files had format violations:
      lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternTypingFilter.java
          @@ -17,22 +17,22 @@


           package·org.apache.lucene.analysis.pattern;


          +import·java.io.IOException;
          +import·java.util.regex.Matcher;
          +import·java.util.regex.Pattern;
           import·org.apache.lucene.analysis.TokenFilter;
           import·org.apache.lucene.analysis.TokenStream;
           import·org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
           import·org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
           import·org.apache.lucene.analysis.tokenattributes.TypeAttribute;


          -import·java.io.IOException;
          -import·java.util.regex.Matcher;
          -import·java.util.regex.Pattern;
          -
           /**
          -·*·Set·a·type·attribute·to·a·parameterized·value·when·tokens·are·matched·by·any·of·a·several·regex·patterns.·The
          -·*·value·set·in·the·type·attribute·is·parameterized·with·the·match·groups·of·the·regex·used·for·matching.
          -·*·In·combination·with·TypeAsSynonymFilter·and·DropIfFlagged·filter·this·can·supply·complex·synonym·patterns
          -·*·that·are·protected·from·subsequent·analysis,·and·optionally·drop·the·original·term·based·on·the·flag
          -·*·set·in·this·filter.·See·{@link·PatternTypingFilterFactory}·for·full·documentation.
          +·*·Set·a·type·attribute·to·a·parameterized·value·when·tokens·are·matched·by·any·of·a·several·regex
          +·*·patterns.·The·value·set·in·the·type·attribute·is·parameterized·with·the·match·groups·of·the·regex
          +·*·used·for·matching.·In·combination·with·TypeAsSynonymFilter·and·DropIfFlagged·filter·this·can
          +·*·supply·complex·synonym·patterns·that·are·protected·from·subsequent·analysis,·and·optionally·drop
          +·*·the·original·term·based·on·the·flag·set·in·this·filter.·See·{@link·PatternTypingFilterFactory}
          +·*·for·full·documentation.
           ·*
           ·*·@see·PatternTypingFilterFactory
           ·*·@since·8.8.0
          @@ -44,7 +44,7 @@
           ··private·final·FlagsAttribute·flagAtt·=·addAttribute(FlagsAttribute.class);
           ··private·final·TypeAttribute·typeAtt·=·addAttribute(TypeAttribute.class);


          -··public·PatternTypingFilter(TokenStream·input,··PatternTypingRule...·replacementAndFlagByPattern)·{
          +··public·PatternTypingFilter(TokenStream·input,·PatternTypingRule...·replacementAndFlagByPattern)·{
           ····super(input);
           ····this.replacementAndFlagByPattern·=·replacementAndFlagByPattern;
           ··}
          @@ -55,7 +55,8 @@
           ······for·(PatternTypingRule·rule·:·replacementAndFlagByPattern)·{
           ········Matcher·matcher·=·rule.getPattern().matcher(termAtt);
           ········if·(matcher.find())·{
          -··········//·allow·2nd·reset()·and·find()·that·occurs·inside·replaceFirst·to·avoid·excess·string·creation
          +··········//·allow·2nd·reset()·and·find()·that·occurs·inside·replaceFirst·to·avoid·excess·string
          +··········//·creation
           ··········typeAtt.setType(matcher.replaceFirst(rule.getTypeTemplate()));
      ... (13 more lines that didn't fit)
  Violations also present in:
      lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternTypingFilterFactory.java
      lucene/analysis/common/src/test/org/apache/lucene/analysis/pattern/TestPatternTypingFilter.java
      lucene/analysis/common/src/test/org/apache/lucene/analysis/pattern/TestPatternTypingFilterFactory.java
  Run './gradlew :lucene:analysis:common:spotlessApply' to fix these violations.
{noformat}
and the changes from {{gradle tidy}} or
{{gradle :lucene:analysis:common:spotlessApply}} look OK to me ... I'll push a
fix soon if nobody beats me to it.

> Add PatternTypingFilter
> -----------------------
>
>                 Key: LUCENE-9575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9575
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Gus Heck
>            Assignee: Gus Heck
>            Priority: Major
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> One of the key asks from the Library of Congress when they asked me to 
> develop the Advanced Query Parser was the ability to recognize arbitrary 
> patterns that include punctuation, such as POW/MIA, 401(k) or C++. 
> Additionally, they wanted 401k and 401(k) to match documents with either 
> style of reference, and NOT match documents that happen to have isolated 401 
> or k tokens (i.e. not documents about the HTTP status code). And of course we 
> wanted to give up as few as possible of the text analysis features they were 
> already using.
> This filter, in conjunction with the filters from LUCENE-9572 and LUCENE-9574 
> and one Solr-specific filter in SOLR-14597 (which re-analyzes tokens with an 
> arbitrary analyzer defined for a type in the Solr schema), achieves this. 
> This filter has the job of spotting the patterns and adding the intended 
> synonym as a type to the token (from which minimal punctuation has been 
> removed). It also sets flags on the token, which are retained through the 
> analysis chain. At the very end, the type is converted to a synonym and the 
> original token(s) for that type are dropped, avoiding the match on 401 (for 
> example). 
> The pattern matching is specified in a file that looks like: 
> {code}
> 2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
> 2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
> 2 C\+\+ ::: c_plus_plus
> {code}
> That file would match legal reference patterns such as 401(k), 401k, 
> 501(c)3 and C++. The format is:
> <flagsInt> <pattern> ::: <replacement>
> and groups in the pattern are substituted into the replacement, so the first 
> line above would create synonyms such as:
> {code}
> 401k   --> legal2_401_k
> 401(k) --> legal2_401_k
> 503(c) --> legal2_503_c
> {code}
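
The rule format quoted above can be tried out with plain java.util.regex. What follows is only a minimal sketch, not the code in PatternTypingFilterFactory: the class RuleLineSketch, its Rule holder, and the parse and typeFor helpers are hypothetical names made up for illustration, and splitting on the first space and on " ::: " is an assumption about how a line could be read. The find() followed by replaceFirst() step mirrors the lines visible in the spotless diff.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Lucene.
final class RuleLineSketch {

  /** Hypothetical holder for one parsed rule line (not Lucene's PatternTypingRule). */
  static final class Rule {
    final int flags;
    final Pattern pattern;
    final String typeTemplate;

    Rule(int flags, Pattern pattern, String typeTemplate) {
      this.flags = flags;
      this.pattern = pattern;
      this.typeTemplate = typeTemplate;
    }
  }

  /** Parses "<flagsInt> <pattern> ::: <replacement>" (assumed splitting rules). */
  static Rule parse(String line) {
    String separator = " ::: ";
    int sep = line.indexOf(separator);
    if (sep < 0) {
      throw new IllegalArgumentException("Missing ' ::: ' separator: " + line);
    }
    String left = line.substring(0, sep);
    String template = line.substring(sep + separator.length()).trim();
    int space = left.indexOf(' ');
    if (space < 0) {
      throw new IllegalArgumentException("Missing flags or pattern: " + line);
    }
    int flags = Integer.parseInt(left.substring(0, space));
    Pattern pattern = Pattern.compile(left.substring(space + 1).trim());
    return new Rule(flags, pattern, template);
  }

  /** Returns the type string for a term, or null when the rule does not match. */
  static String typeFor(Rule rule, CharSequence term) {
    Matcher matcher = rule.pattern.matcher(term);
    // replaceFirst() resets the matcher and substitutes $1, $2, ... from the match groups.
    return matcher.find() ? matcher.replaceFirst(rule.typeTemplate) : null;
  }

  public static void main(String[] args) {
    Rule legal2 = parse("2 (\\d+)\\(?([a-z])\\)? ::: legal2_$1_$2");
    for (String term : new String[] {"401k", "401(k)", "503(c)"}) {
      System.out.println(term + " --> " + typeFor(legal2, term));
    }
  }
}
```

Running main prints the same three mappings listed in the description (401k, 401(k) and 503(c) mapping to legal2_401_k, legal2_401_k and legal2_503_c). The sketch leaves out how the real filter applies the flags integer and how it orders multiple rules, since the excerpt does not show that.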



