[ https://issues.apache.org/jira/browse/LUCENE-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270868#comment-17270868 ]
Michael McCandless commented on LUCENE-9575:
--------------------------------------------

Hmm, {{gradle precommit}} is upset with the style violations here:

{noformat}
* What went wrong:
Execution failed for task ':lucene:analysis:common:spotlessJavaCheck'.
> The following files had format violations:
      lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternTypingFilter.java
          @@ -17,22 +17,22 @@
           package·org.apache.lucene.analysis.pattern;
          +import·java.io.IOException;
          +import·java.util.regex.Matcher;
          +import·java.util.regex.Pattern;
           import·org.apache.lucene.analysis.TokenFilter;
           import·org.apache.lucene.analysis.TokenStream;
           import·org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
           import·org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
           import·org.apache.lucene.analysis.tokenattributes.TypeAttribute;
          -import·java.io.IOException;
          -import·java.util.regex.Matcher;
          -import·java.util.regex.Pattern;
          -
           /**
          -·*·Set·a·type·attribute·to·a·parameterized·value·when·tokens·are·matched·by·any·of·a·several·regex·patterns.·The
          -·*·value·set·in·the·type·attribute·is·parameterized·with·the·match·groups·of·the·regex·used·for·matching.
          -·*·In·combination·with·TypeAsSynonymFilter·and·DropIfFlagged·filter·this·can·supply·complex·synonym·patterns
          -·*·that·are·protected·from·subsequent·analysis,·and·optionally·drop·the·original·term·based·on·the·flag
          -·*·set·in·this·filter.·See·{@link·PatternTypingFilterFactory}·for·full·documentation.
          +·*·Set·a·type·attribute·to·a·parameterized·value·when·tokens·are·matched·by·any·of·a·several·regex
          +·*·patterns.·The·value·set·in·the·type·attribute·is·parameterized·with·the·match·groups·of·the·regex
          +·*·used·for·matching.·In·combination·with·TypeAsSynonymFilter·and·DropIfFlagged·filter·this·can
          +·*·supply·complex·synonym·patterns·that·are·protected·from·subsequent·analysis,·and·optionally·drop
          +·*·the·original·term·based·on·the·flag·set·in·this·filter.·See·{@link·PatternTypingFilterFactory}
          +·*·for·full·documentation.
           ·*
           ·*·@see·PatternTypingFilterFactory
           ·*·@since·8.8.0
          @@ -44,7 +44,7 @@
           ··private·final·FlagsAttribute·flagAtt·=·addAttribute(FlagsAttribute.class);
           ··private·final·TypeAttribute·typeAtt·=·addAttribute(TypeAttribute.class);
          -··public·PatternTypingFilter(TokenStream·input,··PatternTypingRule...·replacementAndFlagByPattern)·{
          +··public·PatternTypingFilter(TokenStream·input,·PatternTypingRule...·replacementAndFlagByPattern)·{
           ····super(input);
           ····this.replacementAndFlagByPattern·=·replacementAndFlagByPattern;
           ··}
          @@ -55,7 +55,8 @@
           ······for·(PatternTypingRule·rule·:·replacementAndFlagByPattern)·{
           ········Matcher·matcher·=·rule.getPattern().matcher(termAtt);
           ········if·(matcher.find())·{
          -··········//·allow·2nd·reset()·and·find()·that·occurs·inside·replaceFirst·to·avoid·excess·string·creation
          +··········//·allow·2nd·reset()·and·find()·that·occurs·inside·replaceFirst·to·avoid·excess·string
          +··········//·creation
           ··········typeAtt.setType(matcher.replaceFirst(rule.getTypeTemplate()));
      ... (13 more lines that didn't fit)
  Violations also present in:
      lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternTypingFilterFactory.java
      lucene/analysis/common/src/test/org/apache/lucene/analysis/pattern/TestPatternTypingFilter.java
      lucene/analysis/common/src/test/org/apache/lucene/analysis/pattern/TestPatternTypingFilterFactory.java
  Run './gradlew :lucene:analysis:common:spotlessApply' to fix these violations.
{noformat}

and the changes from {{gradle tidy}} or {{gradle :lucene:analysis:common:spotlessApply}} look OK to me ... I'll push a fix soon if nobody beats me to it.

> Add PatternTypingFilter
> -----------------------
>
>                 Key: LUCENE-9575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9575
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Gus Heck
>            Assignee: Gus Heck
>            Priority: Major
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> One of the key asks when the Library of Congress was asking me to develop the
> Advanced Query Parser was to be able to recognize arbitrary patterns that
> included punctuation, such as POW/MIA or 401(k) or C++ etc. Additionally, they
> wanted 401k and 401(k) to match documents with either style of reference, and
> NOT match documents that happen to have isolated 401 or k tokens (i.e. not
> documents about the HTTP status code). And of course we wanted to give up as
> few of the text analysis features they were already using as possible.
> This filter, in conjunction with the filters from LUCENE-9572, LUCENE-9574 and
> one Solr-specific filter in SOLR-14597 that re-analyzes tokens with an
> arbitrary analyzer defined for a type in the Solr schema, combine to achieve
> this.
> This filter has the job of spotting the patterns and adding the intended
> synonym as a type to the token (from which minimal punctuation has been
> removed). It also sets flags on the token which are retained through the
> analysis chain, and at the very end the type is converted to a synonym and
> the original token(s) for that type are dropped, avoiding the match on 401
> (for example).
> The pattern matching is specified in a file that looks like:
> {code}
> 2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
> 2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
> 2 C\+\+ ::: c_plus_plus
> {code}
> That file would match legal reference patterns such as 401(k), 401k,
> 501(c)3 and C++. The format is:
> <flagsInt> <pattern> ::: <replacement>
> and groups in the pattern are substituted into the replacement, so the first
> line above would create synonyms such as:
> {code}
> 401k   --> legal2_401_k
> 401(k) --> legal2_401_k
> 503(c) --> legal2_503_c
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
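
The rule-file format described in the issue (<flagsInt> <pattern> ::: <replacement>) can be illustrated with a small stand-alone sketch that uses only java.util.regex and no Lucene classes. The class PatternTypingRuleSketch, its nested Rule holder, and the methods parse, exampleRules and typeFor are hypothetical names chosen for this illustration; they are not the Lucene API. The sketch also assumes that the first rule whose pattern is found in the term wins, which the issue text does not state. It mirrors the find() followed by replaceFirst() call sequence visible in the diff above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a stand-alone sketch, not the Lucene PatternTypingFilter. */
public class PatternTypingRuleSketch {

  /** Hypothetical holder for one parsed rule line. */
  public static final class Rule {
    public final int flags;
    public final Pattern pattern;
    public final String typeTemplate;

    Rule(int flags, Pattern pattern, String typeTemplate) {
      this.flags = flags;
      this.pattern = pattern;
      this.typeTemplate = typeTemplate;
    }
  }

  /** Parses lines of the form "<flagsInt> <pattern> ::: <replacement>". */
  public static List<Rule> parse(List<String> lines) {
    List<Rule> rules = new ArrayList<>();
    for (String line : lines) {
      int firstSpace = line.indexOf(' ');
      int sep = line.indexOf(" ::: ");
      if (firstSpace < 0 || sep <= firstSpace) {
        throw new IllegalArgumentException("Bad rule line: " + line);
      }
      int flags = Integer.parseInt(line.substring(0, firstSpace));
      String regex = line.substring(firstSpace + 1, sep);
      String template = line.substring(sep + " ::: ".length());
      rules.add(new Rule(flags, Pattern.compile(regex), template));
    }
    return rules;
  }

  /** The three example rules quoted in the issue description. */
  public static List<Rule> exampleRules() {
    return parse(Arrays.asList(
        "2 (\\d+)\\(?([a-z])\\)? ::: legal2_$1_$2",
        "2 (\\d+)\\(?([a-z])\\)?\\(?(\\d+)\\)? ::: legal3_$1_$2_$3",
        "2 C\\+\\+ ::: c_plus_plus"));
  }

  /**
   * Returns the type string for the first rule whose pattern is found in the
   * term, or null if no rule matches. First-match-wins is an assumption here.
   */
  public static String typeFor(List<Rule> rules, CharSequence term) {
    for (Rule rule : rules) {
      Matcher matcher = rule.pattern.matcher(term);
      if (matcher.find()) {
        // replaceFirst() resets the matcher and substitutes $1, $2, ...
        return matcher.replaceFirst(rule.typeTemplate);
      }
    }
    return null;
  }

  public static void main(String[] args) {
    List<Rule> rules = exampleRules();
    for (String term : new String[] {"401k", "401(k)", "503(c)", "C++"}) {
      System.out.println(term + " --> " + typeFor(rules, term));
    }
  }
}
```

In this sketch a rule applies as soon as find() locates the pattern anywhere in the term, so rule order matters: a more specific pattern listed after a more general one is never reached for terms the general one already matches. Whether the real filter behaves the same way should be checked against PatternTypingFilterFactory's documentation.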