[ https://issues.apache.org/jira/browse/LUCENE-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271184#comment-17271184 ]
Gus Heck commented on LUCENE-9575:
----------------------------------

Ah, thanks, though I was waiting on tests in GitHub for [https://github.com/apache/lucene-solr/pull/2240]

> Add PatternTypingFilter
> -----------------------
>
>                 Key: LUCENE-9575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9575
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Gus Heck
>            Assignee: Gus Heck
>            Priority: Major
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> When the Library of Congress asked me to develop the Advanced Query Parser,
> one of their key asks was to recognize arbitrary patterns that include
> punctuation, such as POW/MIA, 401(k) or C++. Additionally, they wanted 401k
> and 401(k) to match documents with either style of reference, and NOT to
> match documents that happen to have isolated 401 or k tokens (i.e. not
> documents about the HTTP status code). And of course we wanted to give up as
> few as possible of the text analysis features they were already using.
> This filter achieves this in combination with the filters from LUCENE-9572
> and LUCENE-9574, plus one Solr-specific filter in SOLR-14597 that re-analyzes
> tokens with an arbitrary analyzer defined for a type in the Solr schema.
> This filter has the job of spotting the patterns and adding the intended
> synonym as a type to the token (from which minimal punctuation has been
> removed). It also sets flags on the token, which are retained through the
> analysis chain. At the very end the type is converted to a synonym and the
> original token(s) for that type are dropped, avoiding the match on 401 (for
> example).
> The pattern matching is specified in a file that looks like:
> {code}
> 2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
> 2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
> 2 C\+\+ ::: c_plus_plus
> {code}
> That file would match legal reference patterns such as 401(k), 401k,
> 501(c)3 and C++. The format is:
> <flagsInt> <pattern> ::: <replacement>
> and groups in the pattern are substituted into the replacement, so the first
> line above would create synonyms such as:
> {code}
> 401k --> legal2_401_k
> 401(k) --> legal2_401_k
> 503(c) --> legal2_503_c
> {code}
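
The rule format described above (<flagsInt> <pattern> ::: <replacement>) can be illustrated with a small standalone sketch. This is not the PatternTypingFilter code from the pull request; it uses only java.util.regex. The class name PatternTypingSketch, its methods, the whole-token match and the "first matching rule wins" ordering are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; NOT the Lucene PatternTypingFilter implementation. */
final class PatternTypingSketch {

  /** One parsed rule line: <flagsInt> <pattern> ::: <replacement>. */
  static final class Rule {
    final int flags;          // integer flags to set on a matching token
    final Pattern pattern;    // regex the token is tested against
    final String replacement; // type template; $1, $2... refer to regex groups

    Rule(int flags, Pattern pattern, String replacement) {
      this.flags = flags;
      this.pattern = pattern;
      this.replacement = replacement;
    }
  }

  private static final String SEPARATOR = " ::: ";

  /** Parses a single rule line; throws IllegalArgumentException if it is malformed. */
  static Rule parseRule(String line) {
    String trimmed = line.trim();
    int firstSpace = trimmed.indexOf(' ');
    int sep = trimmed.indexOf(SEPARATOR, firstSpace);
    if (firstSpace < 0 || sep <= firstSpace) {
      throw new IllegalArgumentException("Malformed rule: " + line);
    }
    int flags = Integer.parseInt(trimmed.substring(0, firstSpace));
    String regex = trimmed.substring(firstSpace + 1, sep);
    String replacement = trimmed.substring(sep + SEPARATOR.length());
    return new Rule(flags, Pattern.compile(regex), replacement);
  }

  /** Parses every line of a rules file that has already been read into memory. */
  static List<Rule> parseRules(List<String> lines) {
    List<Rule> rules = new ArrayList<>();
    for (String line : lines) {
      rules.add(parseRule(line));
    }
    return rules;
  }

  /**
   * Returns the type for a token, or null if no rule applies.
   * Assumptions: the pattern must match the whole token, and the first
   * matching rule wins.
   */
  static String typeFor(List<Rule> rules, String token) {
    for (Rule rule : rules) {
      Matcher m = rule.pattern.matcher(token);
      if (m.matches()) {
        // Expand $1, $2... in the replacement using the groups of this match.
        StringBuffer type = new StringBuffer();
        m.appendReplacement(type, rule.replacement);
        return type.toString();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    List<Rule> rules = parseRules(Arrays.asList(
        "2 (\\d+)\\(?([a-z])\\)? ::: legal2_$1_$2",
        "2 (\\d+)\\(?([a-z])\\)?\\(?(\\d+)\\)? ::: legal3_$1_$2_$3",
        "2 C\\+\\+ ::: c_plus_plus"));
    for (String token : new String[] {"401k", "401(k)", "503(c)", "501(c)3", "C++", "401"}) {
      System.out.println(token + " --> " + typeFor(rules, token));
    }
  }
}
```

Under these assumptions the sketch reproduces the mappings listed in the description (401k and 401(k) both become legal2_401_k), while an isolated 401 matches no rule and gets no type. How the real filter attaches the type and flags to a token inside an analysis chain is not shown here.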