[jira] [Commented] (LUCENE-9575) Add PatternTypingFilter

Michael McCandless (Jira) Sun, 24 Jan 2021 11:58:06 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270978#comment-17270978
 ]


Michael McCandless commented on LUCENE-9575:
--------------------------------------------

Hmm {{TestRandomChains}} is angry here:
{noformat}
sis.core.TestRandomChains.txt, copied below:
   >     java.lang.AssertionError: public 
org.apache.lucene.analysis.pattern.PatternTypingFilter(org.apache.lucene.analysis.TokenStream,org.apache.lucene.analysis.pattern.PatternTypi\
ngFilter$PatternTypingRule...) has unsupported parameter types
   >         at __randomizedtesting.SeedInfo.seed([65EA739C95F40313]:0)
   >         at org.junit.Assert.fail(Assert.java:89)
   >         at org.junit.Assert.assertTrue(Assert.java:42)
   >         at 
org.apache.lucene.analysis.core.TestRandomChains.beforeClass(TestRandomChains.java:263)
   >         at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >         at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
   >         at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   >         at java.base/java.lang.reflect.Method.invoke(Method.java:564)
   >         at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
   >         at 
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:882)
   >         at 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
   >         at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at 
org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
   >         at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at 
org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:51)
   >         at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at 
org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
   >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
   >         at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:826)
   >         at java.base/java.lang.Thread.run(Thread.java:832)
  2> NOTE: test params are: codec=Asserting(Lucene90): {}, docValues:{}, 
maxPointsInLeafNode=737, maxMBSortInHeap=7.862736725727331, 
sim=Asserting(RandomSimilarity(queryNorm=true): {\
}), locale=uz-Latn, timezone=America/Chicago
  2> NOTE: Linux 5.9.8-arch1-1 amd64/Oracle Corporation 15.0.1 
(64-bit)/cpus=128,threads=1,free=126051336,total=270532608
  2> NOTE: All tests run in this JVM: [TestCommonGramsQueryFilterFactory, 
TestRandomChains]
  2> NOTE: reproduce with: gradlew test --tests TestRandomChains 
-Dtests.seed=65EA739C95F40313 -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=uz-Latn -Dtests.timezone=Americ\
a/Chicago -Dtests.asserts=true -Dtests.file.encoding=UTF-8{noformat}

> Add PatternTypingFilter
> -----------------------
>
>                 Key: LUCENE-9575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9575
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Gus Heck
>            Assignee: Gus Heck
>            Priority: Major
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> One of the key asks when the Library of Congress was asking me to develop the 
> Advanced Query Parser was to be able to recognize arbitrary patterns that 
> included punctuation such as POW/MIA or 401(k) or C++ etc. Additionally they 
> wanted 401k and 401(k) to match documents with either style reference, and 
> NOT match documents that happen to have isolated 401 or k tokens (i.e. not 
> documents about the http status code) And of course we wanted to give up as 
> little of the text analysis features they were already using.
> This filter in conjunction with the filters from LUCENE-9572, LUCENE-9574 and 
> one solr specific filter in SOLR-14597 that re-analyzes tokens with an 
> arbitrary analyzer defined for a type in the solr schema, combine to achieve 
> this. 
> This filter has the job of spotting the patterns, and adding the intended 
> synonym as at type to the token (from which minimal punctuation has been 
> removed). It also sets flags on the token which are retained through the 
> analysis chain, and at the very end the type is converted to a synonym and 
> the original token(s) for that type are dropped avoiding the match on 401 
> (for example) 
> The pattern matching is specified in a file that looks like: 
> {code}
> 2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
> 2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
> 2 C\+\+ ::: c_plus_plus
> {code}
> That file would match match legal reference patterns such as 401(k), 401k, 
> 501(c)3 and C++ The format is:
> <flagsInt> <pattern> ::: <replacement>
> and groups in the pattern are substituted into the replacement so the first 
> line above would create synonyms such as:
> {code}
> 401k   --> legal2_401_k
> 401(k) --> legal2_401_k
> 503(c) --> legal2_503_c
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-9575) Add PatternTypingFilter

Reply via email to