[ https://issues.apache.org/jira/browse/LUCENE-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270978#comment-17270978 ]
Michael McCandless commented on LUCENE-9575:
--------------------------------------------

Hmm {{TestRandomChains}} is angry here:

{noformat}
sis.core.TestRandomChains.txt, copied below:
   > java.lang.AssertionError: public org.apache.lucene.analysis.pattern.PatternTypingFilter(org.apache.lucene.analysis.TokenStream,org.apache.lucene.analysis.pattern.PatternTypingFilter$PatternTypingRule...) has unsupported parameter types
   >     at __randomizedtesting.SeedInfo.seed([65EA739C95F40313]:0)
   >     at org.junit.Assert.fail(Assert.java:89)
   >     at org.junit.Assert.assertTrue(Assert.java:42)
   >     at org.apache.lucene.analysis.core.TestRandomChains.beforeClass(TestRandomChains.java:263)
   >     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
   >     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   >     at java.base/java.lang.reflect.Method.invoke(Method.java:564)
   >     at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
   >     at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:882)
   >     at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
   >     at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >     at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >     at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
   >     at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >     at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >     at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >     at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >     at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:51)
   >     at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >     at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >     at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
   >     at org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >     at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >     at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
   >     at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:826)
   >     at java.base/java.lang.Thread.run(Thread.java:832)
  2> NOTE: test params are: codec=Asserting(Lucene90): {}, docValues:{}, maxPointsInLeafNode=737, maxMBSortInHeap=7.862736725727331, sim=Asserting(RandomSimilarity(queryNorm=true): {}), locale=uz-Latn, timezone=America/Chicago
  2> NOTE: Linux 5.9.8-arch1-1 amd64/Oracle Corporation 15.0.1 (64-bit)/cpus=128,threads=1,free=126051336,total=270532608
  2> NOTE: All tests run in this JVM: [TestCommonGramsQueryFilterFactory, TestRandomChains]
  2> NOTE: reproduce with: gradlew test --tests TestRandomChains -Dtests.seed=65EA739C95F40313 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=uz-Latn -Dtests.timezone=America/Chicago -Dtests.asserts=true -Dtests.file.encoding=UTF-8
{noformat}

> Add PatternTypingFilter
> -----------------------
>
>                 Key: LUCENE-9575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9575
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Gus Heck
>            Assignee: Gus Heck
>            Priority: Major
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> One of the key asks when the Library of Congress was asking me to
> develop the Advanced Query Parser was to be able to recognize arbitrary
> patterns that included punctuation, such as POW/MIA, 401(k), or C++.
> Additionally, they wanted 401k and 401(k) to match documents with either
> style of reference, and NOT match documents that happen to have isolated
> 401 or k tokens (i.e. not documents about the HTTP status code). And of
> course we wanted to give up as little as possible of the text analysis
> features they were already using.
> This filter, in conjunction with the filters from LUCENE-9572 and
> LUCENE-9574 and one Solr-specific filter in SOLR-14597 that re-analyzes
> tokens with an arbitrary analyzer defined for a type in the Solr schema,
> achieves this.
> This filter has the job of spotting the patterns and adding the intended
> synonym as a type to the token (from which minimal punctuation has been
> removed). It also sets flags on the token, which are retained through the
> analysis chain. At the very end, the type is converted to a synonym and
> the original token(s) for that type are dropped, avoiding the match on 401
> (for example).
> The pattern matching is specified in a file that looks like:
> {code}
> 2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
> 2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
> 2 C\+\+ ::: c_plus_plus
> {code}
> That file would match legal reference patterns such as 401(k), 401k, and
> 501(c)3, as well as C++. The format is:
> <flagsInt> <pattern> ::: <replacement>
> and groups in the pattern are substituted into the replacement, so the
> first line above would create synonyms such as:
> {code}
> 401k   --> legal2_401_k
> 401(k) --> legal2_401_k
> 503(c) --> legal2_503_c
> {code}
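
To make the rule-file format in the issue description concrete, here is a minimal sketch of how one "<flagsInt> <pattern> ::: <replacement>" line could be parsed and applied with plain java.util.regex. It is not the PatternTypingFilter implementation: the class PatternRuleSketch, its nested Rule class, and the methods parseLine and typeFor are hypothetical names invented for this illustration. The sketch also assumes that a rule must match the whole token and that the first matching rule wins; the description does not state either.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Lucene PatternTypingFilter code.
final class PatternRuleSketch {

  /** One parsed line of the rule file: <flagsInt> <pattern> ::: <replacement> */
  static final class Rule {
    final int flags;
    final Pattern pattern;
    final String replacement;

    Rule(int flags, Pattern pattern, String replacement) {
      this.flags = flags;
      this.pattern = pattern;
      this.replacement = replacement;
    }
  }

  /** Splits a line on " ::: ", then splits the left side on its first space. */
  static Rule parseLine(String line) {
    int sep = line.indexOf(" ::: ");
    if (sep < 0) {
      throw new IllegalArgumentException("Missing ' ::: ' separator: " + line);
    }
    String left = line.substring(0, sep).trim();
    String replacement = line.substring(sep + " ::: ".length()).trim();
    int space = left.indexOf(' ');
    if (space < 0) {
      throw new IllegalArgumentException("Expected '<flagsInt> <pattern>': " + left);
    }
    int flags = Integer.parseInt(left.substring(0, space));
    Pattern pattern = Pattern.compile(left.substring(space + 1).trim());
    return new Rule(flags, pattern, replacement);
  }

  static List<Rule> parse(List<String> lines) {
    List<Rule> rules = new ArrayList<>();
    for (String line : lines) {
      rules.add(parseLine(line));
    }
    return rules;
  }

  /**
   * Returns the type produced by the first rule whose pattern matches the
   * whole token (an assumption of this sketch), or empty if no rule matches.
   */
  static Optional<String> typeFor(List<Rule> rules, String token) {
    for (Rule rule : rules) {
      Matcher m = rule.pattern.matcher(token);
      if (m.matches()) {
        // Expand $1, $2, ... in the replacement from the captured groups.
        StringBuilder sb = new StringBuilder();
        m.appendReplacement(sb, rule.replacement);
        return Optional.of(sb.toString());
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<Rule> rules = parse(List.of(
        "2 (\\d+)\\(?([a-z])\\)? ::: legal2_$1_$2",
        "2 (\\d+)\\(?([a-z])\\)?\\(?(\\d+)\\)? ::: legal3_$1_$2_$3",
        "2 C\\+\\+ ::: c_plus_plus"));
    for (String token : List.of("401k", "401(k)", "501(c)3", "C++", "401")) {
      System.out.println(token + " --> " + typeFor(rules, token).orElse("(no match)"));
    }
  }
}
```

Under these assumptions, 401k and 401(k) both map to legal2_401_k, while a bare 401 matches no rule and is left untyped, which is the behaviour the description is after. The flags value is parsed but not otherwise used here; in the real filter it is set on the token.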