[jira] [Updated] (LUCENE-4983) CommonGramsFilter assumes all input tokens have a length of 1

Uwe Schindler (Jira) Wed, 05 Jan 2022 08:58:06 -0800


     [ 
https://issues.apache.org/jira/browse/LUCENE-4983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Uwe Schindler updated LUCENE-4983:
----------------------------------
    Labels: random-chains  (was: )

> CommonGramsFilter assumes all input tokens have a length of 1
> -------------------------------------------------------------
>
>                 Key: LUCENE-4983
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4983
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Major
>              Labels: random-chains
>
> CommonGramsFilter set posLenAttribute to 2 for bi-grams, no matter the length 
> of the input tokens. Here is an example seed that produces a failure:
> {noformat}
> [junit4:junit4] <JUnit4> says Привет! Master seed: 3296009A5B3B7A05
> [junit4:junit4] Executing 1 suite with 1 JVM.
> [junit4:junit4] 
> [junit4:junit4] Started J0 PID(23946@RD-38).
> [junit4:junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
> [junit4:junit4]   2> TEST FAIL: useCharFilter=true text='apuqdgtr wjco  mpc '
> [junit4:junit4]   2> Exception from random analyzer: 
> [junit4:junit4]   2> charfilters=
> [junit4:junit4]   2>   
> org.apache.lucene.analysis.pattern.PatternReplaceCharFilter(a, <APOSTROPHE>, 
> java.io.StringReader@699f982d)
> [junit4:junit4]   2>   
> org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(org.apache.lucene.analysis.pattern.PatternReplaceCharFilter@6cbfe887,
>  [<NUM>])
> [junit4:junit4]   2> tokenizer=
> [junit4:junit4]   2>   
> org.apache.lucene.analysis.core.LetterTokenizer(LUCENE_44, 
> org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory@4fb3c3d9,
>  
> org.apache.lucene.analysis.core.TestRandomChains$CheckThatYouDidntReadAnythingReaderWrapper@2b3b2ed8)
> [junit4:junit4]   2> filters=
> [junit4:junit4]   2>   
> org.apache.lucene.analysis.util.ElisionFilter(org.apache.lucene.analysis.ValidatingTokenFilter@0,
>  [iez])
> [junit4:junit4]   2>   
> org.apache.lucene.analysis.MockGraphTokenFilter(java.util.Random@3a807d14, 
> org.apache.lucene.analysis.ValidatingTokenFilter@20)
> [junit4:junit4]   2>   
> org.apache.lucene.analysis.commongrams.CommonGramsFilter(LUCENE_44, 
> org.apache.lucene.analysis.ValidatingTokenFilter@37caea, [bbtzjxco, , 
> jafehvlp, kujsm, znpfw, xqfni])
> [junit4:junit4]   2>   
> org.apache.lucene.analysis.bg.BulgarianStemFilter(org.apache.lucene.analysis.ValidatingTokenFilter@6c1927b)
> [junit4:junit4]   2> offsetsAreCorrect=true
> [junit4:junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TestRandomChains -Dtests.method=testRandomChains 
> -Dtests.seed=3296009A5B3B7A05 -Dtests.multiplier=3 -Dtests.slow=true 
> -Dtests.locale=ar_YE -Dtests.timezone=Europe/London 
> -Dtests.file.encoding=US-ASCII
> [junit4:junit4] ERROR   14.7s | TestRandomChains.testRandomChains <<<
> [junit4:junit4]    > Throwable #1: java.lang.IllegalStateException: stage 3: 
> inconsistent endOffset at pos=2: 13 vs 8; token=￯[㑮ٯb_
> [junit4:junit4]    >  at 
> __randomizedtesting.SeedInfo.seed([3296009A5B3B7A05:F7729FB1C2967C5]:0)
> [junit4:junit4]    >  at 
> org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:135)
> [junit4:junit4]    >  at 
> org.apache.lucene.analysis.bg.BulgarianStemFilter.incrementToken(BulgarianStemFilter.java:48)
> [junit4:junit4]    >  at 
> org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:78)
> [junit4:junit4]    >  at 
> org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:635)
> [junit4:junit4]    >  at 
> org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:546)
> [junit4:junit4]    >  at 
> org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:447)
> [junit4:junit4]    >  at 
> org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:944)
> [junit4:junit4]    >  at java.lang.Thread.run(Thread.java:679)
> [junit4:junit4]   2> NOTE: test params are: codec=Lucene42: 
> {dummy=PostingsFormat(name=Direct)}, docValues:{}, sim=DefaultSimilarity, 
> locale=ar_YE, timezone=Europe/London
> [junit4:junit4]   2> NOTE: Linux 3.5.0-27-generic amd64/Sun Microsystems Inc. 
> 1.6.0_27 (64-bit)/cpus=2,threads=1,free=96085824,total=223412224
> [junit4:junit4]   2> NOTE: All tests run in this JVM: [TestRandomChains]
> [junit4:junit4] Completed in 16.32s, 1 test, 1 error <<< FAILURES!
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Updated] (LUCENE-4983) CommonGramsFilter assumes all input tokens have a length of 1

Reply via email to