[ 
https://issues.apache.org/jira/browse/LUCENE-10361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-10361:
-----------------------------------
    Labels: random-chains  (was: )

> KoreanNumberFilter messes up offsets
> ------------------------------------
>
>                 Key: LUCENE-10361
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10361
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Priority: Major
>              Labels: random-chains
>
> KoreanNumberFilter is a TokenFilter that tries to change offsets, so of
> course TestRandomChains finds bugs in it:
> {noformat}
> NOTE: reproduce with: gradlew test --tests TestRandomChains.testRandomChains -Dtests.seed=12BC606B774693E4 -Dtests.nightly=true -Dtests.slow=true -Dtests.locale=om-Latn-ET -Dtests.timezone=Australia/Yancowinna -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {noformat}
> {noformat}
> org.apache.lucene.analysis.tests.TestRandomChains > test suite's output saved to /home/rmuir/workspace/lucene/lucene/analysis/integration.tests/build/test-results/test_16/outputs/OUTPUT-org.apache.lucene.analysis.tests.TestRandomChains.txt, copied below:
>   2> stage 0: 뱅<[0-1] +1> Ƒ<[1-2] +1> ė<[3-4] +1> 履<[6-7] +1> jEqyzUT<[8-15] +1>
>   2> stage 1: 000000<[0-1] +1> Ƒ<[1-2] +1> ė<[3-4] +1> 000000<[6-7] +1> 154300<[8-15] +1> 454300<[8-15] +0>
>   2> last stage: 0<[0-1] +1> Ƒ<[1-2] +1> ė<[3-4] +1> 000000<[6-7] +1> 454300<[8-15] +0>
>   2> TEST FAIL: useCharFilter=false text='\ubc45\u0191(\u0117\ud8ad\udf0a\uf9df jEqyzUT '
>   2> Exception from random analyzer:
>   2> charfilters=
>   2>   org.apache.lucene.analysis.cjk.CJKWidthCharFilter(java.io.StringReader@17af5384)
>   2>   org.apache.lucene.analysis.charfilter.MappingCharFilter(org.apache.lucene.analysis.charfilter.NormalizeCharMap@33e5bdbb, org.apache.lucene.analysis.cjk.CJKWidthCharFilter@1aafd271)
>   2> tokenizer=
>   2>   org.apache.lucene.analysis.icu.segmentation.ICUTokenizer(org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig@4e6f4690)
>   2> filters=
>   2>   Conditional:org.apache.lucene.analysis.phonetic.DaitchMokotoffSoundexFilter(OneTimeWrapper@34215eb7 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,script=Common, false)
>   2>   org.apache.lucene.analysis.ko.KoreanNumberFilter(ValidatingTokenFilter@7b4a2a5b term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,script=Common,keyword=false)
>    >     java.lang.IllegalStateException: last stage: inconsistent startOffset at pos=3: 6 vs 8; token=454300
>    >         at __randomizedtesting.SeedInfo.seed([12BC606B774693E4:2F5D490A30548E24]:0)
>    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:138)
>    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:1130)
>    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:1028)
>    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:922)
>    >         at org.apache.lucene.analysis.tests@10.0.0-SNAPSHOT/org.apache.lucene.analysis.tests.TestRandomChains.testRandomChains(TestRandomChains.java:915)
> {noformat}
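
For readers unfamiliar with the failure message, the following is a rough, dependency-free sketch of the kind of check that reports "inconsistent startOffset". It assumes the rule is that every token landing on the same position must share one start offset. The class OffsetConsistencyCheck, its nested Token type, and the method firstViolation are hypothetical names made up for this illustration; this is not Lucene's ValidatingTokenFilter code, only a simplified model of the start-offset rule fed with the "last stage" tokens from the log above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OffsetConsistencyCheck {

  /** Hypothetical stand-in for a token: term text, offsets, position increment. */
  public static final class Token {
    final String term;
    final int startOffset;
    final int endOffset;
    final int posInc;

    public Token(String term, int startOffset, int endOffset, int posInc) {
      this.term = term;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
      this.posInc = posInc;
    }
  }

  /** Returns a description of the first start-offset conflict, or null if there is none. */
  public static String firstViolation(List<Token> tokens) {
    // Remembers the start offset first seen at each position.
    Map<Integer, Integer> posToStartOffset = new HashMap<>();
    int pos = -1; // the first token's increment of 1 moves this to position 0
    for (Token t : tokens) {
      pos += t.posInc;
      Integer previous = posToStartOffset.putIfAbsent(pos, t.startOffset);
      if (previous != null && previous != t.startOffset) {
        return "inconsistent startOffset at pos=" + pos + ": " + previous
            + " vs " + t.startOffset + "; token=" + t.term;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // Tokens copied from the "last stage" line of the log above.
    List<Token> lastStage = List.of(
        new Token("0", 0, 1, 1),
        new Token("\u0191", 1, 2, 1),
        new Token("\u0117", 3, 4, 1),
        new Token("000000", 6, 7, 1),
        new Token("454300", 8, 15, 0)); // posInc 0: stacked on the previous token
    // prints: inconsistent startOffset at pos=3: 6 vs 8; token=454300
    System.out.println(firstViolation(lastStage));
  }
}
```

In this model the last token has a position increment of 0, so it sits on position 3 together with 000000, whose start offset is 6 rather than 8. The real test-framework filter likely performs further checks (for example on end offsets) that this sketch omits.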


