[jira] [Updated] (LUCENE-10363) JapaneseCompletionFilter messes up offsets

Uwe Schindler (Jira) Wed, 05 Jan 2022 08:53:04 -0800


     [ https://issues.apache.org/jira/browse/LUCENE-10363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Uwe Schindler updated LUCENE-10363:
-----------------------------------
    Labels: random-chains  (was: )

> JapaneseCompletionFilter messes up offsets
> ------------------------------------------
>
>                 Key: LUCENE-10363
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10363
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Priority: Major
>              Labels: random-chains
>
> JapaneseCompletionFilter is a TokenFilter that tries to change offsets, so 
> of course TestRandomChains finds bugs in it:
> {noformat}
> NOTE: reproduce with: gradlew test --tests TestRandomChains.testRandomChainsWithLargeStrings -Dtests.seed=E233A5FAC016E02 -Dtests.nightly=true -Dtests.slow=true -Dtests.locale=en-TV -Dtests.timezone=Asia/Saigon -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {noformat}
> {noformat}
> org.apache.lucene.analysis.tests.TestRandomChains > test suite's output saved 
> to 
> /home/rmuir/workspace/lucene/lucene/analysis/integration.tests/build/test-results/test_54/outputs/OUTPUT-org.apache.lucene.analysis.tests.TestRandomChains.txt,
>  copied below:
>   2> stage 0: lk<[1-3] +1> p<[6-7] +1> ngtoixtmldzsjz<[10-24] +1> uoq<[25-28] 
> +1> HANGUL<[28-28] +1> o<[29-30] +1> HANGUL<[31-31] +1> VulliPHsZzn<[32-43] 
> +1>
>   2> stage 1: lk<[1-3] +1> 850000<[1-3] +0> p<[6-7] +1> 700000<[6-7] +0> 
> ngtoixtmldzsjz<[10-24] +1> 653543<[10-24] +0> uoq<[25-28] +1> 050000<[25-28] 
> +0> HANGUL<[28-28] +1> 565800<[28-28] +0> o<[29-30] +1> 000000<[29-30] +0> 
> HANGUL<[31-31] +1> 565800<[31-31] +0> VulliPHsZzn<[32-43] +1> 787460<[32-43] 
> +0>
>   2> stage 2: ngtoixtmldzsjz 653543<[10-24] +0> 653543<[10-24] +1> 653543 
> uoq<[10-28] +0> uoq<[25-28] +1> uoq 050000<[25-28] +0> 050000<[25-28] +1> 
> 050000 HANGUL<[25-28] +0> HANGUL<[28-28] +1> HANGUL 565800<[28-28] +0> 
> 565800<[28-28] +1> 565800 o<[28-30] +0> o<[29-30] +1> o 000000<[29-30] +0> 
> 000000<[29-30] +1> 000000 HANGUL<[29-31] +0> HANGUL<[31-31] +1> HANGUL 
> 565800<[31-31] +0> 565800<[31-31] +1> 565800 VulliPHsZzn<[31-43] +0> 
> VulliPHsZzn<[32-43] +1>
>   2> last stage: ngtoixtmldzsjz<[10-24] +1> ngtoixtmldzsjz 653543<[10-24] +0> 
> 653543<[10-24] +1> 653543 uoq<[10-28] +0> uoq<[25-28] +1> uoq 050000<[25-28] 
> +1> 050000<[25-28] +1> 050000 HANGUL<[25-28] +1> HANGUL<[28-28] +1> HANGUL 
> 565800<[28-28] +0> 565800<[28-28] +1> 565800 o<[28-30] +0> o<[29-30] +1> o 
> 000000<[29-30] +0> 000000<[29-30] +1> 000000 HANGUL<[29-31] +0> 
> HANGUL<[31-31] +1> HANGUL 565800<[31-31] +1> 565800<[31-31] +1> 565800 
> VulliPHsZzn<[31-43] +0>
>   2> TEST FAIL: useCharFilter=true text='[lk[-.p|) ngtoixtmldzsjz uoqao 
> aVulliPHsZzn wxsk'
>   2> Exception from random analyzer:
>   2> charfilters=
>   2>   org.apache.lucene.analysis.pattern.PatternReplaceCharFilter(a, 
> <HANGUL>, java.io.StringReader@5b3b54eb)
>   2> tokenizer=
>   2>   
> org.apache.lucene.analysis.classic.ClassicTokenizer(org.apache.lucene.util.AttributeFactory$1@e29311e9)
>   2> filters=
>   2>   
> org.apache.lucene.analysis.phonetic.DaitchMokotoffSoundexFilter(ValidatingTokenFilter@32a6de77
>  
> term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,
>  true)
>   2>   
> org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@3d044414
>  
> term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,
>  q)
>   2>   
> Conditional:org.apache.lucene.analysis.ja.JapaneseCompletionFilter(OneTimeWrapper@435207ec
>  
> term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,reading=null,reading
>  (en)=null,pronunciation=null,pronunciation (en)=null, INDEX)
>    >     java.lang.IllegalStateException: last stage: inconsistent endOffset 
> at pos=19: 31 vs 43; token=565800 VulliPHsZzn
> {noformat}
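
The "inconsistent endOffset" failure above comes from a graph-consistency
check on the token stream: every token that ends at the same position should
report the same endOffset (and likewise every token starting at a position
should share a startOffset). The following is a minimal, Lucene-free sketch of
that kind of check. The class OffsetConsistencyCheck, the Token holder, and the
two sample streams are hypothetical and exist only for illustration; the
position lengths are assumed, since the dump above does not print them, so this
does not reproduce the real failure at pos=19.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OffsetConsistencyCheck {

    /** Hypothetical stand-in for the attributes a Lucene token carries. */
    static final class Token {
        final String term;
        final int startOffset, endOffset, posInc, posLen;

        Token(String term, int startOffset, int endOffset, int posInc, int posLen) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Returns a description of the first inconsistency, or null if none is found. */
    static String findInconsistency(List<Token> tokens) {
        Map<Integer, Integer> posToStartOffset = new HashMap<>();
        Map<Integer, Integer> posToEndOffset = new HashMap<>();
        int pos = -1; // positions are accumulated from the position increments
        for (Token t : tokens) {
            pos += t.posInc;
            int endPos = pos + t.posLen;

            // All tokens that start at the same position must share a startOffset.
            Integer oldStart = posToStartOffset.putIfAbsent(pos, t.startOffset);
            if (oldStart != null && oldStart != t.startOffset) {
                return "inconsistent startOffset at pos=" + pos + ": " + oldStart
                        + " vs " + t.startOffset + "; token=" + t.term;
            }

            // All tokens that end at the same position must share an endOffset.
            Integer oldEnd = posToEndOffset.putIfAbsent(endPos, t.endOffset);
            if (oldEnd != null && oldEnd != t.endOffset) {
                return "inconsistent endOffset at pos=" + endPos + ": " + oldEnd
                        + " vs " + t.endOffset + "; token=" + t.term;
            }
        }
        return null;
    }

    /** Unigram, two-word shingle (posLen=2), unigram: the offsets line up. */
    static List<Token> consistentSample() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("565800", 31, 31, 1, 1));
        tokens.add(new Token("565800 VulliPHsZzn", 31, 43, 0, 2));
        tokens.add(new Token("VulliPHsZzn", 32, 43, 1, 1));
        return tokens;
    }

    /** Same stream, but the shingle claims posLen=1 while keeping endOffset=43. */
    static List<Token> brokenSample() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("565800", 31, 31, 1, 1));
        tokens.add(new Token("565800 VulliPHsZzn", 31, 43, 0, 1));
        tokens.add(new Token("VulliPHsZzn", 32, 43, 1, 1));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(findInconsistency(consistentSample()));
        System.out.println(findInconsistency(brokenSample()));
    }
}
```

Running main prints null for the first stream and "inconsistent endOffset at
pos=1: 31 vs 43; token=565800 VulliPHsZzn" for the second. A filter that
rewrites offsets or position increments on only some of the tokens in a chain
can trip the same kind of check.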



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

