[ 
https://issues.apache.org/jira/browse/LUCENE-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-10360:
-----------------------------------
    Labels: random-chains  (was: )

> BeiderMorseFilter: TestRandomChains fails with IndexOutOfBounds on empty term text
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-10360
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10360
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Uwe Schindler
>            Priority: Major
>              Labels: random-chains
>
> Error seen:
> {noformat}
>   2> TEST FAIL: useCharFilter=true text='Uf?F ?wlu{0 <!--'a'
>   2> Exception from random analyzer:
>   2> charfilters=
>   2> tokenizer=
>   2>   org.apache.lucene.analysis.ja.JapaneseTokenizer(org.apache.lucene.util.AttributeFactory$1@4c00d592, null, false, true, NORMAL)
>   2> filters=
>   2>   Conditional:org.apache.lucene.analysis.pt.PortugueseLightStemFilter(OneTimeWrapper@3fad923e term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,baseForm=null,partOfSpeech=null,partOfSpeech (en)=null,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false)
>   2>   org.apache.lucene.analysis.phonetic.BeiderMorseFilter(ValidatingTokenFilter@43fbbeb0 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,baseForm=null,partOfSpeech=null,partOfSpeech (en)=null,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false, org.apache.commons.codec.language.bm.PhoneticEngine@631e916d)
>   2>   Conditional:org.apache.lucene.analysis.synonym.SynonymGraphFilter(OneTimeWrapper@77051976 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,baseForm=null,partOfSpeech=null,partOfSpeech (en)=null,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false, org.apache.lucene.analysis.synonym.SynonymMap@69152718, true)
>    >     java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 0
>    >         at __randomizedtesting.SeedInfo.seed([1E22B4EE8663AE48:23C39D8FC171B388]:0)
>    >         at org.apache.commons.codec@1.13/org.apache.commons.codec.language.bm.PhoneticEngine.encode(PhoneticEngine.java:433)
>    >         at org.apache.commons.codec@1.13/org.apache.commons.codec.language.bm.PhoneticEngine.encode(PhoneticEngine.java:384)
>    >         at org.apache.lucene.analysis.phonetic@10.0.0-SNAPSHOT/org.apache.lucene.analysis.phonetic.BeiderMorseFilter.incrementToken(BeiderMorseFilter.java:96)
> {noformat}
> The issue occurs when both of the following hold:
> - PhoneticEngine uses NameType=SEPHARDIC
> - The term is empty, or becomes empty after the cleanup done by encode()
> (whitespace and dashes removed)
> The problem is that the encoder calls String.split() and assumes the
> resulting array always has length >= 1.
> A test for this is easy to write, but the bug has to be reported upstream
> (commons-codec).
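
The failure mode described above can be reproduced without commons-codec. The following is a minimal sketch, assuming the encoder takes the last element of a String.split() result without checking its length. The class name, the method names, and the choice of ' as separator are hypothetical and only for illustration; they are not taken from PhoneticEngine or Lucene.

```java
// Sketch only: SplitGuardSketch, lastPartUnsafe and lastPartSafe are
// hypothetical names, not part of Lucene or commons-codec.
final class SplitGuardSketch {

    // Mirrors the failing pattern: index the last element without a length check.
    static String lastPartUnsafe(String word, String separatorRegex) {
        String[] parts = word.split(separatorRegex);
        return parts[parts.length - 1]; // throws if parts.length == 0
    }

    // Defensive variant: treat a zero-length result as "no usable part".
    static String lastPartSafe(String word, String separatorRegex) {
        String[] parts = word.split(separatorRegex);
        return parts.length == 0 ? "" : parts[parts.length - 1];
    }

    public static void main(String[] args) {
        // An input made only of separators: split() drops trailing empty
        // strings, so the resulting array has length 0.
        System.out.println("'".split("'").length);
        System.out.println(lastPartSafe("'", "'").isEmpty());
        try {
            lastPartUnsafe("'", "'");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unsafe variant threw: " + e.getMessage());
        }
    }
}
```

Note that String.split() returns a zero-length array when the input consists only of separators, because trailing empty strings are dropped, whereas a completely empty input yields a one-element array. A regression test on the Lucene side would presumably want to cover both kinds of input.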


