[ https://issues.apache.org/jira/browse/LUCENE-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-10360:
-----------------------------------
    Labels: random-chains  (was: )

> BeiderMorseFilter: TestRandomChains fails with IndexOutOfBounds on empty term text
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-10360
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10360
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Uwe Schindler
>            Priority: Major
>              Labels: random-chains
>
> Error seen:
> {noformat}
>   2> TEST FAIL: useCharFilter=true text='Uf?F ?wlu{0 <!--'a'
>   2> Exception from random analyzer:
>   2> charfilters=
>   2> tokenizer=
>   2>   org.apache.lucene.analysis.ja.JapaneseTokenizer(org.apache.lucene.util.AttributeFactory$1@4c00d592, null, false, true, NORMAL)
>   2> filters=
>   2>   Conditional:org.apache.lucene.analysis.pt.PortugueseLightStemFilter(OneTimeWrapper@3fad923e term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,baseForm=null,partOfSpeech=null,partOfSpeech (en)=null,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false)
>   2>   org.apache.lucene.analysis.phonetic.BeiderMorseFilter(ValidatingTokenFilter@43fbbeb0 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,baseForm=null,partOfSpeech=null,partOfSpeech (en)=null,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false, org.apache.commons.codec.language.bm.PhoneticEngine@631e916d)
>   2>   Conditional:org.apache.lucene.analysis.synonym.SynonymGraphFilter(OneTimeWrapper@77051976 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,baseForm=null,partOfSpeech=null,partOfSpeech (en)=null,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false, org.apache.lucene.analysis.synonym.SynonymMap@69152718, true)
>    > java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 0
>    >     at __randomizedtesting.SeedInfo.seed([1E22B4EE8663AE48:23C39D8FC171B388]:0)
>    >     at org.apache.commons.codec@1.13/org.apache.commons.codec.language.bm.PhoneticEngine.encode(PhoneticEngine.java:433)
>    >     at org.apache.commons.codec@1.13/org.apache.commons.codec.language.bm.PhoneticEngine.encode(PhoneticEngine.java:384)
>    >     at org.apache.lucene.analysis.phonetic@10.0.0-SNAPSHOT/org.apache.lucene.analysis.phonetic.BeiderMorseFilter.incrementToken(BeiderMorseFilter.java:96)
> {noformat}
> Actually the issue happens if:
> - PhoneticEngine uses NameType=SEPHARDIC
> - The term is empty or the cleanup done by the encode is empty (whitespace and dashes removed)
> The problem is that the encoder calls String.split() and assumes the array always has size>=1.
> You can write an easy test, but the bug has to be reported upstream.
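
The split() pitfall named in the description can be reproduced with the JDK alone. The following is a minimal sketch, not the commons-codec code: `SplitGuard` and `lastPart` are hypothetical names chosen for illustration, and the delimiters used here are assumptions rather than the ones PhoneticEngine actually splits on. It only shows that `String.split()` can return a zero-length array (trailing empty strings are dropped), so indexing `parts[parts.length - 1]` without a length check can fail with index -1.

```java
// Hypothetical helper for illustration; not part of Lucene or commons-codec.
final class SplitGuard {

  /**
   * Returns the last piece of word after splitting on delimiterRegex,
   * or the empty string if split() produced no pieces at all.
   */
  static String lastPart(String word, String delimiterRegex) {
    String[] parts = word.split(delimiterRegex);
    // split() removes trailing empty strings, so an input that consists
    // only of delimiters yields a zero-length array. Without this check,
    // parts[parts.length - 1] would be parts[-1].
    return parts.length == 0 ? "" : parts[parts.length - 1];
  }

  public static void main(String[] args) {
    System.out.println("'".split("'").length);        // prints 0
    System.out.println(" ".split("\\s+").length);     // prints 0
    System.out.println(lastPart("d'angelo", "'"));    // prints angelo
    System.out.println(lastPart("'", "'").isEmpty()); // prints true
  }
}
```

A guard like this would only work around the symptom on the caller's side; as the description notes, the fix itself belongs upstream in the encoder.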