[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224604#comment-17224604 ]
Martin Demberger commented on LUCENE-8183: ------------------------------------------ Hi Uwe, is there any way I can fasten the merge. We have a few customer which wait for this fix because our search very often falls into this pithole. > HyphenationCompoundWordTokenFilter creates overlapping tokens with > onlyLongestMatch enabled > ------------------------------------------------------------------------------------------- > > Key: LUCENE-8183 > URL: https://issues.apache.org/jira/browse/LUCENE-8183 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis > Affects Versions: 6.6 > Environment: Configuration of the analyzer: > <tokenizer class="solr.WhitespaceTokenizerFactory"/> > <filter class="solr.LowerCaseFilterFactory"/> > <filter class="solr.HyphenationCompoundWordTokenFilterFactory" > hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1" > dictionary="lang/wordlist_de.txt" > onlyLongestMatch="true"/> > > Reporter: Rupert Westenthaler > Assignee: Uwe Schindler > Priority: Major > Attachments: LUCENE-8183_20180223_rwesten.diff, > LUCENE-8183_20180227_rwesten.diff, lucene-8183.zip > > > The HyphenationCompoundWordTokenFilter creates overlapping tokens even if > onlyLongestMatch is enabled. > Example: > Dictionary: {{gesellschaft}}, {{schaft}} > Hyphenator: {{de_DR.xml}} //from Apche Offo > onlyLongestMatch: true > > |text|gesellschaft|gesellschaft|schaft| > |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 > 61 66 74]|[73 63 68 61 66 74]| > |start|0|0|0| > |end|12|12|12| > |positionLength|1|1|1| > |type|word|word|word| > |position|1|1|1| > IMHO this includes 2 unexpected Tokens > # the 2nd 'gesellschaft' as it duplicates the original token > # the 'schaft' as it is a sub-token 'gesellschaft' that is present in the > dictionary > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org