Re: [I] DictionaryCompoundWordTokenFilter should respect minSubwordSize also for fragments [LUCENE-6809] [lucene]

via GitHub Thu, 20 Feb 2025 13:22:53 -0800


renatoh commented on issue #7867:
URL: https://github.com/apache/lucene/issues/7867#issuecomment-2672713650


   In my opinion, the root issue is that DictionaryCompoundWordTokenFilter does
not consume the characters of a subword once it has been found.
   As an example, take the German word Schweinefleisch (literally "pig
meat"): after finding the word Schwein, we should not reuse the characters of
"Schwein" and should only process the remaining token "efleisch". From that
remainder we would then extract fleisch. I think the best result would be
achieved by sorting the dictionary by word length and starting with the
longest entry; once a subword is found, we consume the matching chars and
process the rest.
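
   To make the idea concrete, here is a minimal standalone sketch of such a
consuming, longest-first decomposition. It is only an illustration and not
how DictionaryCompoundWordTokenFilter is implemented: the class
GreedyDecompounder and the method decompose are hypothetical names, the
dictionary is assumed to contain lowercase entries, and a character that
starts no dictionary word (such as the linking "e") is simply skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.
final class GreedyDecompounder {

    // Scans the token left to right. At each position the longest dictionary
    // word that starts there is emitted and its characters are consumed, so
    // they can never be reused for a shorter, overlapping subword.
    static List<String> decompose(String token, List<String> dictionary, int minSubwordSize) {
        // Sort a copy of the dictionary by descending word length (assumption:
        // entries are already lowercase).
        List<String> sorted = new ArrayList<>(dictionary);
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));

        String lower = token.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < lower.length()) {
            String match = null;
            for (String word : sorted) {
                if (word.length() >= minSubwordSize && lower.startsWith(word, pos)) {
                    match = word; // first hit is the longest, because of the sort
                    break;
                }
            }
            if (match == null) {
                pos++; // no word starts here, e.g. the linking "e"; skip one char
            } else {
                parts.add(match);
                pos += match.length(); // consume the matched characters
            }
        }
        return parts;
    }
}
```

   With the dictionary [schwein, fleisch, wein, eis] and a minSubwordSize of
2, decompose("Schweinefleisch", ...) yields [schwein, fleisch]. The entries
"wein" and "eis" are not emitted, because their characters were already
consumed by the longer matches.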
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


