renatoh commented on issue #7867: URL: https://github.com/apache/lucene/issues/7867#issuecomment-2672713650
In my opinion, the root issue is that DictionaryCompoundWordTokenFilter does not consume the characters of a word once it has found it. Take the German word Schweinefleisch (literally "pig meat"): after finding Schwein, we should not reuse the characters of "Schwein" but continue only with the remainder of the token, "efleisch". From that remainder we would then extract fleisch. I think the best result would be achieved by sorting the dictionary by word length and starting with the longest entry; once a word is found, we consume the matching characters and process the rest.
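
To make the proposal concrete, here is a rough sketch in plain Java of consuming matched characters with a longest-first dictionary. It is not Lucene code and does not touch DictionaryCompoundWordTokenFilter: the class `GreedyDecompounder`, the method `decompose`, and the sample dictionary are made up for illustration. It also assumes a lowercase dictionary and a left-to-right scan that skips characters no entry starts at, which is only one possible reading of the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.
class GreedyDecompounder {

    // Returns the dictionary words found in `token`, scanning left to right and
    // never reusing a character that an earlier match has already consumed.
    // Assumes the dictionary entries are lowercase.
    static List<String> decompose(String token, List<String> dictionary) {
        // Longest entries first, so "schwein" is tried before "wein" or "ei".
        List<String> byLength = new ArrayList<>(dictionary);
        byLength.sort((a, b) -> Integer.compare(b.length(), a.length()));

        String lower = token.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < lower.length()) {
            String match = null;
            for (String word : byLength) {
                if (!word.isEmpty() && lower.startsWith(word, pos)) {
                    match = word;
                    break; // first hit is the longest entry starting at pos
                }
            }
            if (match != null) {
                parts.add(match);
                pos += match.length(); // consume the matched characters
            } else {
                pos++; // no entry starts here, e.g. the linking "e"
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("schwein", "fleisch", "wein", "ei");
        System.out.println(decompose("Schweinefleisch", dictionary)); // prints [schwein, fleisch]
    }
}
```

With this dictionary, "wein" and "ei" are never emitted for Schweinefleisch, because their characters are already consumed by "schwein" and "fleisch". A purely greedy scan like this can still pick a poor split when a long entry swallows the start of the next word, so it is meant as a starting point for discussion rather than a finished algorithm.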