mikemccand commented on issue #14624: URL: https://github.com/apache/lucene/issues/14624#issuecomment-2894460764
To address your 2nd idea (incrementing the position for each sub-word in the compound word), I think we'd need to create a graph-aware `CompoundWordTokenFilter`. It would also set `PositionLengthAttribute`, so it could correctly express that your original token spans two positions: `sommer` at position 0, `kleid` at position 1, and `Sommerkleid` at position 0 but with a position length of 2. We do have a graph-aware synonym filter (`SynonymGraphFilter`) ... I wonder if we could enhance that to accept a `HyphenationTree`? Or maybe we could rewrite the German compounding rules as synonyms and use `SynonymGraphFilter` directly? My [long-ago blog post talks about understanding Lucene's TokenStreams as graphs](https://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html), but not all TokenStreams create a graph.
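To make the position arithmetic concrete, here is a rough sketch in plain Java of the graph such a filter might produce for `Sommerkleid`. It deliberately does not use Lucene's API: `TokenGraphSketch`, `Token`, `Arc`, `toArcs` and `sommerkleid` are hypothetical names made up for illustration, and the two integer fields only mirror the roles that position increment and position length play in a TokenStream. The emission order (compound first, then its parts) is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenGraphSketch {

    /** Hypothetical stand-in for one emitted token; not a Lucene class. */
    public static final class Token {
        public final String term;
        public final int positionIncrement; // distance from the previous token's position
        public final int positionLength;    // number of positions this token spans

        public Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    /** One arc of the token graph: term goes from node 'from' to node 'to'. */
    public static final class Arc {
        public final String term;
        public final int from;
        public final int to;

        public Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    /** Turns increments and lengths into absolute start and end positions. */
    public static List<Arc> toArcs(List<Token> tokens) {
        List<Arc> arcs = new ArrayList<>();
        int position = -1; // the first increment of 1 lands on position 0
        for (Token t : tokens) {
            position += t.positionIncrement;
            arcs.add(new Arc(t.term, position, position + t.positionLength));
        }
        return arcs;
    }

    /** Assumed output of a graph-aware decompounder for "Sommerkleid". */
    public static List<Token> sommerkleid() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("Sommerkleid", 1, 2)); // position 0, spans two positions
        tokens.add(new Token("sommer", 0, 1));      // stacked at position 0
        tokens.add(new Token("kleid", 1, 1));       // advances to position 1
        return tokens;
    }

    public static void main(String[] args) {
        for (Arc a : toArcs(sommerkleid())) {
            System.out.println(a.term + ": " + a.from + " -> " + a.to);
        }
    }
}
```

In this model the compound and the path `sommer` followed by `kleid` both start at node 0 and end at node 2, which is what lets a phrase query match either form. If every token had a position length of 1, `Sommerkleid` would end at node 1 and the graph would no longer say that it covers both parts.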