Re: [I] HyphenationCompoundWordTokenFilter fixed token position and preserves original token [lucene]

via GitHub Tue, 20 May 2025 06:42:09 -0700


mikemccand commented on issue #14624:
URL: https://github.com/apache/lucene/issues/14624#issuecomment-2894460764


   To address your 2nd idea (incrementing the position for each sub-word in the 
compound word), I think we'd need to create a graph-aware 
`CompoundWordTokenFilter`.  It would also set `PositionLengthAttribute`, so it 
could correctly express that your original token spans two positions: 
`sommer` at position 0, `kleid` at position 1, and `Sommerkleid` at 
position 0 but with a position length of 2.
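
   To make the position arithmetic concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: the `Token` and `Arc` classes and the `decompose` and `resolve` methods are hypothetical stand-ins, and the two integer fields only model the values that `PositionIncrementAttribute` and `PositionLengthAttribute` would carry. The emission order (original token first, then the sub-words) is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CompoundGraphSketch {
    // Hypothetical stand-in for one emitted token: the term plus the values a
    // graph-aware filter would put in the position increment and position
    // length attributes.
    static final class Token {
        final String term;
        final int positionIncrement;
        final int positionLength;

        Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    // A resolved arc in the token graph: the term leaves node `from` and
    // arrives at node `to`.
    static final class Arc {
        final String term;
        final int from;
        final int to;

        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    // Tokens for "Sommerkleid" as described above (assumed emission order).
    static List<Token> decompose() {
        return Arrays.asList(
            new Token("Sommerkleid", 1, 2), // position 0, spans two positions
            new Token("sommer", 0, 1),      // stacked on position 0
            new Token("kleid", 1, 1));      // advances to position 1
    }

    // Turn increments and lengths into absolute start and end positions.
    static List<Arc> resolve(List<Token> tokens) {
        List<Arc> arcs = new ArrayList<>();
        int position = -1; // the first increment of 1 yields position 0
        for (Token t : tokens) {
            position += t.positionIncrement;
            arcs.add(new Arc(t.term, position, position + t.positionLength));
        }
        return arcs;
    }

    public static void main(String[] args) {
        for (Arc a : resolve(decompose())) {
            System.out.println(a.term + ": " + a.from + " -> " + a.to);
        }
    }
}
```

   Under these assumptions `Sommerkleid` runs from node 0 to node 2, while `sommer` (0 to 1) and `kleid` (1 to 2) form a parallel path between the same two nodes, which is what lets a phrase query match either form.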
   
   We do have a graph-aware synonym filter (`SynonymGraphFilter`) ... I wonder 
if we could enhance that to accept a `HyphenationTree`?  Or maybe we could 
rewrite the German compounding rules as synonyms and use `SynonymGraphFilter` 
directly?
   
   My [long-ago blog post talks about understanding Lucene's TokenStreams as 
graphs](https://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html),
 but not all TokenStreams create a graph.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

