jetzerv opened a new issue, #14624:
URL: https://github.com/apache/lucene/issues/14624

   ### Description
   
   The `HyphenationCompoundWordTokenFilter` is the recommended decompounder for Germanic languages, as stated in the [Elasticsearch Docs](https://www.elastic.co/guide/en/elasticsearch/reference/8.18/analysis-hyp-decomp-tokenfilter.html).
 
   However, the decompounding doesn't work as expected for my use case. Let me explain with an example:
   
   1. The user searches for 'Sommerkleid' in a webshop (German for 'summer dress').
   2. Decompounding the word 'Sommerkleid' returns 'sommerkleid', 'sommer' and 'kleid'. (All 3 tokens are at `position: 0`.)
   3. Since all tokens are at position 0, the customer gets products that contain 'sommer' OR 'kleid' OR 'sommerkleid', although the customer was searching for both terms, not either one. This leads to random products that are not a 'kleid', but are categorized as 'sommer' products.
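   To illustrate the current behavior, here is a minimal toy sketch (not Lucene's actual API; the `Token` record and `absolutePositions` helper are hypothetical names for illustration). It mimics how consumers of a token stream turn Lucene-style position increments into absolute positions, showing that split tokens stacked with increment 0 all land on the original token's position:

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class PositionDemo {
       // Hypothetical token carrying a position increment, mirroring the
       // concept of Lucene's PositionIncrementAttribute.
       record Token(String term, int posIncrement) {}

       // Compute absolute positions from increments, as a consumer of the
       // token stream would see them.
       static List<Integer> absolutePositions(List<Token> tokens) {
           List<Integer> positions = new ArrayList<>();
           int pos = -1;
           for (Token t : tokens) {
               pos += t.posIncrement();
               positions.add(pos);
           }
           return positions;
       }

       public static void main(String[] args) {
           // Current behavior: split tokens get increment 0 and stack on the
           // original compound's position.
           List<Token> current = List.of(
               new Token("sommerkleid", 1),
               new Token("sommer", 0),
               new Token("kleid", 0));
           System.out.println(absolutePositions(current)); // [0, 0, 0]
       }
   }
   ```

   Because all three tokens occupy the same position, a query built from them matches documents containing any one of the terms, which is effectively OR semantics.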
   
   Ideally there would be two extra properties:
   1. Exclude the original token from the output (default false for backwards compatibility).
   2. Increase the position for split tokens ('sommer' would be at position 0, 'kleid' at position 1).
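   The proposed behavior can be sketched the same way (again a hypothetical illustration, not an existing Lucene option): with the original token excluded and each split token given a position increment of 1, the parts occupy consecutive positions, so a downstream query can require both terms:

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class ProposedPositionsDemo {
       // Hypothetical token carrying a position increment (illustration only).
       record Token(String term, int posIncrement) {}

       // Turn increments into absolute positions.
       static List<Integer> absolutePositions(List<Token> tokens) {
           List<Integer> positions = new ArrayList<>();
           int pos = -1;
           for (Token t : tokens) {
               pos += t.posIncrement();
               positions.add(pos);
           }
           return positions;
       }

       public static void main(String[] args) {
           // Proposed: drop the original compound and give each split part
           // its own position via increment 1.
           List<Token> proposed = List.of(
               new Token("sommer", 1),
               new Token("kleid", 1));
           System.out.println(absolutePositions(proposed)); // [0, 1]
       }
   }
   ```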
   
   Would it be possible to add this? I already saw a related issue from 5 years ago -> https://github.com/apache/lucene/issues/10625, but it was not implemented back then.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

