jetzerv opened a new issue, #14624: URL: https://github.com/apache/lucene/issues/14624
### Description

The `HyphenationCompoundWordTokenFilter` is the decompounder recommended for Germanic languages by Elastic ([Elasticsearch Docs](https://www.elastic.co/guide/en/elasticsearch/reference/8.18/analysis-hyp-decomp-tokenfilter.html)). However, the decompounding doesn't work as expected for my use case. Let me explain with an example:

1. The user searches for 'Sommerkleid' in a webshop (German for summer dress).
2. Decompounding the word 'Sommerkleid' returns 'sommerkleid', 'sommer', and 'kleid'. (All three tokens are at `position: 0`.)
3. Since all tokens are at position 0, the customer gets products that contain 'sommer' OR 'kleid' OR 'sommerkleid', although the customer was searching for both terms, not either one. This leads to random products that are not a 'kleid' but are categorized as 'sommer' products.

Ideally there would be two extra properties to:

1. exclude the initial token from the output (default false for backwards compatibility)
2. increase the position for split tokens ('sommer' would be pos: 0, 'kleid' would be pos: 1)

Would this be possible to add? I already saw a related issue from 5 years ago -> https://github.com/apache/lucene/issues/10625, but it was not implemented back then.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
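To make the position behavior described above concrete, here is a minimal sketch (plain Python, not Lucene code; the `positions` helper is hypothetical) of how position increments turn a token stream into absolute positions, and why subtokens emitted with an increment of 0 all stack on the same position as the original token:

```python
# Hypothetical sketch: models how a consumer of a token stream derives
# absolute positions from per-token position increments, as Lucene's
# PositionIncrementAttribute semantics define them.

def positions(tokens):
    """tokens: list of (term, position_increment) pairs in stream order.
    Returns a list of (term, absolute_position) pairs."""
    pos = -1  # first token with increment 1 lands on position 0
    out = []
    for term, inc in tokens:
        pos += inc
        out.append((term, pos))
    return out

# Behavior reported in the issue: subtokens are stacked on the original
# token (increment 0), so all three terms share position 0 and a query
# built from this stream matches 'sommer' OR 'kleid' OR 'sommerkleid'.
current = [("sommerkleid", 1), ("sommer", 0), ("kleid", 0)]
print(positions(current))   # all three terms at position 0

# Proposed behavior: drop the original token and advance the position
# for each subtoken, so the resulting query would require both terms.
proposed = [("sommer", 1), ("kleid", 1)]
print(positions(proposed))  # 'sommer' at 0, 'kleid' at 1
```

This is only an illustration of the increment arithmetic, not of the filter's internals; the two requested options would amount to controlling which `(term, increment)` pairs the filter emits.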