jetzerv opened a new issue, #14624:
URL: https://github.com/apache/lucene/issues/14624

   ### Description
   
   The `HyphenationCompoundWordTokenFilter` is the recommended decompounder for Germanic languages, as stated in the [Elasticsearch Docs](https://www.elastic.co/guide/en/elasticsearch/reference/8.18/analysis-hyp-decomp-tokenfilter.html).
 
   However, the decompounding doesn't work as expected for my use case. Let me explain with an example:
   
   1. The user searches for 'Sommerkleid' in a webshop (German for 'summer dress').
   2. Decompounding the word 'Sommerkleid' returns 'sommerkleid', 'sommer' and 'kleid'. (All 3 tokens are at `position: 0`.)
   3. Since all tokens are at position 0, the customer gets products that contain 'sommer' OR 'kleid' OR 'sommerkleid', although the customer was searching for both terms, not either one. This leads to random products that are not a 'kleid', but are categorized as 'sommer' products.
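   To illustrate the current behavior, here is a minimal toy sketch (not Lucene's actual API; the `Token` record and `absolutePositions` helper are hypothetical names for illustration). It mimics how consumers of a token stream turn Lucene-style position increments into absolute positions, showing that split tokens stacked with increment 0 all land on the original token's position:

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class PositionDemo {
       // Hypothetical token carrying a position increment, mirroring the
       // concept of Lucene's PositionIncrementAttribute.
       record Token(String term, int posIncrement) {}

       // Compute absolute positions from increments, as a consumer of the
       // token stream would see them.
       static List<Integer> absolutePositions(List<Token> tokens) {
           List<Integer> positions = new ArrayList<>();
           int pos = -1;
           for (Token t : tokens) {
               pos += t.posIncrement();
               positions.add(pos);
           }
           return positions;
       }

       public static void main(String[] args) {
           // Current behavior: split tokens get increment 0 and stack on the
           // original compound's position.
           List<Token> current = List.of(
               new Token("sommerkleid", 1),
               new Token("sommer", 0),
               new Token("kleid", 0));
           System.out.println(absolutePositions(current)); // [0, 0, 0]
       }
   }
   ```

   Because all three tokens occupy the same position, a query built from them matches documents containing any one of the terms, which is effectively OR semantics.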
   
   Ideally there would be two extra properties:
   1. Exclude the original token from the output (default false for backwards compatibility).
   2. Increase the position for split tokens ('sommer' would be at position 0, 'kleid' at position 1).
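   The proposed behavior can be sketched the same way (again a hypothetical illustration, not an existing Lucene option): with the original token excluded and each split token given a position increment of 1, the parts occupy consecutive positions, so a downstream query can require both terms:

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class ProposedPositionsDemo {
       // Hypothetical token carrying a position increment (illustration only).
       record Token(String term, int posIncrement) {}

       // Turn increments into absolute positions.
       static List<Integer> absolutePositions(List<Token> tokens) {
           List<Integer> positions = new ArrayList<>();
           int pos = -1;
           for (Token t : tokens) {
               pos += t.posIncrement();
               positions.add(pos);
           }
           return positions;
       }

       public static void main(String[] args) {
           // Proposed: drop the original compound and give each split part
           // its own position via increment 1.
           List<Token> proposed = List.of(
               new Token("sommer", 1),
               new Token("kleid", 1));
           System.out.println(absolutePositions(proposed)); // [0, 1]
       }
   }
   ```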
   
   Would it be possible to add this? I already saw a related issue from 5 years ago -> https://github.com/apache/lucene/issues/10625, but it was not implemented back then.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

