Thanks, Alex! We'll look into this.

--
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com

On 9/3/19, 4:27 PM, "Alexandre Rafalovitch" <arafa...@gmail.com> wrote:

What about combining:
1) KeywordRepeatFilterFactory
2) An existing folding filter (need to check it ignores Keyword marked word)
3) RemoveDuplicatesTokenFilterFactory

That may give what you are after without custom coding.

Regards,
   Alex.

On Tue, 3 Sep 2019 at 16:14, Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>
> Toke,
>
> Thank you! That makes a lot of sense.
>
> In other news -- we just had a meeting where we decided to try out a hybrid strategy. I'd love to know what you & everyone else thinks...
>
> - Since we are concerned with the overhead created by "double-fielding" all tokens per language (because I'm not sure how we'd work the logic into Solr to only double-field when an accent is present), we are going to try to do something along the lines of synonym-expansion:
>     - We are going to build a custom plugin that detects diacritics -- upon detection, the plugin would expand the token to both its original form and its ascii-folded term (a la Toke's approach).
>     - However, since we are doing it in a way that mimics synonym expansion, we are going to keep both terms in a single field
>
> The main issue we are anticipating with the above strategy surrounds scoring. Since we will be increasing the frequency of accented terms, we might bias our page ranker...
>
> Has anyone done anything similar (and/or does anyone think this idea is totally the wrong way to go?)
>
> Best,
> Audrey
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 9/3/19, 2:58 PM, "Toke Eskildsen" <t...@kb.dk> wrote:
>
> Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> > Do you find that searching over both the original title field and the normalized title
> > field increases the time it takes for your search engine to retrieve results?
>
> It is not something we have measured as that index is fast enough (which in this context means that we're practically always waiting for the result from an external service that is issued in parallel with the call to our Solr server).
>
> Technically it's not different from searching across other fields defined in the eDismax setup, so I guess it boils down to "how many fields can you afford to search across?", where our organization's default answer is "as many as we need to get quality matches. Make it work Toke, chop chop". On a more serious note, it is not something I would worry about unless we're talking some special high-performance setup with a budget for tuning: Matching terms and joining filters is core Solr (Lucene really) functionality. Plain query & filter-matching time tend to be dwarfed by aggregations (grouping, faceting, stats).
>
> - Toke Eskildsen
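
For readers following the thread, here is a rough Python sketch of what the three-step chain Alex suggests (keyword repeat, folding, duplicate removal) would do to a token stream. It is a toy simulation, not Solr or Lucene code: the Token type and every function name below are made up for illustration, and the folding step is a simple Unicode-decomposition stand-in rather than Solr's own folding filter. It also assumes the folding step skips keyword-marked tokens, which is exactly the point Alex says still needs checking.

```python
import unicodedata
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """Hypothetical stand-in for an analysis-chain token."""
    text: str
    position: int
    keyword: bool  # True means "protected from later modification"


def keyword_repeat(tokens: List[Token]) -> List[Token]:
    # Emit every token twice at the same position:
    # first a keyword-marked copy, then an unmarked copy.
    out = []
    for t in tokens:
        out.append(Token(t.text, t.position, True))
        out.append(Token(t.text, t.position, False))
    return out


def fold(text: str) -> str:
    # Stand-in for ASCII folding: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold_unless_keyword(tokens: List[Token]) -> List[Token]:
    # ASSUMPTION: the folding step leaves keyword-marked tokens alone.
    return [t if t.keyword else t._replace(text=fold(t.text)) for t in tokens]


def remove_duplicates(tokens: List[Token]) -> List[Token]:
    # Drop a token if the same text was already seen at the same position.
    seen = set()
    out = []
    for t in tokens:
        key = (t.position, t.text)
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out


def analyze(text: str) -> List[Tuple[int, str]]:
    # Whitespace tokenization plus lowercasing, then the three-step chain.
    tokens = [Token(w, i, False) for i, w in enumerate(text.lower().split())]
    chain = remove_duplicates(fold_unless_keyword(keyword_repeat(tokens)))
    return [(t.position, t.text) for t in chain]
```

Under those assumptions, analyze("caf\u00e9 menu") keeps both the accented and the folded form at position 0, while the unaccented "menu" appears only once. So only tokens that actually carry a diacritic pick up an extra term, which is the behaviour Audrey's custom plugin was meant to provide, and both forms stay in a single field.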