Toke,

Thank you! That makes a lot of sense.

In other news -- we just had a meeting where we decided to try out a hybrid 
strategy. I'd love to know what you & everyone else think...

- Since we are concerned with the overhead created by "double-fielding" all 
tokens per language (because I'm not sure how we'd work the logic into Solr to 
only double-field when an accent is present), we are going to try something 
along the lines of synonym expansion:
        - We are going to build a custom plugin that detects diacritics -- upon 
detection, the plugin would expand the token to both its original form and its 
ASCII-folded form (a la Toke's approach).
        - However, since we are doing it in a way that mimics synonym 
expansion, we are going to keep both terms in a single field (rough sketch 
below).
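
To make that concrete, here is a rough sketch of the kind of Lucene 
TokenFilter we have in mind (the class name and details are hypothetical, not 
our actual plugin). It leans on Lucene's existing 
ASCIIFoldingFilter.foldToASCII helper and is close in spirit to 
ASCIIFoldingFilter's preserveOriginal option: the folded copy is stacked at 
the same position as the original (position increment 0), synonym-style.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Emits an ASCII-folded copy of a token at the same position, but only when
 *  folding actually changes the term, i.e. when a diacritic is present. */
public final class DiacriticExpansionFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);

  private State savedState;                 // original token, pending re-emit
  private char[] folded = new char[64];
  private int foldedLength;

  public DiacriticExpansionFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (savedState != null) {
      // Second call for this token: emit the folded variant at posInc 0,
      // mimicking synonym expansion within the same field.
      restoreState(savedState);
      savedState = null;
      termAtt.copyBuffer(folded, 0, foldedLength);
      posIncAtt.setPositionIncrement(0);
      return true;
    }

    if (!input.incrementToken()) {
      return false;
    }

    final char[] term = termAtt.buffer();
    final int length = termAtt.length();

    // foldToASCII may write up to 4 output chars per input char.
    if (folded.length < length * 4) {
      folded = new char[length * 4];
    }
    foldedLength = ASCIIFoldingFilter.foldToASCII(term, 0, folded, 0, length);

    // Only schedule the extra token if folding changed anything.
    boolean changed = (foldedLength != length);
    for (int i = 0; !changed && i < length; i++) {
      changed = (term[i] != folded[i]);
    }
    if (changed) {
      savedState = captureState();          // folded copy goes out on next call
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    savedState = null;
  }
}

The factory side would just be a small TokenFilterFactory subclass so the 
filter can be referenced from the fieldType's analyzer chain in the schema.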

The main issue we anticipate with the above strategy is scoring. Since 
expansion will increase term frequencies (and field lengths) for documents 
containing accented terms, we might bias our page ranker...

Has anyone done anything similar? (And/or does anyone think this idea is 
totally the wrong way to go?)

Best,
Audrey

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 9/3/19, 2:58 PM, "Toke Eskildsen" <t...@kb.dk> wrote:

    Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> 
wrote:
    > Do you find that searching over both the original title field and the 
normalized title
    > field increases the time it takes for your search engine to retrieve 
results?
    
    It is not something we have measured as that index is fast enough (which in 
this context means that we're practically always waiting for the result from an 
external service that is issued in parallel with the call to our Solr server).
    
    Technically it's not different from searching across other fields defined 
in the eDismax setup, so I guess it boils down to "how many fields can you 
afford to search across?", where our organization's default answer is "as many 
as we need to get quality matches. Make it work Toke, chop chop". On a more 
serious note, it is not something I would worry about unless we're talking some 
special high-performance setup with a budget for tuning: Matching terms and 
joining filters is core Solr (Lucene really) functionality. Plain query & 
filter-matching time tend to be dwarfed by aggregations (grouping, faceting, 
stats).
    
    - Toke Eskildsen
    
