Re: Re: Re: Re: Multi-lingual Search & Accent Marks

Audrey Lorberfeld - audrey.lorberf...@ibm.com Wed, 04 Sep 2019 07:14:35 -0700

Thanks, Alex! We'll look into this.

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com


On 9/3/19, 4:27 PM, "Alexandre Rafalovitch" <arafa...@gmail.com> wrote:

    What about combining:
    1) KeywordRepeatFilterFactory
    2) An existing folding filter (need to check it ignores Keyword marked word)
    3) RemoveDuplicatesTokenFilterFactory
    
    That may give what you are after without custom coding.
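
As a rough illustration of the three-step chain above, here is a small Python simulation of the token flow. It is a sketch, not Solr code: the Token class and the function names are hypothetical stand-ins, the folding is a simple combining-mark strip rather than Lucene's ASCII folding table, and it assumes the folding step skips keyword-marked tokens, which is exactly the point flagged above as needing a check.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    position: int
    keyword: bool = False  # hypothetical stand-in for Lucene's keyword marker


def fold(text):
    """Strip combining marks after NFD decomposition (rough stand-in for ASCII folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def keyword_repeat(tokens):
    """Step 1: emit each token twice at the same position, keyword-marked copy first."""
    for tok in tokens:
        yield Token(tok.text, tok.position, keyword=True)
        yield Token(tok.text, tok.position, keyword=False)


def folding(tokens):
    """Step 2: fold only the copies that are NOT keyword-marked (assumed behaviour)."""
    for tok in tokens:
        yield tok if tok.keyword else Token(fold(tok.text), tok.position)


def remove_duplicates(tokens):
    """Step 3: drop a token whose text was already emitted at the same position."""
    seen = set()
    for tok in tokens:
        key = (tok.position, tok.text)
        if key not in seen:
            seen.add(key)
            yield tok


def analyze(words):
    """Run the three steps and return (position, term) pairs."""
    stream = (Token(w, i) for i, w in enumerate(words))
    chain = remove_duplicates(folding(keyword_repeat(stream)))
    return [(t.position, t.text) for t in chain]


if __name__ == "__main__":
    print(analyze(["caf\u00e9", "menu"]))
```

In this simulation only the accented word ends up with two terms at its position; the unaccented word collapses back to one after de-duplication. If the real folding filter turns out not to honour the keyword marker, the preserveOriginal option of ASCIIFoldingFilterFactory may be worth comparing against this chain.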
    
    Regards,
       Alex.
    
    On Tue, 3 Sep 2019 at 16:14, Audrey Lorberfeld -
    audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
    >
    > Toke,
    >
    > Thank you! That makes a lot of sense.
    >
    > In other news -- we just had a meeting where we decided to try out a
    > hybrid strategy. I'd love to know what you & everyone else think...
    >
    > - Since we are concerned with the overhead created by "double-fielding"
    >   all tokens per language (because I'm not sure how we'd work the logic
    >   into Solr to only double-field when an accent is present), we are going
    >   to try to do something along the lines of synonym expansion:
    >         - We are going to build a custom plugin that detects diacritics
    >           -- upon detection, the plugin would expand the token to both its
    >           original form and its ASCII-folded term (a la Toke's approach).
    >         - However, since we are doing it in a way that mimics synonym
    >           expansion, we are going to keep both terms in a single field.
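
The expansion described in the two bullets above could look roughly like the following Python sketch. It is not the plugin itself: the function names are hypothetical, diacritics are detected as Unicode combining marks after NFD decomposition, and characters that do not decompose that way (for example "ø" or "ß") are not handled here.

```python
import unicodedata


def fold(token):
    """Remove combining marks after NFD decomposition (simplified folding)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def has_diacritics(token):
    """True if the token contains at least one combining mark once decomposed."""
    decomposed = unicodedata.normalize("NFD", token)
    return any(unicodedata.combining(ch) for ch in decomposed)


def expand(tokens):
    """Return (position, term) pairs for a single field.

    A token with diacritics yields its original form and its folded form at
    the same position, the way a synonym expansion would; every other token
    is emitted once.
    """
    out = []
    for position, token in enumerate(tokens):
        out.append((position, token))
        if has_diacritics(token):
            out.append((position, fold(token)))
    return out


if __name__ == "__main__":
    print(expand(["se\u00f1or", "taco"]))
```

Only tokens that actually carry a diacritic get a second term, so unaccented text is indexed exactly as before.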
    >
    > The main issue we are anticipating with the above strategy surrounds
    > scoring. Since we will be increasing the frequency of accented terms, we
    > might bias our page ranker...
    >
    > Has anyone done anything similar (and/or does anyone think this idea is
    > totally the wrong way to go)?
    >
    > Best,
    > Audrey
    >
    > --
    > Audrey Lorberfeld
    > Data Scientist, w3 Search
    > IBM
    > audrey.lorberf...@ibm.com
    >
    >
    > On 9/3/19, 2:58 PM, "Toke Eskildsen" <t...@kb.dk> wrote:
    >
    >     Audrey Lorberfeld - audrey.lorberf...@ibm.com
    >     <audrey.lorberf...@ibm.com> wrote:
    >     > Do you find that searching over both the original title field and
    >     > the normalized title field increases the time it takes for your
    >     > search engine to retrieve results?
    >
    >     It is not something we have measured as that index is fast enough
    >     (which in this context means that we're practically always waiting
    >     for the result from an external service that is issued in parallel
    >     with the call to our Solr server).
    >
    >     Technically it's not different from searching across other fields
    >     defined in the eDismax setup, so I guess it boils down to "how many
    >     fields can you afford to search across?", where our organization's
    >     default answer is "as many as we need to get quality matches. Make it
    >     work Toke, chop chop". On a more serious note, it is not something I
    >     would worry about unless we're talking some special high-performance
    >     setup with a budget for tuning: Matching terms and joining filters is
    >     core Solr (Lucene really) functionality. Plain query & filter-matching
    >     time tends to be dwarfed by aggregations (grouping, faceting, stats).
    >
    >     - Toke Eskildsen
    >
    >
