Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-04 Thread Walter Underwood
On Sep 3, 2019, at 1:13 PM, Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote: > > The main issue we are anticipating with the above strategy surrounds scoring. > Since we will be increasing the frequency of accented terms, we might bias > our page ranker... You will not be increasing the f

Re: Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-04 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thanks, Alex! We'll look into this. -- Audrey Lorberfeld Data Scientist, w3 Search IBM audrey.lorberf...@ibm.com On 9/3/19, 4:27 PM, "Alexandre Rafalovitch" wrote: What about combining: 1) KeywordRepeatFilterFactory 2) An existing folding filter (need to check it ignores Keyword

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Alexandre Rafalovitch
What about combining: 1) KeywordRepeatFilterFactory 2) An existing folding filter (need to check that it ignores keyword-marked tokens) 3) RemoveDuplicatesTokenFilterFactory That may give what you are after without custom coding. Regards, Alex. On Tue, 3 Sep 2019 at 16:14, Audrey Lorberfeld - audrey
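
As a rough illustration of the three-step chain suggested in this message, here is a plain-Python simulation (not Solr code). It assumes the folding step leaves the keyword-marked copy of each token untouched, which is the very point the message says still needs checking. The helper names `fold` and `analyze` are hypothetical, and the Unicode-decomposition folding is only a stand-in for whichever folding filter is actually configured.

```python
import unicodedata


def fold(token: str) -> str:
    """Strip combining marks after decomposition (stand-in for a folding filter)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(tokens):
    """Simulate: repeat each token, fold only the non-keyword copy, drop duplicates.

    Returns (position, term) pairs; both variants of a token share one position.
    """
    out = []
    for position, token in enumerate(tokens):
        # Step 1: the repeat step yields a keyword-marked copy and a normal copy.
        # Step 2: only the normal copy is folded (assumption, see lead-in).
        variants = [token, fold(token)]
        # Step 3: remove duplicates that sit at the same position.
        seen = set()
        for term in variants:
            if term not in seen:
                seen.add(term)
                out.append((position, term))
    return out


print(analyze(["köket", "bord"]))
# [(0, 'köket'), (0, 'koket'), (1, 'bord')]
```

Unaccented tokens such as "bord" produce a single term, so only tokens that actually change under folding add an extra entry at their position.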

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Toke, Thank you! That makes a lot of sense. In other news -- we just had a meeting where we decided to try out a hybrid strategy. I'd love to know what you & everyone else thinks... - Since we are concerned with the overhead created by "double-fielding" all tokens per language (because I'm not

Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Toke Eskildsen
Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote: > Do you find that searching over both the original title field and the > normalized title > field increases the time it takes for your search engine to retrieve results? It is not something we have measured as that index is fast enough (which

Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Toke, Do you find that searching over both the original title field and the normalized title field increases the time it takes for your search engine to retrieve results? -- Audrey Lorberfeld Data Scientist, w3 Search Digital Workplace Engineering CIO, Finance and Operations IBM audrey.lorberf

Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Languages are the best. Thank you all so much! -- Audrey Lorberfeld Data Scientist, w3 Search Digital Workplace Engineering CIO, Finance and Operations IBM audrey.lorberf...@ibm.com On 8/30/19, 4:09 PM, "Walter Underwood" wrote: The right transliteration for accents is language-dependen

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thank you, Erick! -- Audrey Lorberfeld Data Scientist, w3 Search Digital Workplace Engineering CIO, Finance and Operations IBM audrey.lorberf...@ibm.com On 8/30/19, 3:49 PM, "Erick Erickson" wrote: It Depends (tm). In this case on how sophisticated/precise your users are. If your users

Re: Multi-lingual Search & Accent Marks

2019-08-31 Thread Walter Underwood
> On Aug 31, 2019, at 12:00 PM, Toke Eskildsen wrote: > > Whenever we do this normalisation, we index two versions in our index: A very > lightly normalised (lowercased) field and a heavily normalised field: If a > record has a title "Köket" (kitchen in Swedish), we store title_orig:köket > an

Re: Multi-lingual Search & Accent Marks

2019-08-31 Thread Toke Eskildsen
Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote: > Just wanting to test the waters here – for those of you with search engines > that index multiple languages, do you use ASCII-folding in your schema? Our primary search engine is for Danish users, with sources being bibliographic records wit

Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Walter Underwood
The right transliteration for accents is language-dependent. In English, a diaeresis can be stripped because it is only used to mark neighboring vowels as independently pronounced. In German, the “typewriter umlaut” adds an “e”. English: coöperate -> cooperate German: Glück -> Glueck Some stemm
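
The two examples in this message (coöperate -> cooperate, Glück -> Glueck) can be sketched as a small language-aware function. This is a minimal sketch under stated assumptions: the function name `transliterate`, the language codes "en" and "de", and the mapping table are all illustrative, and the table covers only the German umlauts.

```python
import unicodedata

# Assumed mapping for the German "typewriter umlaut" convention (umlaut -> vowel + e).
GERMAN_UMLAUTS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def strip_marks(text: str) -> str:
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate(text: str, lang: str) -> str:
    """Fold accents using a rule chosen by language code (hypothetical API)."""
    # Compose first so the umlaut table matches single precomposed characters.
    text = unicodedata.normalize("NFC", text)
    if lang == "de":
        text = "".join(GERMAN_UMLAUTS.get(ch, ch) for ch in text)
    return strip_marks(text)


print(transliterate("coöperate", "en"))  # cooperate
print(transliterate("Glück", "de"))      # Glueck
```

Applying the English rule to the German word would give "Gluck" instead, which is why the language has to be known before folding.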

Re: Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Erick Erickson
It Depends (tm). In this case on how sophisticated/precise your users are. If your users are exclusively extremely conversant in the language and are expected to have keyboards that allow easy access to all the accents… then I might leave them in. In some cases removing them can change the meani

Re: Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Atita, Thanks for that insight! As the conversation has progressed, we are now leaning towards not having the ASCII-folding filter in our pipelines in order to keep marks like umlauts and tildes. Instead, we might add acute and grave accents to a file pointed at by the MappingCharFilterFactory

Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Atita Arora
We work on a German index. We neutralize accents before indexing (i.e. umlauts become 'ae', 'ue', etc.), and we do the same at query time so that terms match. On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote: > Hi All, > > Just wanting to test the waters her