subject:"Re\: Multi\-lingual Search \& Accent Marks"

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-04 Thread Walter Underwood

On Sep 3, 2019, at 1:13 PM, Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote:
>
> The main issue we are anticipating with the above strategy surrounds scoring.
> Since we will be increasing the frequency of accented terms, we might bias
> our page ranker...

You will not be increasing the f

Re: Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-04 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com

Thanks, Alex! We'll look into this. -- Audrey Lorberfeld Data Scientist, w3 Search IBM audrey.lorberf...@ibm.com On 9/3/19, 4:27 PM, "Alexandre Rafalovitch" wrote: What about combining: 1) KeywordRepeatFilterFactory 2) An existing folding filter (need to check it ignores Keyword

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Alexandre Rafalovitch

What about combining:
1) KeywordRepeatFilterFactory
2) An existing folding filter (need to check it ignores Keyword marked word)
3) RemoveDuplicatesTokenFilterFactory

That may give what you are after without custom coding.

Regards,
Alex.

On Tue, 3 Sep 2019 at 16:14, Audrey Lorberfeld - audrey
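
The three-step chain suggested above can be sketched as a small Python simulation. This is not Solr or Lucene code: the `Token` class and the helper functions are hypothetical stand-ins, and the sketch assumes the folding step skips keyword-marked tokens, which the message itself flags as something that still needs checking.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for an analysis-chain token."""
    text: str
    position: int
    keyword: bool = False


def keyword_repeat(tokens):
    """Emit each token twice at the same position: keyword-marked, then plain."""
    for tok in tokens:
        yield Token(tok.text, tok.position, keyword=True)
        yield Token(tok.text, tok.position, keyword=False)


def fold_accents(tokens):
    """Strip combining marks, but only from tokens not marked as keywords
    (assumption: the real folding filter would have to behave this way)."""
    for tok in tokens:
        if tok.keyword:
            yield tok
        else:
            nfd = unicodedata.normalize("NFD", tok.text)
            folded = "".join(c for c in nfd if not unicodedata.combining(c))
            yield Token(folded, tok.position, tok.keyword)


def remove_duplicates(tokens):
    """Drop tokens whose text was already seen at the same position."""
    seen = set()
    for tok in tokens:
        key = (tok.position, tok.text)
        if key not in seen:
            seen.add(key)
            yield tok


def analyze(words):
    """Lowercase, then run the three steps in order."""
    stream = (Token(w.lower(), i) for i, w in enumerate(words))
    chain = remove_duplicates(fold_accents(keyword_repeat(stream)))
    return [(t.position, t.text) for t in chain]


print(analyze(["Köket", "table"]))
# [(0, 'köket'), (0, 'koket'), (1, 'table')]
```

Under these assumptions, only words that actually carry accents end up with two tokens at one position; unaccented words collapse back to a single token.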

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com

Toke, Thank you! That makes a lot of sense. In other news -- we just had a meeting where we decided to try out a hybrid strategy. I'd love to know what you & everyone else thinks... - Since we are concerned with the overhead created by "double-fielding" all tokens per language (because I'm not

Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Toke Eskildsen

Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote:
> Do you find that searching over both the original title field and the
> normalized title field increases the time it takes for your search engine
> to retrieve results?

It is not something we have measured as that index is fast enough (which

Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com

Toke, Do you find that searching over both the original title field and the normalized title field increases the time it takes for your search engine to retrieve results? -- Audrey Lorberfeld Data Scientist, w3 Search Digital Workplace Engineering CIO, Finance and Operations IBM audrey.lorberf

Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com

Languages are the best. Thank you all so much! -- Audrey Lorberfeld Data Scientist, w3 Search Digital Workplace Engineering CIO, Finance and Operations IBM audrey.lorberf...@ibm.com On 8/30/19, 4:09 PM, "Walter Underwood" wrote: The right transliteration for accents is language-dependen

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com

Thank you, Erick! -- Audrey Lorberfeld Data Scientist, w3 Search Digital Workplace Engineering CIO, Finance and Operations IBM audrey.lorberf...@ibm.com On 8/30/19, 3:49 PM, "Erick Erickson" wrote: It Depends (tm). In this case on how sophisticated/precise your users are. If your users

Re: Multi-lingual Search & Accent Marks

2019-08-31 Thread Walter Underwood

> On Aug 31, 2019, at 12:00 PM, Toke Eskildsen wrote:
>
> Whenever we do this normalisation, we index two versions in our index: A very
> lightly normalised (lowercased) field and a heavily normalised field: If a
> record has a title "Köket" (kitchen in Swedish), we store title_orig:köket
> an
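
A minimal Python sketch of the two-field idea quoted above. The name of the second field is cut off in this snippet, so `title_norm` is a hypothetical stand-in, and the heavy normalisation used here (lowercase plus stripping combining marks) is an assumption rather than the poster's actual rule.

```python
import unicodedata


def heavy_normalise(text):
    """Lowercase and strip combining marks (assumed normalisation rule)."""
    nfd = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in nfd if not unicodedata.combining(c))


def index_fields(title):
    """Build both versions of the title for one record.
    `title_norm` is a hypothetical field name."""
    return {
        "title_orig": title.lower(),
        "title_norm": heavy_normalise(title),
    }


def matching_fields(doc, query):
    """Return the names of the fields that the query matches exactly."""
    hits = []
    if query.lower() == doc["title_orig"]:
        hits.append("title_orig")
    if heavy_normalise(query) == doc["title_norm"]:
        hits.append("title_norm")
    return hits


doc = index_fields("Köket")
print(doc)                            # {'title_orig': 'köket', 'title_norm': 'koket'}
print(matching_fields(doc, "köket"))  # ['title_orig', 'title_norm']
print(matching_fields(doc, "koket"))  # ['title_norm']
```

A query typed with the accent hits both fields, while a query typed without it hits only the heavily normalised one.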

Re: Multi-lingual Search & Accent Marks

2019-08-31 Thread Toke Eskildsen

Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote:
> Just wanting to test the waters here – for those of you with search engines
> that index multiple languages, do you use ASCII-folding in your schema?

Our primary search engine is for Danish users, with sources being bibliographic records wit

Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Walter Underwood

The right transliteration for accents is language-dependent. In English, a diaeresis can be stripped because it is only used to mark neighboring vowels as independently pronounced. In German, the “typewriter umlaut” adds an “e”.

English: coöperate -> cooperate
German: Glück -> Glueck

Some stemm
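
The two examples above can be reproduced with a short Python sketch of language-dependent folding. The function names and the language codes are hypothetical, and the eszett rule is an added assumption that the message does not mention.

```python
import unicodedata

# German "typewriter umlaut" expansion. The eszett entry is an assumption
# added for illustration; it is not part of the message above.
GERMAN_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def strip_marks(text):
    """Decompose to NFD and drop combining marks such as the diaeresis."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if not unicodedata.combining(c))


def fold(text, lang):
    """Fold accents according to the language of the text."""
    # Compose first so that decomposed input still matches the German map.
    text = unicodedata.normalize("NFC", text)
    if lang == "de":
        text = "".join(GERMAN_MAP.get(c, c) for c in text)
    return strip_marks(text)


print(fold("coöperate", "en"))  # cooperate
print(fold("Glück", "de"))      # Glueck
```

Applying the English rule to the German word would give "Gluck", which is why the language has to be known before folding.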

Re: Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Erick Erickson

It Depends (tm). In this case on how sophisticated/precise your users are. If your users are exclusively extremely conversant in the language and are expected to have keyboards that allow easy access to all the accents… then I might leave them in. In some cases removing them can change the meani

Re: Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com

Atita, Thanks for that insight! As the conversation has progressed, we are now leaning towards not having the ASCII-folding filter in our pipelines in order to keep marks like umlauts and tildes. Instead, we might add acute and grave accents to a file pointed at by the MappingCharFilterFactory

Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Atita Arora

We work on a German index. We neutralize accents before indexing, i.e. umlauts to 'ae', 'ue', etc., and we do the same at query time for an appropriate match.

On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote:
> Hi All,
>
> Just wanting to test the waters her

14 matches
