Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-04 Thread Walter Underwood
On Sep 3, 2019, at 1:13 PM, Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote: > > The main issue we are anticipating with the above strategy surrounds scoring. > Since we will be increasing the frequency of accented terms, we might bias > our page ranker... You will not be increasing the f

Re: Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-04 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thanks, Alex! We'll look into this. -- Audrey Lorberfeld Data Scientist, w3 Search IBM audrey.lorberf...@ibm.com On 9/3/19, 4:27 PM, "Alexandre Rafalovitch" wrote: What about combining: 1) KeywordRepeatFilterFactory 2) An existing folding filter (need to check it ignores Keyword

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Alexandre Rafalovitch
What about combining: 1) KeywordRepeatFilterFactory 2) An existing folding filter (need to check it ignores Keyword marked word) 3) RemoveDuplicatesTokenFilterFactory That may give what you are after without custom coding. Regards, Alex. On Tue, 3 Sep 2019 at 16:14, Audrey Lorberfeld - audrey
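The three-step chain suggested above can be simulated outside Solr to see what tokens it would produce. This is only a minimal Python sketch, not Solr code: `fold` and `analyze` are hypothetical names, Unicode decomposition stands in for whichever folding filter is actually used, and it assumes the folding step really does skip the keyword-marked copy (the point the message says still needs checking).

```python
import unicodedata


def fold(token: str) -> str:
    """Strip combining marks; a stand-in for a real folding filter."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(tokens):
    """Return (position, term) pairs: original plus folded copy, deduplicated."""
    out = []
    for pos, tok in enumerate(tokens):
        # Step 1 (keyword repeat): emit the token twice at the same position.
        # Step 2 (folding): only the second, unmarked copy is folded (assumption).
        candidates = [tok, fold(tok)]
        seen = set()
        for term in candidates:
            # Step 3 (remove duplicates): drop a copy that is identical to one
            # already emitted at this position, e.g. for unaccented words.
            if term not in seen:
                seen.add(term)
                out.append((pos, term))
    return out


if __name__ == "__main__":
    for position, term in analyze(["k\u00f6ket", "bord"]):
        print(position, term)
```

An accented word yields two terms at the same position, while an unaccented word yields one, which is the behaviour the combination is meant to give without custom coding.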

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Toke, Thank you! That makes a lot of sense. In other news -- we just had a meeting where we decided to try out a hybrid strategy. I'd love to know what you & everyone else thinks... - Since we are concerned with the overhead created by "double-fielding" all tokens per language (because I'm not

Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Toke Eskildsen
Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote: > Do you find that searching over both the original title field and the > normalized title > field increases the time it takes for your search engine to retrieve results? It is not something we have measured as that index is fast enough (which

Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Toke, Do you find that searching over both the original title field and the normalized title field increases the time it takes for your search engine to retrieve results? -- Audrey Lorberfeld Data Scientist, w3 Search Digital Workplace Engineering CIO, Finance and Operations IBM audrey.lorberf

Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Languages are the best. Thank you all so much! -- Audrey Lorberfeld Data Scientist, w3 Search Digital Workplace Engineering CIO, Finance and Operations IBM audrey.lorberf...@ibm.com On 8/30/19, 4:09 PM, "Walter Underwood" wrote: The right transliteration for accents is language-dependen

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thank you, Erick! -- Audrey Lorberfeld Data Scientist, w3 Search Digital Workplace Engineering CIO, Finance and Operations IBM audrey.lorberf...@ibm.com On 8/30/19, 3:49 PM, "Erick Erickson" wrote: It Depends (tm). In this case on how sophisticated/precise your users are. If your users

Re: Multi-lingual Search & Accent Marks

2019-08-31 Thread Walter Underwood
> On Aug 31, 2019, at 12:00 PM, Toke Eskildsen wrote: > > Whenever we do this normalisation, we index two versions in our index: A very > lightly normalised (lowercased) field and a heavily normalised field: If a > record has a title "Köket" (kitchen in Swedish), we store title_orig:köket > an
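The two-field approach quoted above can be sketched in a few lines of Python. The name `title_orig` comes from the quoted message; the second field's name is cut off in the snippet, so `title_folded` here is a hypothetical placeholder, and stripping combining marks is only an assumed stand-in for the "heavy" normalisation.

```python
import unicodedata


def light(text: str) -> str:
    """Very light normalisation: lowercase only."""
    return text.lower()


def heavy(text: str) -> str:
    """Assumed heavy normalisation: lowercase, then strip combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_document(title: str) -> dict:
    """Build both versions of the title for indexing."""
    # "title_folded" is a hypothetical field name used for illustration.
    return {"title_orig": light(title), "title_folded": heavy(title)}


def matching_fields(doc: dict, query: str) -> list:
    """List the fields in which the query matches the stored title."""
    hits = []
    if light(query) == doc["title_orig"]:
        hits.append("title_orig")
    if heavy(query) == doc["title_folded"]:
        hits.append("title_folded")
    return hits


if __name__ == "__main__":
    doc = to_document("K\u00f6ket")
    print(doc)
    print(matching_fields(doc, "koket"))
    print(matching_fields(doc, "K\u00f6ket"))
```

A query typed without the accent matches only the heavily normalised field, while an exact query matches both, which lets a ranker prefer exact matches.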

Re: Multi-lingual Search & Accent Marks

2019-08-31 Thread Toke Eskildsen
Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote: > Just wanting to test the waters here – for those of you with search engines > that index multiple languages, do you use ASCII-folding in your schema? Our primary search engine is for Danish users, with sources being bibliographic records wit

Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Walter Underwood
The right transliteration for accents is language-dependent. In English, a diaeresis can be stripped because it is only used to mark neighboring vowels as independently pronounced. In German, the “typewriter umlaut” adds an “e”. English: coöperate -> cooperate German: Glück -> Glueck Some stemm
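The two examples above can be reproduced with a small language-aware function. This is a rough Python sketch, not a Solr filter: `transliterate` and `GERMAN_UMLAUTS` are hypothetical names, only the umlauted vowels are mapped, and stripping any remaining marks after the German mapping is an assumption made for illustration.

```python
import unicodedata

# "Typewriter umlaut" mapping for German: the umlaut becomes a trailing "e".
GERMAN_UMLAUTS = {
    "\u00e4": "ae",  # a with diaeresis
    "\u00f6": "oe",  # o with diaeresis
    "\u00fc": "ue",  # u with diaeresis
    "\u00c4": "Ae",
    "\u00d6": "Oe",
    "\u00dc": "Ue",
}


def transliterate(text: str, lang: str) -> str:
    """Fold accents in a language-dependent way ("de" or anything else)."""
    # Compose first so each umlauted vowel is a single code point.
    text = unicodedata.normalize("NFC", text)
    if lang == "de":
        text = "".join(GERMAN_UMLAUTS.get(ch, ch) for ch in text)
    # For other languages (and leftover marks), simply strip the diacritics.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(transliterate("co\u00f6perate", "en"))
    print(transliterate("Gl\u00fcck", "de"))
```

The same character produces different output depending on the language code, which is why a single language-agnostic folding step cannot be right for both cases.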

Re: Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Erick Erickson
It Depends (tm). In this case on how sophisticated/precise your users are. If your users are exclusively extremely conversant in the language and are expected to have keyboards that allow easy access to all the accents… then I might leave them in. In some cases removing them can change the meani

Re: Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Atita, Thanks for that insight! As the conversation has progressed, we are now leaning towards not having the ASCII-folding filter in our pipelines in order to keep marks like umlauts and tildes. Instead, we might add acute and grave accents to a file pointed at by the MappingCharFilterFactory

Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Atita Arora
We work on a German index; we neutralize accents before indexing, i.e. umlauts to 'ae', 'ue', etc., and we do the same at query time for an appropriate match. On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote: > Hi All, > > Just wanting to test the waters her

Re: Multi-lingual search

2016-02-09 Thread Modassar Ather
And what does proximity search exactly mean? A proximity search means searching for terms that occur within a given distance of each other. E.g. Search for a document which has java within 3 words of network. field:"java network"~3 So the above query will match any document having a distance of 3 by its position between
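The idea of matching on token positions can be illustrated with a simplified Python check. This is a sketch of the concept only: `within_distance` is a hypothetical helper that compares whitespace-token positions, and it does not reproduce how Solr/Lucene actually computes slop for a query such as `"java network"~3`, so the numbers do not map one-to-one.

```python
def within_distance(text: str, first: str, second: str, max_distance: int) -> bool:
    """True if some occurrence of `first` and `second` lies at most
    `max_distance` token positions apart (in either order)."""
    tokens = text.lower().split()
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


if __name__ == "__main__":
    print(within_distance("java is a network language", "java", "network", 3))
    print(within_distance("java is a very fast network", "java", "network", 3))
```

Because the check depends only on positions within one field's token stream, it works the same way whichever field or core layout holds that stream, as long as the analysis chain preserves positions.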

RE: Multi-lingual search

2016-02-08 Thread vidya
Hi, can I implement proximity search if I use: > separate core per language > field per language > multilingual field that supports all languages? And what exactly does proximity search mean? Should searching for the word "walk" when "walking" is indexed fetch and display the record? It will be included i

RE: Multi-lingual search

2016-02-08 Thread vidya
Hi, I need to search in these languages, including proximity search: 1. Malay 2. Tamil 3. Bahasa Indonesia 4. Vietnamese 5. Cantonese. Will IndicNormalizationFilter work fine, or should I use another filter? Help me if you have already worked on it or have any idea. Thanks in advance -- View this mess

RE: Multi-lingual search

2016-02-02 Thread Allison, Timothy B.
Three basic options: 1) one generic field that handles non-whitespace languages and normalization robustly (downside: no language-specific stopwords, stemming, etc.) 2) one field per language (hope lang id works and that you don't have many multilingual docs) 3) one Solr core per language (ditto)

Re: Multi-lingual search

2016-02-02 Thread Scott Stults
The IndicNormalizationFilter appears to work with Tamil. Is it not working for you? k/r, Scott On Mon, Feb 1, 2016 at 8:34 AM, vidya wrote: > Hi > > My use case is to index and able to query different languages in solr > which > are not in-built languages supported by solr. How can i implemen