Re: Multi-lingual Search & Accent Marks

Walter Underwood Fri, 30 Aug 2019 13:09:21 -0700

The right transliteration for accents is language-dependent. In English, a 
diaeresis can be stripped because it is only used to mark neighboring vowels as 
independently pronounced. In German, the “typewriter umlaut” adds an “e”.


English: coöperate -> cooperate
German: Glück -> Glueck
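
As a rough illustration of the two rules above, here is a minimal Python sketch. The names GERMAN_TYPEWRITER, strip_marks and fold are made up for this example and do not come from Solr or any stemmer; the mapping table covers only the umlauts and is an assumption, not a complete German transliteration.

```python
import unicodedata

# Hypothetical table for German "typewriter umlauts" (illustrative, not exhaustive).
GERMAN_TYPEWRITER = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})

def strip_marks(text: str) -> str:
    """Decompose to NFD and drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def fold(text: str, lang: str) -> str:
    """Fold accents using a language-dependent rule."""
    # Compose first so a base letter plus combining diaeresis matches the table.
    text = unicodedata.normalize("NFC", text)
    if lang == "de":
        text = text.translate(GERMAN_TYPEWRITER)
    # Anything left over (or any non-German text) just loses its marks.
    return strip_marks(text)

print(fold("coöperate", "en"))  # cooperate
print(fold("Glück", "de"))      # Glueck
```

The same function would have to run at both index and query time so the two sides agree.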

Some stemmers will handle the typewriter umlauts for you. The InXight stemmers 
used to do that.

The English diaeresis is a fussy usage, but it does occur in text. For years, 
MS Word corrected “naive” to “naïve”. There may even be a curse associated with 
its usage.

https://www.newyorker.com/culture/culture-desk/the-curse-of-the-diaeresis

In German, there are corner cases where just stripping the umlaut changes one 
word into another, like schön/schon.
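
A small Python sketch of that collision, assuming a plain "drop the combining marks" fold on one side and a hand-written typewriter-umlaut replacement on the other. Both helper names are hypothetical and only handle lowercase ö, which is all this example needs.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Drop combining marks after NFD decomposition (plain accent stripping)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def typewriter_umlaut(text: str) -> str:
    """Hypothetical helper: rewrite ö as oe (only the case needed here)."""
    return unicodedata.normalize("NFC", text).replace("ö", "oe")

# Plain stripping maps both words to the same term.
print(strip_marks("schön"), strip_marks("schon"))              # schon schon

# The typewriter form keeps them distinct.
print(typewriter_umlaut("schön"), typewriter_umlaut("schon"))  # schoen schon
```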

Isn’t language fun?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 30, 2019, at 12:48 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> It Depends (tm). In this case on how sophisticated/precise your users are. If 
> your users are exclusively extremely conversant in the language and are 
> expected to have keyboards that allow easy access to all the accents… then I 
> might leave them in. In some cases removing them can change the meaning of a 
> word.
> 
> That said, most installations I’ve seen remove them. They’re still present in 
> any returned stored field so the doc looks good. And then you bypass all the 
> nonsense about perhaps ingesting a doc that “somehow” had accents removed 
> and/or people not putting accents in their search and the like.
> 
> MappingCharFilterFactory works.
> 
>> On Aug 30, 2019, at 1:54 PM, Audrey Lorberfeld - audrey.lorberf...@ibm.com 
>> <audrey.lorberf...@ibm.com> wrote:
>> 
>> Atita,
>> 
>> Thanks for that insight! 
>> 
>> As the conversation has progressed, we are now leaning towards not having 
>> the ASCII-folding filter in our pipelines in order to keep marks like 
>> umlauts and tildes. Instead, we might add acute and grave accents to a file 
>> pointed at by the MappingCharFilterFactory to simply strip those more common 
>> accent marks...
>> 
>> Any other opinions are welcome!
>> 
>> -- 
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> Digital Workplace Engineering
>> CIO, Finance and Operations
>> IBM
>> audrey.lorberf...@ibm.com
>> 
>> 
>> On 8/30/19, 10:27 AM, "Atita Arora" <atitaar...@gmail.com> wrote:
>> 
>>   We work on a German index. We neutralize accents before indexing, i.e.
>>   umlauts to 'ae', 'ue', etc., and we do the same at query time for an
>>   appropriate match.
>> 
>>   On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
>>   <audrey.lorberf...@ibm.com> wrote:
>> 
>>> Hi All,
>>> 
>>> Just wanting to test the waters here – for those of you with search
>>> engines that index multiple languages, do you use ASCII-folding in your
>>> schema? We are onboarding Spanish documents into our index right now and
>>> keep going back and forth on whether we should preserve accent marks. From
>>> our query logs, it seems people generally do not include accents when
>>> searching, but you never know…
>>> 
>>> Thank you in advance for sharing your experiences!
>>> 
>>> --
>>> Audrey Lorberfeld
>>> Data Scientist, w3 Search
>>> Digital Workplace Engineering
>>> CIO, Finance and Operations
>>> IBM
>>> audrey.lorberf...@ibm.com
>>> 
>>> 
>> 
>> 
>
