The right transliteration for accents is language-dependent. In English, a diaeresis can be stripped because it is only used to mark neighboring vowels as independently pronounced. In German, the “typewriter umlaut” adds an “e”.
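
A minimal Python sketch of this language-dependent folding, using only the standard library. The names `GERMAN_MAP`, `strip_marks`, and `fold` are hypothetical, and the mapping table is an assumption for illustration; it is not what any particular stemmer or Solr filter does.

```python
import unicodedata

# Hypothetical "typewriter umlaut" table (an assumption, not a library's rules).
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def strip_marks(text):
    """Drop every combining mark, keeping the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def fold(text, lang):
    """Fold accents using a per-language rule ("de" expands umlauts first)."""
    if lang == "de":
        # Compose first so that "u" + combining diaeresis matches the table.
        text = unicodedata.normalize("NFC", text).translate(GERMAN_MAP)
    return strip_marks(text)


print(fold("coöperate", "en"))  # cooperate
print(fold("Glück", "de"))      # Glueck
print(fold("schön", "en"))      # schon (collides with the word "schon")
print(fold("schön", "de"))      # schoen
```

The same rule would have to run at both index time and query time for terms to match.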

English: coöperate -> cooperate
German: Glück -> Glueck

Some stemmers will handle the typewriter umlauts for you. The InXight stemmers used to do that.

The English diaeresis is a fussy usage, but it does occur in text. For years, MS Word corrected “naive” to “naïve”. There may even be a curse associated with its usage.

https://www.newyorker.com/culture/culture-desk/the-curse-of-the-diaeresis

In German, there are corner cases where just stripping the umlaut changes one word into another, like schön/schon.

Isn’t language fun?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 30, 2019, at 12:48 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> It Depends (tm). In this case on how sophisticated/precise your users are. If
> your users are exclusively extremely conversant in the language and are
> expected to have keyboards that allow easy access to all the accents… then I
> might leave them in. In some cases removing them can change the meaning of a
> word.
>
> That said, most installations I’ve seen remove them. They’re still present in
> any returned stored field so the doc looks good. And then you bypass all the
> nonsense about perhaps ingesting a doc that “somehow” had accents removed
> and/or people not putting accents in their search and the like.
>
> MappingCFF works..
>
>> On Aug 30, 2019, at 1:54 PM, Audrey Lorberfeld - audrey.lorberf...@ibm.com
>> <audrey.lorberf...@ibm.com> wrote:
>>
>> Atita,
>>
>> Thanks for that insight!
>>
>> As the conversation has progressed, we are now leaning towards not having
>> the ASCII-folding filter in our pipelines in order to keep marks like
>> umlauts and tildes. Instead, we might add acute and grave accents to a file
>> pointed at by the MappingCharFilterFactory to simply strip those more common
>> accent marks...
>>
>> Any other opinions are welcome!
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> Digital Workplace Engineering
>> CIO, Finance and Operations
>> IBM
>> audrey.lorberf...@ibm.com
>>
>>
>> On 8/30/19, 10:27 AM, "Atita Arora" <atitaar...@gmail.com> wrote:
>>
>>    We work on a German index. We neutralize accents before indexing, i.e.
>>    umlauts to 'ae', 'ue', etc., and we do the same at query time for an
>>    appropriate match.
>>
>>    On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
>>    <audrey.lorberf...@ibm.com> wrote:
>>
>>> Hi All,
>>>
>>> Just wanting to test the waters here – for those of you with search
>>> engines that index multiple languages, do you use ASCII-folding in your
>>> schema? We are onboarding Spanish documents into our index right now and
>>> keep going back and forth on whether we should preserve accent marks. From
>>> our query logs, it seems people generally do not include accents when
>>> searching, but you never know…
>>>
>>> Thank you in advance for sharing your experiences!
>>>
>>> --
>>> Audrey Lorberfeld
>>> Data Scientist, w3 Search
>>> Digital Workplace Engineering
>>> CIO, Finance and Operations
>>> IBM
>>> audrey.lorberf...@ibm.com
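
The selective approach quoted above (strip acute and grave accents, keep umlauts and tildes) can be prototyped outside Solr before committing to a mapping file. This is only a rough Python sketch of the idea, not the MappingCharFilterFactory file format; `STRIP` and `strip_acute_grave` are hypothetical names.

```python
import unicodedata

# Combining marks to remove: U+0300 (grave) and U+0301 (acute).
# Everything else (diaeresis, tilde, cedilla, ...) is left in place.
STRIP = {"\u0300", "\u0301"}


def strip_acute_grave(text):
    """Remove only acute and grave accents, preserving all other marks."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in STRIP)
    # Recompose so that kept marks (e.g. the tilde in "ñ") form single characters.
    return unicodedata.normalize("NFC", kept)


print(strip_acute_grave("canción"))  # cancion
print(strip_acute_grave("año"))      # año
print(strip_acute_grave("Glück"))    # Glück
```

Running query-log terms through a function like this is one way to check which words would collide before changing the analysis chain.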