Generally, the accented version will have a higher IDF, so it will score higher.
wunder

On 3/11/08 8:44 AM, "Renaud Waldura" <[EMAIL PROTECTED]> wrote:

> Peter:
>
> Very interesting. To take care of the issue you mention, could you add
> multiple "synonyms" with progressively fewer accents?
>
> E.g. you'd index "préférence" as 4 tokens:
> préférence (unchanged)
> preférence (stripped one accent)
> préference (stripped the other accent)
> preference (stripped both accents)
>
> Or does it yield too many tokens to be useful?
>
> And how does this take care of scoring? Do you get a higher score with a
> closer match?
>
> -----Original Message-----
> From: Binkley, Peter [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 11, 2008 8:37 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Accented search
>
> We've done this in a pre-Solr Lucene context by using the position
> increment: when a token contains accented characters, you add a stripped
> version of that token with a zero increment, so that for matching purposes
> the original and the stripped version are at the same position. Accents are
> not stripped from queries. The effect is that an accented search matches
> your Doc A, and an unaccented search matches Docs A and B. We do that after
> lower-casing the token.
>
> There are some limitations: users might start to expect that they can freely
> add accents to restrict their search to accented hits, but if they don't
> match the accents exactly they won't get any hits: e.g. if a word contains
> two accented characters and the user only accents one of them in their
> query, they won't match the accented or the unaccented version.
>
> Peter
>
> Peter Binkley
> Digital Initiatives Technology Librarian
> Information Technology Services
> 4-30 Cameron Library
> University of Alberta Libraries
> Edmonton, Alberta
> Canada T6G 2J8
> Phone: (780) 492-3743
> Fax: (780) 492-9243
> e-mail: [EMAIL PROTECTED]
>
> ~ The code is willing, but the data is weak. ~
>
> -----Original Message-----
> From: climbingrose [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 10, 2008 10:01 PM
> To: solr-user@lucene.apache.org
> Subject: Accented search
>
> Hi guys,
>
> I'm running into some problems with an accented (UTF-8) language. I'd love
> to hear some ideas about how to use Solr with such languages. Basically, I
> want to achieve what Google did with UTF-8 languages.
>
> My requirements include:
> 1) Accent-insensitive search and proper highlighting:
> For example, we have 2 documents:
>
> Doc A (title:Lập Trình Viên)
> Doc B (title:Lap Trinh Vien)
>
> if the user enters "Lập Trình Viên", then Doc B is also matched and "Lập
> Trình Viên" is highlighted.
> On the other hand, if the query is "Lap Trinh Vien", Doc A is also
> matched.
> 2) Assign proper scores to accented or non-accented searches:
> if the user enters "Lập Trình Viên", then Doc A should be given higher
> score than Doc B.
> if the query is "Lap Trinh Vien", Doc A should be given higher score.
>
> Any ideas guys? Thanks in advance!
>
> --
> Regards,
>
> Cuong Hoang
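
For anyone reading this thread later, the zero-position-increment approach Peter describes can be sketched outside Lucene roughly as follows. This is only a toy illustration in plain Python, not Lucene or Solr code: the names `strip_accents`, `analyze`, `build_index` and `phrase_search` are made up for the example, the index is a simple in-memory dict, accent stripping is assumed to mean "drop Unicode combining marks after NFD decomposition", and Doc A's title is taken to be "Lập Trình Viên".

```python
import unicodedata


def strip_accents(token):
    """Drop combining marks after canonical decomposition (assumed folding rule)."""
    decomposed = unicodedata.normalize("NFD", token)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def analyze(text):
    """Index-time analysis: return (term, position_increment) pairs.

    Each lower-cased token is emitted with increment 1. If it contains
    accents, its fully stripped form follows with increment 0, so both
    terms share the same position.
    """
    pairs = []
    for word in text.split():
        token = unicodedata.normalize("NFC", word.lower())
        pairs.append((token, 1))
        stripped = strip_accents(token)
        if stripped != token:
            pairs.append((stripped, 0))
    return pairs


def build_index(docs):
    """Build a toy positional index: term -> doc_id -> set of positions."""
    index = {}
    for doc_id, text in docs.items():
        position = -1
        for term, increment in analyze(text):
            position += increment
            index.setdefault(term, {}).setdefault(doc_id, set()).add(position)
    return index


def phrase_search(index, query):
    """Query-time: lower-case only, accents are NOT stripped from the query."""
    terms = [unicodedata.normalize("NFC", w.lower()) for w in query.split()]
    hits = set()
    if not terms:
        return hits
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            if all(start + i in index.get(t, {}).get(doc_id, ())
                   for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {
    "A": "Lập Trình Viên",
    "B": "Lap Trinh Vien",
    "C": "préférence",
}
index = build_index(docs)
print(phrase_search(index, "Lập Trình Viên"))   # accented query
print(phrase_search(index, "Lap Trinh Vien"))   # unaccented query
print(phrase_search(index, "preférence"))       # only one of two accents typed
```

In this sketch the accented query reaches only Doc A, the unaccented query reaches both A and B, and the half-accented "preférence" finds nothing, which is the limitation Peter mentions. Renaud's suggestion would amount to emitting every partially stripped variant at increment 0 instead of just the fully stripped one, at the cost of up to 2^n terms for a word with n accented characters.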