Re: Accented search

Walter Underwood Tue, 11 Mar 2008 09:51:07 -0700

Generally, the accented version will have a higher IDF, so it
will score higher.
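
To make the IDF point concrete, here is a rough sketch in Python. It
assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1))
and made-up document counts. The function name and the numbers are
illustrative only and are not taken from any real index.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: rarer terms get a larger value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical counts. With stacked tokens, the stripped term is indexed for
# every document that has either spelling, so its docFreq can only be equal
# to or larger than that of the accented term.
num_docs = 1000
df_accented = 10   # docs containing the accented spelling
df_stripped = 40   # docs containing either spelling

idf_accented = classic_idf(df_accented, num_docs)
idf_stripped = classic_idf(df_stripped, num_docs)

print(f"accented idf: {idf_accented:.2f}")
print(f"stripped idf: {idf_stripped:.2f}")
```

Under these assumptions an accented query term hits the rarer term, so the
accented document gets the larger IDF contribution.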


wunder

On 3/11/08 8:44 AM, "Renaud Waldura" <[EMAIL PROTECTED]>
wrote:

> Peter:
> 
> Very interesting. To take care of the issue you mention, could you add
> multiple "synonyms" with progressively fewer accents?
> 
> E.g. you'd index "préférence" as 4 tokens:
>  préférence (unchanged)
>  preférence (stripped one accent)
>  préference (stripped the other accent)
>  preference (stripped both accents)
> 
> Or does it yield too many tokens to be useful?
> 
> And how does this take care of scoring? Do you get a higher score with a
> closer match?
> 
> 
>  
> 
> -----Original Message-----
> From: Binkley, Peter [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 11, 2008 8:37 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Accented search
> 
> We've done this in a pre-Solr Lucene context by using the position
> increment: when a token contains accented characters, you add a stripped
> version of that token with a zero increment, so that for matching purposes
> the original and the stripped version are at the same position. Accents are
> not stripped from queries. The effect is that an accented search matches
> your Doc A, and an unaccented search matches Docs A and B. We do that after
> lower-casing the token.
> 
> There are some limitations: users might start to expect that they can freely
> add accents to restrict their search to accented hits, but if they don't
> match the accents exactly they won't get any hits: e.g. if a word contains
> two accented characters and the user only accents one of them in their
> query, they won't match the accented or the unaccented version.
> 
> Peter
> 
> Peter Binkley
> Digital Initiatives Technology Librarian
> Information Technology Services
> 4-30 Cameron Library
> University of Alberta Libraries
> Edmonton, Alberta
> Canada T6G 2J8
> Phone: (780) 492-3743
> Fax: (780) 492-9243
> e-mail: [EMAIL PROTECTED]
> 
> ~ The code is willing, but the data is weak. ~
> 
> 
> -----Original Message-----
> From: climbingrose [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 10, 2008 10:01 PM
> To: solr-user@lucene.apache.org
> Subject: Accented search
> 
> Hi guys,
> 
> I'm running into some problems with accented (UTF-8) languages. I'd love to
> hear some ideas about how to use Solr with those languages. Basically, I
> want to achieve what Google did with UTF-8 languages.
> 
> My requirements include:
> 1) Accent-insensitive search and proper highlighting:
>   For example, we have 2 documents:
> 
>   Doc A (title:Lập Trình Viên)
>   Doc B (title:Lap Trinh Vien)
> 
>   if the user enters "Lập Trình Viên", then Doc B is also matched and "Lập
> Trình Viên" is highlighted.
>   On the other hand, if the query is "Lap Trinh Vien", Doc A is also
> matched.
> 2) Assign proper scores to accented or non-accented searches:
>   if the user enters "Lập Trình Viên", then Doc A should be given a higher
> score than Doc B.
>   if the query is "Lap Trinh Vien", Doc A should be given a higher score.
> 
> Any ideas guys? Thanks in advance!
> 
> --
> Regards,
> 
> Cuong Hoang
> 
>
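
The position-increment approach Peter describes above can be sketched
roughly as follows. This is plain Python, not Lucene code: the function
names, the whitespace tokenizer, and the "all terms must match" rule are
assumptions made for illustration. Accents are stripped by Unicode
decomposition, which is only one possible way to do it.

```python
import unicodedata


def strip_accents(token):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token):
    """Lower-case first, as in Peter's description; accents are kept."""
    return unicodedata.normalize("NFC", token.lower())


def analyze_for_index(text):
    """Return (term, position_increment) pairs for one field value."""
    out = []
    for raw in text.split():
        token = normalize(raw)
        out.append((token, 1))            # original token advances position
        stripped = strip_accents(token)
        if stripped != token:
            out.append((stripped, 0))     # stacked at the same position
    return out


def positions(text):
    """Build a tiny term -> set of positions map for one document."""
    index, pos = {}, -1
    for term, increment in analyze_for_index(text):
        pos += increment
        index.setdefault(term, set()).add(pos)
    return index


def matches(query, text):
    """Query terms are lower-cased but NOT stripped of accents."""
    index = positions(text)
    return all(normalize(term) in index for term in query.split())


doc_a = "Lập Trình Viên"
doc_b = "Lap Trinh Vien"

print(matches("Lập Trình Viên", doc_a), matches("Lập Trình Viên", doc_b))
print(matches("Lap Trinh Vien", doc_a), matches("Lap Trinh Vien", doc_b))
```

In this sketch a partially accented query such as "preférence" matches
neither stacked form of "préférence", which is the limitation Peter
mentions. Indexing every partial variant, as Renaud suggests, would need
2^n stacked tokens for a word with n accented characters.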
