The problem with wildcard searches is that the input is not
analyzed. For English, this might not be such a problem (unless you
expect case-insensitive search). But then again, you don't get that with
'like', either. Ngrams bring that and more.

What I think is often forgotten when comparing 'like' and Solr search
is:
Solr's analyzers allow not only for case-insensitive search but also for
other analysis, such as removing diacritics, and this is also applied when
sorting (in the DB you would have to create a separate index as well if you
want that).

Say you have the following names:
'Van Hinden'
'van Hinden'
'Música'
'Musil'

like 'mu%' - no hits
like 'Mu%' - 1 hit
like 'van%' - 1 hit
like 'hin%' - no hits

with Solr, using a whitespace or standard tokenizer, ngrams, and a
diacritics and a lowercase filter (no wildcard search):
'mu'/'Mu' - 2 hits sorted ignoring case and diacritics
'van' - 2 hits
'hin' - 2 hits


(This is written down from experience. I haven't checked those examples
explicitly.)
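
For anyone who wants to sanity-check the hit counts above without a
running Solr, here is a minimal Python sketch. It only simulates the idea
and is not Solr code: 'like' is modelled as a case-sensitive prefix match
on the whole field value (which assumes a case-sensitive collation in the
DB), and the Solr side is modelled as whitespace tokenization, lowercasing,
diacritic removal and ngrams of length 2 to 5. The function names and the
ngram sizes are my own assumptions for illustration.

```python
import unicodedata

NAMES = ["Van Hinden", "van Hinden", "Música", "Musil"]


def like_prefix(names, prefix):
    """Simulate a case-sensitive SQL LIKE 'prefix%' on the whole value."""
    return [n for n in names if n.startswith(prefix)]


def fold(text):
    """Lowercase and strip diacritics (decompose, drop combining marks)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


def ngrams(token, min_len=2, max_len=5):
    """All substrings of token with length min_len..max_len (assumed sizes)."""
    return {
        token[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(token) - n + 1)
    }


def index_terms(name):
    """Whitespace-tokenize, fold each token, and collect its ngrams."""
    terms = set()
    for token in name.split():
        terms |= ngrams(fold(token))
    return terms


# Index-time analysis: one set of ngram terms per name.
INDEX = {name: index_terms(name) for name in NAMES}


def search(query):
    """Fold the query (no ngrams at query time) and look it up as one term.

    Hits are sorted by their folded form, i.e. ignoring case and diacritics.
    """
    q = fold(query)
    hits = [name for name, terms in INDEX.items() if q in terms]
    return sorted(hits, key=fold)


if __name__ == "__main__":
    for prefix in ("mu", "Mu", "van", "hin"):
        print("like %r%%:" % prefix, like_prefix(NAMES, prefix))
    for query in ("mu", "Mu", "van", "hin"):
        print("analyzed %r:" % query, search(query))
```

Note that in this sketch a query longer than the maximum ngram length
finds nothing, because the query is matched as a single term against the
indexed ngrams.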

Cheers,
Chantal



On Fri, 2011-12-30 at 02:00 +0100, Chris Hostetter wrote:
> : Thanks. I know I'll be able to utilize some of Solr's free text 
> : searching capabilities in other search types in this project. The 
> : product manager wants this particular search to exactly mimic LIKE%.
>       ...
> : Ex: If I search "Albatross" I want "Albert" to be excluded completely, 
> : rather than having a low score.
> 
> please be specific about the types of queries you want. ie: we need more 
> than one example of the type of input you want to provide, the type of 
> matches you want to see for that input, and the type of matches you want 
> to get back.
> 
> in your first message you said you need to match company titles "pretty 
> exactly" but then seem to contradict yourself by saying that SQL's LIKE 
> command fits the bill -- even though the SQL LIKE command exists 
> specifically for inexact matches on field values.
> 
> Based on your one example above of Albatross, you don't need anything 
> special: don't use ngrams, don't use stemming, don't use fuzzy anything -- 
> just search for "Albatross" and it will match "Albatross" but not 
> "Albert".  if you want "Albatross" to match "Albatross Road" use some 
> basic tokenization.
> 
> If all you really care about is prefix searching (which seems suggested by 
> your "LIKE%" comment above, which i'm guessing is shorthand for something 
> similar to "LIKE 'ABC%'"), so that queries like "abc" and "abcd" both 
> match "abcdef" and "abcdzzzz" but neither of them match "xxxxabcdyyyy" 
> then just use prefix queries (ie: "abcd*") -- they should be plenty 
> efficient for your purposes.  you only need to worry about ngrams when you 
> want to efficiently match in the middle of a string. (ie: "TITLE LIKE 
> %ABC%")
> 
> 
> -Hoss
