Re: Wildcards and fuzzy/phonetic query

Erick Erickson Mon, 08 Oct 2012 12:35:33 -0700

To answer your first question, yes, you've got it right. If you define
a multiterm section in your fieldType, whatever you put in that section
gets applied whether the underlying class is MultiTermAware or not.
Which means you can shoot yourself in the foot really bad <G>...
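
One way to see how a phonetic filter in a multiterm section can misfire is to
encode a wildcard prefix and a full name separately and compare the codes. The
sketch below is only an illustration: it uses a simplified Soundex as a
stand-in encoder (an assumption, it is not the DoubleMetaphone filter from
this thread and calls nothing in Solr), and `prefix_survives` is a
hypothetical helper name.

```python
# Simplified American Soundex, used only as a stand-in phonetic encoder.
# It is NOT the DoubleMetaphone filter discussed in this thread.
_CODES = {ch: digit
          for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                 ("l", "4"), ("mn", "5"), ("r", "6"))
          for ch in letters}


def soundex(word: str) -> str:
    """Encode a word as a first letter plus up to three digits."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (result + "000")[:4]  # pad or truncate to four characters


def prefix_survives(prefix: str, full: str) -> bool:
    """True if the encoded prefix is still a prefix of the encoded full word."""
    return soundex(full).startswith(soundex(prefix))


if __name__ == "__main__":
    for prefix, full in (("eric", "erickson"), ("chr", "christian")):
        print(prefix, soundex(prefix), full, soundex(full),
              prefix_survives(prefix, full))
```

With this stand-in, "eric" and "erickson" get different four-character codes,
so a wildcard on the encoded prefix would miss the encoded full name. Here the
padding step is what breaks the prefix relationship; other algorithms may
break it in other ways, or not at all.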


Well, you have 6 or so possibilities out of the box...and all of them will
fail at times. Fuzzy searches will also fail at times. And so will most
anything else you try. The problem is these are algorithmic in nature
and there are just too many cases that don't fit; human language is
so endlessly variable....

Whether Middle Eastern names will work well with phonetic filters, well,
what's the input language? Are you indexing English (or Norwegian or...)
translations? In that case things should work OK since the phonetic variations
should be accounted for in the translations.

If you're indexing in different languages, you can apply different phonetic
filters on different fields, so you might be able to work it that way. But if
you're indexing multiple languages into a _single_ field, you'll have a lot of
other problems to solve before you start worrying about phonetics...

All I can really say is give it a try and see how well it works since "good"
search results are so domain dependent....

Fuzzy searches + wildcards. I don't think you can do that reasonably, but
I'm not entirely sure.
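
To make concrete what "fuzzy + wildcard" would have to mean for the "chr*"
example from the original question, here is a rough sketch done outside Solr.
It is not a Solr or Lucene API: `prefix_edit_distance` and
`fuzzy_prefix_matches` are hypothetical helpers, and the sample names and
edit limits are assumptions chosen for illustration.

```python
from typing import Iterable, List


def prefix_edit_distance(pattern: str, term: str) -> int:
    """Smallest Levenshtein distance between pattern and any prefix of term."""
    # prev[j] holds the distance between pattern[:i] and term[:j].
    prev = list(range(len(term) + 1))
    for i, p in enumerate(pattern, start=1):
        cur = [i]
        for j, t in enumerate(term, start=1):
            cur.append(min(prev[j] + 1,              # drop a pattern character
                           cur[j - 1] + 1,           # skip a term character
                           prev[j - 1] + (p != t)))  # match or substitute
        prev = cur
    # Taking the minimum over all j lets the term continue freely after the
    # matched prefix, which plays the role of the trailing wildcard.
    return min(prev)


def fuzzy_prefix_matches(pattern: str, terms: Iterable[str],
                         max_edits: int) -> List[str]:
    """Terms whose beginning is within max_edits of the pattern."""
    return [t for t in terms if prefix_edit_distance(pattern, t) <= max_edits]


if __name__ == "__main__":
    names = ["christian", "kristian", "anders"]
    print(fuzzy_prefix_matches("chr", names, max_edits=2))
```

In this sketch "kristian" only matches "chr" once two edits are allowed, and
the function scans every term it is given, so it shows the idea rather than
something that would scale to a real index.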

Best
Erick

On Mon, Oct 8, 2012 at 2:28 PM, Hågen Pihlstrøm Hasle
<haagenha...@gmail.com> wrote:
>
> I understand that I'm quickly reaching the boundaries of my Solr-competence 
> when I'm supposed to read about "Expert Level" concepts.. :)  I had already 
> read it once, but now I read it again. Twice.  And I'm not sure if I 
> understand it correctly..  So let me ask a follow-up question:
> If I define an analyzer of type multiterm, will every filter I include for 
> that analyzer be applied, even if it's not MultiTermAware?
>
> To complicate this further, I'm not really sure if phonetic filters are a good 
> match for our needs.  We search for names, and these names can come from all 
> over the world.  We use DoubleMetaphone, and Wikipedia says it "tries to 
> account for myriad irregularities in English of Slavic, Germanic, Celtic, 
> Greek, French, Italian, Spanish, Chinese, and other origin".  So I guess it's 
> quite good.  But how about names from the Middle East, Pakistan or India?  Is 
> DoubleMetaphone a good match also for names from these countries?  Are there 
> any better algorithms?
>
> How about fuzzy-searches and wildcards, are they impossible to combine?
>
> We actually do three queries for every search, one fuzzy, one phonetic and 
> one using ngram.  Because I don't have too much confidence in the phonetic 
> algorithm, I would really like to be able to combine fuzzy queries with 
> wildcards.. :)
>
>
> Regards, Hågen
>
>
> On Oct 8, 2012, at 6:09 PM, Erick Erickson wrote:
>
>> whether phonetic filters can be multiterm aware:
>>
>> I'd be leery of this, as I basically don't quite know how that would
>> behave. You'd have to ensure that the algorithms changed the
>> first parts of the words uniformly, regardless of what followed. I'm
>> pretty sure that _some_ phonetic algorithms do not follow this
>> pattern, i.e. eric wouldn't necessarily have the same beginning
>> as erickson. That said, some of the algorithms _may_ follow this
>> rule and might be OK candidates for being MultiTermAware....
>>
>> But, you don't need this in order to try it out. See the "Expert Level
>> Schema Possibilities" at:
>> http://searchhub.org/dev/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
>>
>> You can define your own analysis chain for wildcards as part of your
>> <fieldType> definition and include whatever you want, whether or not it's
>> MultiTermAware, and it will be applied at query time. Use the
>> <analyzer type="query"> entry as a basis. _But_ you shouldn't include
>> anything in this section that produces more than one output per input
>> token. Note, "token", not "field". I.e. a really bad candidate for this
>> section is WordDelimiterFilterFactory. If you use the admin/analysis page
>> (which you'll get to know intimately) and look at a type that has
>> WordDelimiterFilterFactory in its chain and put in something like
>> erickErickson1234, you'll see what I mean. Make sure to check the
>> "verbose" box....
>>
>> If you can determine that some of the phonetic algorithms _should_ be
>> MultiTermAware, please feel free to raise a JIRA and we can discuss... I
>> suspect it'll be on a case-by-case basis.
>>
>> Best
>> Erick
>>
>> On Mon, Oct 8, 2012 at 11:21 AM, Hågen Pihlstrøm Hasle
>> <haagenha...@gmail.com> wrote:
>>> Hi!
>>>
>>> I'm quite new to Solr, I was recently asked to help out on a project where 
>>> the previous "Solr-person" quit quite suddenly.  I've noticed that some of 
>>> our searches don't return the expected result, and I'm hoping you guys can 
>>> help me out.
>>>
>>> We've indexed a lot of names, and would like to search for a person in our 
>>> system using these names.  We previously used Oracle Text for this, and we 
>>> find that Solr is much faster.  So far so good! :)  But when we try 
>>> to use wildcards things start to go wrong.
>>>
>>> We're using Solr 3.4, and I see that some of our problems are solved in 
>>> 3.6.  Ref SOLR-2438:
>>> https://issues.apache.org/jira/browse/SOLR-2438
>>>
>>> But we would also like to be able to combine wildcards with fuzzy searches, 
>>> and wildcards with a phonetic filter.  I don't see anything about phonetic 
>>> filters in SOLR-2438 or SOLR-2921.  
>>> (https://issues.apache.org/jira/browse/SOLR-2921)
>>> Is it possible to make the phonetic filters MultiTermAware?
>>>
>>> Regarding fuzzy queries, in Oracle Text I can search for "chr%" ("chr*" in 
>>> Solr..) and find both christian and kristian.  As far as I understand, this 
>>> is not possible in Solr: WildcardQuery and FuzzyQuery cannot be combined.  
>>> Is this correct, or have I misunderstood anything?  Are there any 
>>> workarounds or filter-combinations I can use to achieve the same result?  
>>> I've seen people suggest using a boolean query to combine the two, but I 
>>> don't really see how that would solve my "chr*"-problem.
>>>
>>> As I mentioned earlier I'm quite new to this, so I apologize if what I'm 
>>> asking about only shows my ignorance..
>>>
>>>
>>> Regards, Hågen
>
