Re: edismax, phrase field gets ignored for keyword tokenizer

Vincenzo D'Amore Mon, 07 Nov 2016 07:43:36 -0800

Hi Stefan,

I think the problem is solr.KeywordTokenizerFactory.
This tokeniser does not split the string at all; it returns the whole
input unchanged, as a single token:


'+49 1234 12345678' -> '+49 1234 12345678'

On the other hand, edismax splits your query on whitespace, so you are
searching for '+49', '1234' and '12345678' separately, and none of these
terms matches the single token in your phone_number field.

Try a different tokenizer such as solr.StandardTokenizerFactory; this
should change your results.
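
As a rough sketch of that change (the field type name phone_number_std is
hypothetical, and I am assuming the rest of your schema stays as it is),
the analyzer could look something like this:

```xml
<!-- Hypothetical field type: same idea as phone_number, but tokenized -->
<fieldType name="phone_number_std" class="solr.TextField">
  <analyzer>
    <!-- Splits the input into separate tokens instead of keeping one -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With this analyzer, '+49 1234 12345678' should be indexed as the separate
tokens 49, 1234 and 12345678, which is consistent with the parsed query you
posted for the standard tokenizer.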

Best,
Vincenzo

On Mon, Nov 7, 2016 at 4:05 PM, Stefan Matheis <matheis.ste...@gmail.com>
wrote:

> I’m guessing that i’m missing something obvious here - so feel free to
> ask for more details and to point out other directions i should
> follow.
>
> the problem goes as follows: the input in one case might be a phone
> number (like +49 1234 12345678). since we’re using edismax the parts
> get split on whitespace - which is fine. bringing the same field
> (based on TextField) to the party (using qf) doesn’t change a thing.
>
> > responseHeader:
> >     params:
> >         q: '+49 1234 12345678'
> >         defType: edismax
> >         qf: person_mobile
> >         pf: person_mobile^5
> > debug:
> >     rawquerystring: '+49 1234 12345678'
> >     querystring: '+49 1234 12345678'
> >     parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49))
> >         DisjunctionMaxQuery((person_mobile:1234))
> >         DisjunctionMaxQuery((person_mobile:12345678)))
> >         ())/no_coord'
> >     parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234)
> >         (person_mobile:12345678)) ()'
>
> but .. as far as i was able to reduce the culprit, that only happens
> when i’m using solr.KeywordTokenizerFactory . as soon as i’m changing
> that to solr.StandardTokenizerFactory the phrase query appears as
> expected:
>
> > responseHeader:
> >     params:
> >         q: '+49 1234 12345678'
> >         defType: edismax
> >         qf: person_mobile
> >         pf: person_mobile^5
> > debug:
> >     rawquerystring: '+49 1234 12345678'
> >     querystring: '+49 1234 12345678'
> >     parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49))
> >         DisjunctionMaxQuery((person_mobile:1234))
> >         DisjunctionMaxQuery((person_mobile:12345678)))
> >         DisjunctionMaxQuery(((person_mobile:"49 1234 12345678")^5.0)))/no_coord'
> >     parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234)
> >         (person_mobile:12345678)) ((person_mobile:"49 1234 12345678")^5.0)'
>
> removing the + at the beginning doesn’t make a difference either
> (just mentioning since tokee already asked this on #solr, where i’ve
> brought up the question earlier)
>
> it’s absolutely possible i’m focusing on a very wrong assumption - but
> since switching the tokenizer does result in such a rather large
> behaviour change, i think something is spooky here.
>
> i’ve read older issues and posts from the list, some of them pointed
> out that it might be an optimization that edismax brings to the table -
> i didn’t find anything specific about that.
>
> oh, and btw: if that were working - my idea is to strip everything
> that is not a digit from a given phrase, to match the phone
> number, like this:
>
> > <fieldType name="phone_number" class="solr.TextField">
> >   <analyzer>
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.PatternReplaceFilterFactory" pattern="[^\d]"
> >             replacement=""/>
> >   </analyzer>
> > </fieldType>
>
> any thoughts? or wild guesses?
>
> Thanks, Stefan
>



-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251
