Re: edismax, phrase field gets ignored for keyword tokenizer

Vincenzo D'Amore Mon, 07 Nov 2016 08:09:20 -0800

If you don't want partial matches with edismax, you should use
StandardTokenizerFactory and tune the mm (minimum should match) parameter.
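
To make the mm suggestion concrete, here is a toy Python model (not Solr
code) of how a minimum-should-match percentage decides whether a
whitespace-split query matches a tokenized field. The names tokenize and
matches are hypothetical and made up for illustration, and the regex is only
a rough stand-in for StandardTokenizerFactory. The percentage is rounded
down, which is how the Solr documentation describes mm percentages.

```python
import re


def tokenize(text):
    # Rough stand-in for a standard-style tokenizer (assumption):
    # keep runs of letters/digits, drop '+' and whitespace.
    return re.findall(r"[0-9A-Za-z]+", text)


def matches(query, field_value, mm_percent=100):
    # Toy model of mm: count how many query terms occur in the field
    # and compare against the required minimum (rounded down, at least 1).
    query_terms = tokenize(query)
    indexed_terms = set(tokenize(field_value))
    required = max(1, len(query_terms) * mm_percent // 100)
    hits = sum(1 for term in query_terms if term in indexed_terms)
    return hits >= required


stored = "+49 1234 12345678"
print(matches("+49 1234 12345678", stored, mm_percent=100))
print(matches("+49 1234 99999999", stored, mm_percent=100))
print(matches("+49 1234 99999999", stored, mm_percent=50))
```

With mm at 100% every query term has to be present in the field, so a number
that differs in one block is rejected; lowering mm lets it through again.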


On Mon, Nov 7, 2016 at 4:50 PM, Stefan Matheis <matheis.ste...@gmail.com>
wrote:

> Vincenzo,
>
> thanks for the response - i know that the Keyword Tokenizer by
> itself does not do anything. as pointed out at the end of the initial
> mail, i’m applying a pattern replace for everything non-numeric to
> make it actually useful.
>
> and especially because of the tokenization based on whitespace i’d
> like to use the very same field once again as phrase field to work
> around this issue. Shawn mentioned in #solr in the meantime that there is
> SOLR-9185, which is similar and would be helpful, but is currently still
> very much in the works.
>
> Standard Tokenizer you’ve mentioned does split on whitespace - as
> edismax does by default in the first place. so i’m not sure how that
> would help? For now, i don’t want to have partial matches on phone
> numbers .. at least not yet.
>
> -Stefan
>
>
> On November 7, 2016 at 4:41:50 PM, Vincenzo D'Amore (v.dam...@gmail.com)
> wrote:
> > Hi Stefan,
> >
> > I think the problem is solr.KeywordTokenizerFactory.
> > This tokeniser does not split the string at all; it returns exactly
> > what you give it, as a single token.
> >
> > '+49 1234 12345678' -> '+49 1234 12345678'
> >
> > On the other hand, using edismax you are looking for '+49', '1234' and
> > '12345678' and none of these keywords match your phone_number field.
> >
> > Try using a different tokenizer like solr.StandardTokenizerFactory, this
> > should change your results.
> >
> > Bests,
> > Vincenzo
> >
> > On Mon, Nov 7, 2016 at 4:05 PM, Stefan Matheis
> > wrote:
> >
> > > I’m guessing that i’m missing something obvious here - so feel free to
> > > ask for more details and also point out other directions i should be
> > > following.
> > >
> > > the problem goes as follows: the input in one case might be a phone
> > > number (like +49 1234 12345678). since we’re using edismax the parts
> > > get split on whitespace - which is fine. bringing the same field
> > > (based on TextField) to the party (using qf) doesn’t change a thing.
> > >
> > > > responseHeader:
> > > >   params:
> > > >     q: '+49 1234 12345678'
> > > >     defType: edismax
> > > >     qf: person_mobile
> > > >     pf: person_mobile^5
> > > > debug:
> > > >   rawquerystring: '+49 1234 12345678'
> > > >   querystring: '+49 1234 12345678'
> > > >   parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49)) DisjunctionMaxQuery((person_mobile:1234)) DisjunctionMaxQuery((person_mobile:12345678))) ())/no_coord'
> > > >   parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234) (person_mobile:12345678)) ()'
> > >
> > > but .. as far as i was able to narrow down the culprit, that only
> > > happens when i’m using solr.KeywordTokenizerFactory. as soon as i
> > > change that to solr.StandardTokenizerFactory the phrase query appears
> > > as expected:
> > >
> > > > responseHeader:
> > > >   params:
> > > >     q: '+49 1234 12345678'
> > > >     defType: edismax
> > > >     qf: person_mobile
> > > >     pf: person_mobile^5
> > > > debug:
> > > >   rawquerystring: '+49 1234 12345678'
> > > >   querystring: '+49 1234 12345678'
> > > >   parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49)) DisjunctionMaxQuery((person_mobile:1234)) DisjunctionMaxQuery((person_mobile:12345678))) DisjunctionMaxQuery(((person_mobile:"49 1234 12345678")^5.0)))/no_coord'
> > > >   parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234) (person_mobile:12345678)) ((person_mobile:"49 1234 12345678")^5.0)'
> > >
> > > removing the + at the beginning, doesn’t make a difference either
> > > (just mentioning since tokee already asked this on #solr, where i’ve
> > > brought up the question earlier)
> > >
> > > it’s absolutely possible i’m focusing on a very wrong assumption - but
> > > since switching the tokenizer does result in such a rather large
> > > behaviour change, i think something is spooky here.
> > >
> > > i’ve read older issues and posts from the list, some of them pointed
> > > out that it might be an optimization that edismax brings to the table -
> > > but i didn’t find anything specific about that.
> > >
> > > oh, and btw: if that would be working - my idea is to drop out
> > > everything for a given phrase that is not a number, to match the phone
> > > number, like this:
> > >
> > > >
> > > >
> > > >
> > > > > > replacement=""/>
> > > >
> > > >
> > >
> > > any thoughts? or wild guesses?
> > >
> > > Thanks Stefan
> > >
> >
> >
> >
> > --
> > Vincenzo D'Amore
> > email: v.dam...@gmail.com
> > skype: free.dev
> > mobile: +39 349 8513251
> >
>



-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251
