Re: Tokenization at query time

Jack Krupansky Mon, 12 Aug 2013 07:02:32 -0700

Quoted phrases will be passed to the analyzer as one string, so in that case a whitespace tokenizer is needed.


-- Jack Krupansky

-----Original Message-----
From: Andrea Gazzarini

Sent: Monday, August 12, 2013 6:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Tokenization at query time

Hi Tanguy,
thanks for the fast response. What you are saying corresponds perfectly with
the behaviour I'm observing.
Now, other than having a big problem (I have several other fields both
in the pf and qf where spaces don't matter, field types like the
"text_en" field type in the example schema), what I'm wondering is:

"The query parser splits the input query on white spaces, and then each
token is analysed according to your configuration"

Is there a valid reason to declare a WhiteSpaceTokenizer in a query
analyzer? If the input query is already parsed (i.e. whitespace
tokenized) what is its effect?

Thank you very much for the help
Andrea

On 08/12/2013 12:37 PM, Tanguy Moal wrote:

Hello Andrea,
I think you are facing a rather common issue involving keyword tokenization and query parsing in Lucene.

The query parser splits the input query on white spaces, and then each token is analysed according to your configuration. So queries containing whitespace won't behave as expected, because each token is analysed separately. Consequently, the catenated version of the reference cannot be generated.

I think you could try surrounding your query with double quotes, or escaping the space characters in your query with a backslash, so that the whole sequence is analysed by the same analyser and the catenation occurs.

You should be aware that this approach has a drawback: you will probably not be able to combine the search for Mag. 778 G 69 with other words in other fields unless you are able to identify which spaces are to be escaped.

For example, if the input query is:
Awesome Mag. 778 G 69
you would want to transform it to:
Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
or
Awesome "Mag. 778 G 69" // only the reference is turned into a phrase query
Do you get the point?
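
To make the splitting behaviour concrete, here is a rough Python sketch. It is not Solr code: `analyze`, `split_query` and `analyzed_terms` are hypothetical helpers that only imitate, for ASCII input, the field type from the original question (keyword tokenizer, lowercasing, catenateAll="1") and a parser that splits on whitespace unless the space is escaped or inside double quotes.

```python
import re

def analyze(text):
    """Imitate the field's analyzer: treat the text as one token, lowercase it,
    then concatenate its letter and digit runs (ASCII only, an assumption)."""
    return "".join(re.findall(r"[a-z]+|[0-9]+", text.lower()))

def split_query(query):
    """Split on whitespace, except inside double quotes or after a backslash."""
    chunks, current, in_quotes, i = [], "", False, 0
    while i < len(query):
        ch = query[i]
        if ch == "\\" and i + 1 < len(query):
            current += query[i + 1]  # keep the escaped character literally
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes  # quotes group text, they are not kept
        elif ch.isspace() and not in_quotes:
            if current:
                chunks.append(current)
            current = ""
        else:
            current += ch
        i += 1
    if current:
        chunks.append(current)
    return chunks

def analyzed_terms(query):
    """Each chunk produced by the split is analysed on its own."""
    return [analyze(chunk) for chunk in split_query(query)]

print(analyzed_terms("Mag. 778 G 69"))               # ['mag', '778', 'g', '69']
print(analyzed_terms('Awesome "Mag. 778 G 69"'))     # ['awesome', 'mag778g69']
print(analyzed_terms("Awesome Mag.\\ 778\\ G\\ 69")) # ['awesome', 'mag778g69']
```

Only the quoted and escaped forms hand the whole reference to the analyser in one piece, which is why only they produce mag778g69.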
Look at the differences between what you tried and the following examples, which should all do what you want:
http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
OR
http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
OR
http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
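
If you build such URLs from client code, something along these lines might help. It is only a sketch: `escape_spaces` and `build_select_url` are made-up helper names, and the parameter values are simply copied from the last example above. Note that the backslash is percent-encoded as %5C here, which the server decodes back to a backslash.

```python
from urllib.parse import urlencode, quote

def escape_spaces(reference):
    """Put a backslash before every space so the reference stays in one piece."""
    return reference.replace(" ", "\\ ")

def build_select_url(base_url, reference, other_words=""):
    """Build a select URL where only the spaces inside the reference are escaped."""
    q = (other_words + " " + escape_spaces(reference)).strip()
    params = {
        "q": q,
        "debugQuery": "on",
        "qf": "text myfield",   # fields taken from the examples above
        "defType": "edismax",
    }
    # quote_via=quote encodes spaces as %20 instead of '+'
    return base_url + "?" + urlencode(params, quote_via=quote)

url = build_select_url(
    "http://localhost:8983/solr/collection1/select",
    "Mag. 778 G 69",
    other_words="Awesome",
)
print(url)
```

The hard part remains the one mentioned earlier: the caller has to know which part of the user input is the reference, since only those spaces should be escaped.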

I hope this helps

Tanguy
On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <andrea.gazzar...@gmail.com> wrote:
Hi all,
I have a field (among others) in my schema defined like this:
<fieldtype name="mytype" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0"
            generateNumberParts="0"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="1"
            splitOnCaseChange="0" />
    </analyzer>
</fieldtype>

<field name="myfield" type="mytype" indexed="true"/>
Basically, both at index and query time, the field value is normalized like this:
Mag. 778 G 69 => mag778g69

Now, in my solrconfig I'm using a search handler like this:

<requestHandler ....>
    ...
    <str name="defType">dismax</str>
    ...
    <str name="mm">100%</str>
    <str name="qf">myfield^3000</str>
    <str name="pf">myfield^30000</str>

</requestHandler>
What I'm expecting is that if I index a document with a value for myfield "Mag. 778 G 69", I will be able to get this document by querying
1. Mag. 778 G 69
2. mag 778 g69
3. mag778g69
But that doesn't work: I'm able to get the document if and only if I use the "normalized" form: mag778g69
After a little bit of debugging, I see that, even though I used a KeywordTokenizer in my field type declaration, Solr is doing something like this:

+((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1) DisjunctionMaxQuery((myfield:778^3000.0)~0.1) DisjunctionMaxQuery((myfield:g^3000.0)~0.1) DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4) DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)

That is, it is tokenizing the original query string (mag + 778 + g + 69), and obviously querying the field for the separate tokens doesn't match anything (at least this is what I think).
Could anybody please explain this to me?

Thanks in advance
Andrea
