I think you can get what you want by escaping the space with a backslash....
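For instance, a tiny sketch of that escaping (an illustrative helper, not part of any Solr client library):

```python
def escape_spaces(query: str) -> str:
    """Escape spaces so the Lucene query parser does not split on them,
    and the whole reference reaches the field's analyzer as one piece."""
    return query.replace(" ", "\\ ")

print(escape_spaces("Mag. 778 G 69"))  # Mag.\ 778\ G\ 69
```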

YMMV of course.
Erick


On Tue, Aug 13, 2013 at 9:11 AM, Andrea Gazzarini <
andrea.gazzar...@gmail.com> wrote:

> Hi Erick,
> sorry if that wasn't clear: this is what I'm actually observing in my
> application.
>
> I wrote the first post after looking at the explain (debugQuery=true): the
> query
>
> q=mag 778 G 69
>
> is translated as follows:
>
>
> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>       DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>       DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>       DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>       DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>
> It seems that although I declare myfield with this type
>
> <fieldtype name="type1" class="solr.TextField">
>
>     <analyzer>
>         <tokenizer class="solr.KeywordTokenizerFactory" />
>
>         <filter class="solr.LowerCaseFilterFactory" />
>         <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="0" generateNumberParts="0"
>             catenateWords="0" catenateNumbers="0"
>             catenateAll="1" splitOnCaseChange="0" />
>     </analyzer>
> </fieldtype>
>
> Solr is therefore tokenizing it, producing several tokens
> (mag, 778, g, 69).
>
> And I can't put double quotes around the query (q="mag 778 G 69") because the
> request handler also searches other fields (with different analysis
> chains).
>
> As I understand it, the query parser does a whitespace tokenization on its
> own before invoking my query-time analysis chain. The same doesn't happen
> at index time... and this is my problem, because at index time the field is
> analyzed exactly as I want, but unfortunately I cannot say the same at
> query time.
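The mismatch can be sketched with a rough Python simulation of the type1 chain (an illustration only, not Solr's actual code; the real WordDelimiterFilterFactory has far more behavior than is modeled here):

```python
import re

def analyze(value: str) -> list[str]:
    """Rough simulation of the type1 chain: KeywordTokenizer (whole value
    as one token), lowercase, then WordDelimiterFilter with catenateAll="1"
    (strip the delimiters and glue the sub-parts back together)."""
    token = value.lower()                      # LowerCaseFilterFactory
    parts = re.findall(r"[a-z0-9]+", token)    # word-delimiter split
    return ["".join(parts)] if parts else []   # catenateAll

# Index time: the whole field value goes through the chain at once.
print(analyze("Mag. 778 G 69"))        # ['mag778g69']

# Query time: the query parser has already split on whitespace, so each
# fragment is analyzed on its own and the catenated form never appears.
print([analyze(t) for t in "mag 778 G 69".split()])
# [['mag'], ['778'], ['g'], ['69']]
```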
>
> Sorry for my wonderful English, did you get the point?
>
>
> On 08/13/2013 02:18 PM, Erick Erickson wrote:
>
>> On a quick scan I don't see a problem here. Attach
>> &debug=query to your url and that'll show you the
>> parsed query, which will in turn show you what's been
>> pushed through the analysis chain you've defined.
>>
>> You haven't stated whether you've tried this and it's
>> not working or you're looking for guidance as to how
>> to accomplish this so it's a little unclear how to
>> respond.
>>
>> BTW, the admin/analysis page is your friend here....
>>
>> Best
>> Erick
>>
>>
>> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
>> andrea.gazzar...@gmail.com> wrote:
>>
>>> Clear, thanks for the response.
>>>
>>> So, if I have two fields
>>>
>>> <fieldtype name="type1" class="solr.TextField">
>>>      <analyzer>
>>>          <tokenizer class="solr.KeywordTokenizerFactory" />
>>>
>>>          <filter class="solr.LowerCaseFilterFactory" />
>>>          <filter class="solr.WordDelimiterFilterFactory"
>>>              generateWordParts="0" generateNumberParts="0"
>>>              catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>              splitOnCaseChange="0" />
>>>      </analyzer>
>>> </fieldtype>
>>> <fieldtype name="type2" class="solr.TextField">
>>>      <analyzer>
>>>          <charFilter class="solr.MappingCharFilterFactory"
>>>              mapping="mapping-FoldToASCII.txt"/>
>>>          <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>          <filter class="solr.LowerCaseFilterFactory" />
>>>          <filter class="solr.WordDelimiterFilterFactory" .../>
>>>      </analyzer>
>>> </fieldtype>
>>>
>>> (with the first field type *Mag. 78 D 99* becomes *mag78d99*, while the
>>> second field type ends up with several tokens)
>>>
>>> And I want to use the same request handler to query against both of them.
>>> I mean, I want the user to search for something like
>>>
>>> http://..../search?q=Mag 78 D 99
>>>
>>> and this search should look in both the first field (with type1) and the
>>> second (with type2), matching
>>>
>>> - a document whose field_with_type1 equals *mag78d99*, or
>>> - a document whose field_with_type2 contains text like "go to
>>> *mag 78*, class *d* and subclass *99*"
>>>
>>>
>>> <requestHandler ....>
>>>      ...
>>>      <str name="defType">dismax</str>
>>>      ...
>>>      <str name="mm">100%</str>
>>>      <str name="qf">
>>>          field_with_type1
>>>          field_with_type_2
>>>      </str>
>>>      ...
>>> </requestHandler>
>>>
>>> Is this not possible? If so, is it possible to do it in some other way?
>>>
>>> Sorry for the long email and thanks again
>>> Andrea
>>>
>>>
>>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>>
>>>> Quoted phrases will be passed to the analyzer as one string, so that is
>>>> where a whitespace tokenizer is needed.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -----Original Message----- From: Andrea Gazzarini
>>>> Sent: Monday, August 12, 2013 6:52 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Tokenization at query time
>>>>
>>>> Hi Tanguy,
>>>> thanks for fast response. What you are saying corresponds perfectly with
>>>> the behaviour I'm observing.
>>>> Now, other than having a big problem (I have several other fields both
>>>> in the pf and qf where spaces doesn't matter, field types like the
>>>> "text_en" field type in the example schema) what I'm wondering is:
>>>>
>>>> "The query parser splits the input query on white spaces, and then each
>>>> token is analysed according to your configuration"
>>>>
>>>> Is there a valid reason to declare a WhitespaceTokenizer in a query
>>>> analyzer? If the input query is already parsed (i.e. whitespace
>>>> tokenized), what is its effect?
>>>>
>>>> Thank you very much for the help
>>>> Andrea
>>>>
>>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>>
>>>>> Hello Andrea,
>>>>> I think you face a rather common issue involving keyword tokenization
>>>>> and query parsing in Lucene:
>>>>> The query parser splits the input query on white spaces, and then each
>>>>> token is analysed according to your configuration.
>>>>> So those queries with a whitespace won't behave as expected because
>>>>> each
>>>>> token is analysed separately. Consequently, the catenated version of
>>>>> the
>>>>> reference cannot be generated.
>>>>> I think you could try surrounding your query with double quotes or
>>>>> escaping the space characters in your query using a backslash so that
>>>>> the
>>>>> whole sequence is analysed in the same analyser and the catenation
>>>>> occurs.
>>>>> You should be aware that this approach has a drawback: you will
>>>>> probably
>>>>> not be able to combine the search for Mag. 778 G 69 with other words in
>>>>> other fields unless you are able to identify which spaces are to be
>>>>> escaped:
>>>>> For example, if the input query is:
>>>>> Awesome Mag. 778 G 69
>>>>> you would want to transform it to:
>>>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>>>>> or
>>>>> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase
>>>>> query
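The two rewrites above can be sketched with hypothetical helpers (identifying which part of the user's input *is* the reference is the hard part, and is assumed already solved here; these functions are illustrative, not part of any Solr API):

```python
def escape_reference(words: str, reference: str) -> str:
    """Escape the reference's internal spaces with backslashes."""
    return words + " " + reference.replace(" ", "\\ ")

def quote_reference(words: str, reference: str) -> str:
    """Wrap only the reference in double quotes (phrase query)."""
    return words + ' "' + reference + '"'

print(escape_reference("Awesome", "Mag. 778 G 69"))  # Awesome Mag.\ 778\ G\ 69
print(quote_reference("Awesome", "Mag. 778 G 69"))   # Awesome "Mag. 778 G 69"
```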
>>>>>
>>>>> Do you get the point?
>>>>>
>>>>> Look at the differences between what you tried and the following
>>>>> examples which should all do what you want:
>>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>>> OR
>>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>>> OR
>>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>>
>>>>>
>>>>>
>>>>> I hope this helps
>>>>>
>>>>> Tanguy
>>>>>
>>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>>>> andrea.gazzar...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a field (among others) in my schema defined like this:
>>>>>>
>>>>>> <fieldtype name="mytype" class="solr.TextField"
>>>>>> positionIncrementGap="100">
>>>>>>      <analyzer>
>>>>>>          <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>          <filter class="solr.LowerCaseFilterFactory" />
>>>>>>          <filter class="solr.WordDelimiterFilterFactory"
>>>>>>              generateWordParts="0"
>>>>>>              generateNumberParts="0"
>>>>>>              catenateWords="0"
>>>>>>              catenateNumbers="0"
>>>>>>              catenateAll="1"
>>>>>>              splitOnCaseChange="0" />
>>>>>>      </analyzer>
>>>>>> </fieldtype>
>>>>>>
>>>>>> <field name="myfield" type="mytype" indexed="true"/>
>>>>>>
>>>>>> Basically, both at index and query time the field value is normalized
>>>>>> like this.
>>>>>>
>>>>>> Mag. 778 G 69 => mag778g69
>>>>>>
>>>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>>>
>>>>>> <requestHandler ....>
>>>>>>      ...
>>>>>>      <str name="defType">dismax</str>
>>>>>>      ...
>>>>>>      <str name="mm">100%</str>
>>>>>>      <str name="qf">myfield^3000</str>
>>>>>>      <str name="pf">myfield^30000</str>
>>>>>>
>>>>>> </requestHandler>
>>>>>>
>>>>>> What I'm expecting is that if I index a document with a value for my
>>>>>> field "Mag. 778 G 69", I will be able to get this document by querying
>>>>>>
>>>>>> 1. Mag. 778 G 69
>>>>>> 2. mag 778 g69
>>>>>> 3. mag778g69
>>>>>>
>>>>>> But that doesn't work: I'm able to get the document if and only if I
>>>>>> use the "normalized" form: mag778g69
>>>>>>
>>>>>> After a bit of debugging, I see that, even though I used a
>>>>>> KeywordTokenizer in my field type declaration, Solr is doing something
>>>>>> like this:
>>>>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>>
>>>>>>
>>>>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>>>>> 69), and obviously querying the field for the separate tokens doesn't
>>>>>> match anything (at least this is what I think).
>>>>>>
>>>>>> Could anybody please explain this to me?
>>>>>>
>>>>>> Thanks in advance
>>>>>> Andrea
>>>>>>
>>>>>>
>>>>>
>
