Clear, thanks for the response.
So, if I have two fields
<fieldtype name="type1" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory*" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="0" generateNumberParts="0"
catenateWords="0" catenateNumbers="0" catenateAll="1"
splitOnCaseChange="0" />
</analyzer>
</fieldtype>
<fieldtype name="type2" class="solr.TextField" >
<analyzer>
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-FoldToASCII.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" .../>
</analyzer>
</fieldtype>
(with the first field type, *Mag. 78 D 99* becomes *mag78d99*, while
with the second it ends up as several tokens)
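To illustrate (this is my sketch; for type2 I'm assuming its elided
WordDelimiterFilter settings generate word and number parts):

type1 (KeywordTokenizer + catenateAll=1): "Mag. 78 D 99" -> [mag78d99]
type2 (WhitespaceTokenizer + WDF)       : "Mag. 78 D 99" -> [mag] [78] [d] [99]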
And I want to use the same request handler to query against both of
them. I mean, I want the user to be able to search for something like
http://..../search?q=Mag 78 D 99
and this search should look within both the first field (type1) and the
second (type2), matching:
- a document whose field_with_type1 equals *mag78d99*, or
- a document whose field_with_type2 contains text like "go to
*mag 78*, class *d* and subclass *99*"
<requestHandler ....>
  ...
  <str name="defType">dismax</str>
  ...
  <str name="mm">100%</str>
  <str name="qf">
    field_with_type1
    field_with_type2
  </str>
  ...
</requestHandler>
Is this not possible? And if not, is it possible to do it in some other way?
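In case it helps future readers, here is a sketch of the kind of request
that should satisfy both fields at once, based on the advice below
(quoting the reference so it reaches each field's query analyzer as a
single string; host, port and handler path are placeholders):

http://localhost:8983/solr/search?q="Mag 78 D 99"&defType=dismax&mm=100%25&qf=field_with_type1 field_with_type2&debugQuery=on

With the quotes, field_with_type1's KeywordTokenizer sees the whole
string and catenates it to mag78d99, while field_with_type2 builds a
phrase query from the whitespace-split tokens.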
Sorry for the long email and thanks again
Andrea
On 08/12/2013 04:01 PM, Jack Krupansky wrote:
Quoted phrases will be passed to the analyzer as one string, so that is
where a whitespace tokenizer is needed.
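(Schematically, for the type2 chain above:

"Mag 78 D 99" -> passed to the query analyzer as one string: Mag 78 D 99
              -> WhitespaceTokenizer: [Mag] [78] [D] [99]
              -> LowerCaseFilter etc.: [mag] [78] [d] [99] -> phrase query)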
-- Jack Krupansky
-----Original Message----- From: Andrea Gazzarini
Sent: Monday, August 12, 2013 6:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Tokenization at query time
Hi Tanguy,
thanks for the fast response. What you are saying corresponds perfectly with
the behaviour I'm observing.
Now, apart from having a big problem (I have several other fields in
both the pf and qf where spaces don't matter, field types like the
"text_en" field type in the example schema), what I'm wondering is:
"The query parser splits the input query on white spaces, and then each
token is analysed according to your configuration"
Is there a valid reason to declare a WhitespaceTokenizer in a query
analyzer? If the input query is already parsed (i.e. whitespace
tokenized), what is its effect?
Thank you very much for the help
Andrea
On 08/12/2013 12:37 PM, Tanguy Moal wrote:
Hello Andrea,
I think you face a rather common issue involving keyword tokenization
and query parsing in Lucene:
The query parser splits the input query on white spaces, and then
each token is analysed according to your configuration.
So queries containing whitespace won't behave as expected, because
each token is analysed separately. Consequently, the catenated
version of the reference cannot be generated.
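Schematically:

q=Mag. 778 G 69
  -> query parser splits on whitespace: [Mag.] [778] [G] [69]
  -> each token goes through the field analyzer on its own
  -> catenateAll never sees the whole reference, so mag778g69 is never produced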
I think you could try surrounding your query with double quotes or
escaping the space characters in your query using a backslash so that
the whole sequence is analysed in the same analyser and the
catenation occurs.
You should be aware that this approach has a drawback: you will
probably not be able to combine the search for Mag. 778 G 69 with
other words in other fields unless you are able to identify which
spaces are to be escaped:
For example, if the input query is:
Awesome Mag. 778 G 69
you would want to transform it to:
Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
or
Awesome "Mag. 778 G 69" // only the reference is turned into a phrase
query
Do you get the point?
Look at the differences between what you tried and the following
examples which should all do what you want:
http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
OR
http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
OR
http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
I hope this helps
Tanguy
On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini
<andrea.gazzar...@gmail.com> wrote:
Hi all,
I have a field (among others) in my schema defined like this:
<fieldtype name="mytype" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.*KeywordTokenizerFactory*" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="0"
generateNumberParts="0"
catenateWords="0"
catenateNumbers="0"
catenateAll="1"
splitOnCaseChange="0" />
</analyzer>
</fieldtype>
<field name="myfield" type="mytype" indexed="true"/>
Basically, at both index and query time the field value is
normalized like this:
Mag. 778 G 69 => mag778g69
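Step by step, my reading of the chain above:

KeywordTokenizerFactory    : "Mag. 778 G 69"  (kept as one token)
LowerCaseFilterFactory     : "mag. 778 g 69"
WordDelimiterFilterFactory : "mag778g69"      (catenateAll=1; all other parts disabled)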
Now, in my solrconfig I'm using a search handler like this:
<requestHandler ....>
  ...
  <str name="defType">dismax</str>
  ...
  <str name="mm">100%</str>
  <str name="qf">myfield^3000</str>
  <str name="pf">myfield^30000</str>
</requestHandler>
What I'm expecting is that if I index a document with the value
"Mag. 778 G 69" in my field, I will be able to get this document by querying:
1. Mag. 778 G 69
2. mag 778 g69
3. mag778g69
But that doesn't work: I'm able to get the document if and only if
I use the "normalized" form: mag778g69
After doing a little bit of debugging, I see that, even though I used a
KeywordTokenizer in my field type declaration, Solr is doing
something like this:
+((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
   DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
   DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
   DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
 DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
That is, it is tokenizing the original query string (mag + 778 + g +
69), and obviously querying the field for the separate tokens doesn't
match anything (at least this is what I think).
Could anybody please explain this to me?
Thanks in advance
Andrea