I think you can get what you want by escaping the space with a backslash....
YMMV of course.

Erick

On Tue, Aug 13, 2013 at 9:11 AM, Andrea Gazzarini <andrea.gazzar...@gmail.com> wrote:

> Hi Erick,
> sorry if that wasn't clear: this is what I'm actually observing in my
> application.
>
> I wrote the first post after looking at the explain (debugQuery=true): the
> query
>
>   q=mag 778 G 69
>
> is translated as follows:
>
>   +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>   DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>   DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>   DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>   DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>
> It seems that although I declare myfield with this type
>
>   <fieldtype name="type1" class="solr.TextField">
>     <analyzer>
>       <tokenizer class="solr.KeywordTokenizerFactory" />
>       <filter class="solr.LowerCaseFilterFactory" />
>       <filter class="solr.WordDelimiterFilterFactory"
>               generateWordParts="0" generateNumberParts="0"
>               catenateWords="0" catenateNumbers="0"
>               catenateAll="1" splitOnCaseChange="0" />
>     </analyzer>
>   </fieldtype>
>
> Solr is tokenizing it anyway, producing several tokens (mag, 778, g, 69).
>
> And I can't put double quotes around the query (q="mag 778 G 69") because the
> request handler also searches other fields (with different configuration
> chains).
>
> As I understand it, the query parser (i.e. at query time) does a whitespace
> tokenization of its own before invoking my (query-time) chain. The same
> doesn't happen at index time. This is my problem: at index time the field
> is analyzed exactly as I want, but unfortunately I cannot say the same at
> query time.
>
> Sorry for my wonderful English, did you get the point?
>
> On 08/13/2013 02:18 PM, Erick Erickson wrote:
>
>> On a quick scan I don't see a problem here.
>> Attach &debug=query to your URL and that'll show you the
>> parsed query, which will in turn show you what's been
>> pushed through the analysis chain you've defined.
>>
>> You haven't stated whether you've tried this and it's
>> not working or you're looking for guidance as to how
>> to accomplish this, so it's a little unclear how to
>> respond.
>>
>> BTW, the admin/analysis page is your friend here....
>>
>> Best
>> Erick
>>
>> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <andrea.gazzar...@gmail.com> wrote:
>>
>>> Clear, thanks for the response.
>>>
>>> So, if I have two field types
>>>
>>>   <fieldtype name="type1" class="solr.TextField">
>>>     <analyzer>
>>>       <tokenizer class="solr.KeywordTokenizerFactory" />
>>>       <filter class="solr.LowerCaseFilterFactory" />
>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>               generateWordParts="0" generateNumberParts="0"
>>>               catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>               splitOnCaseChange="0" />
>>>     </analyzer>
>>>   </fieldtype>
>>>   <fieldtype name="type2" class="solr.TextField">
>>>     <analyzer>
>>>       <charFilter class="solr.MappingCharFilterFactory"
>>>                   mapping="mapping-FoldToASCII.txt"/>
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>       <filter class="solr.LowerCaseFilterFactory" />
>>>       <filter class="solr.WordDelimiterFilterFactory" .../>
>>>     </analyzer>
>>>   </fieldtype>
>>>
>>> (with the first field type *Mag. 78 D 99* becomes *mag78d99*, while the
>>> second field type ends up with several tokens)
>>>
>>> And I want to use the same request handler to query against both of them.
>>> I mean I want the user to search for something like
>>>
>>>   http//..../search?q=Mag 78 D 99
>>>
>>> and this search should look in both the first field (with type1) and the
>>> second (with type2), matching
>>>
>>> - a document whose field_with_type1 equals *mag78d99*, or
>>> - a document whose field_with_type2 contains a text like "go to
>>>   *mag 78*, class *d* and subclass *99*"
>>>
>>>   <requestHandler ....>
>>>     ...
>>>     <str name="defType">dismax</str>
>>>     ...
>>>     <str name="mm">100%</str>
>>>     <str name="qf">
>>>       field_with_type1
>>>       field_with_type_2
>>>     </str>
>>>     ...
>>>   </requestHandler>
>>>
>>> Is this not possible? If so, is it possible to do that in some other way?
>>>
>>> Sorry for the long email and thanks again
>>> Andrea
>>>
>>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>>
>>>> Quoted phrases will be passed to the analyzer as one string, so there a
>>>> white space tokenizer is needed.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -----Original Message----- From: Andrea Gazzarini
>>>> Sent: Monday, August 12, 2013 6:52 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Tokenization at query time
>>>>
>>>> Hi Tanguy,
>>>> thanks for the fast response. What you are saying corresponds perfectly
>>>> with the behaviour I'm observing.
>>>> Now, other than having a big problem (I have several other fields both
>>>> in the pf and qf where spaces don't matter, field types like the
>>>> "text_en" field type in the example schema), what I'm wondering is:
>>>>
>>>> "The query parser splits the input query on white spaces, and then each
>>>> token is analysed according to your configuration"
>>>>
>>>> Is there a valid reason to declare a WhiteSpaceTokenizer in a query
>>>> analyzer? If the input query is already parsed (i.e. whitespace
>>>> tokenized), what is its effect?
>>>>
>>>> Thank you very much for the help
>>>> Andrea
>>>>
>>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>>
>>>>> Hello Andrea,
>>>>> I think you face a rather common issue involving keyword tokenization
>>>>> and query parsing in Lucene:
>>>>> The query parser splits the input query on white spaces, and then each
>>>>> token is analysed according to your configuration.
>>>>> So queries containing whitespace won't behave as expected because each
>>>>> token is analysed separately. Consequently, the catenated version of the
>>>>> reference cannot be generated.
>>>>> I think you could try surrounding your query with double quotes or
>>>>> escaping the space characters in your query using a backslash so that the
>>>>> whole sequence is analysed in the same analyser and the catenation
>>>>> occurs.
>>>>> You should be aware that this approach has a drawback: you will probably
>>>>> not be able to combine the search for Mag. 778 G 69 with other words in
>>>>> other fields unless you are able to identify which spaces are to be
>>>>> escaped:
>>>>> For example, if the input query is:
>>>>>   Awesome Mag. 778 G 69
>>>>> you would want to transform it to:
>>>>>   Awesome Mag.\ 778\ G\ 69    // spaces are escaped in the reference only
>>>>> or
>>>>>   Awesome "Mag. 778 G 69"     // only the reference is turned into a phrase query
>>>>>
>>>>> Do you get the point?
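A minimal client-side sketch of the escaping described above, in Python, assuming the caller already knows which part of the input is the reference. The helper names `escape_spaces` and `build_query` are hypothetical and not part of Solr; the request parameters simply mirror the edismax example URL given in this thread.

```python
import re
from urllib.parse import quote, urlencode


def escape_spaces(reference: str) -> str:
    """Put a backslash in front of every whitespace character."""
    return re.sub(r"\s", lambda m: "\\" + m.group(0), reference)


def build_query(free_text: str, reference: str) -> str:
    """Join ordinary words with a reference whose spaces are escaped.

    Only the reference is escaped, so the free-text words are still
    split on whitespace as usual.
    """
    return f"{free_text} {escape_spaces(reference)}".strip()


q = build_query("Awesome", "Mag. 778 G 69")
print(q)

# URL-encode the parameters; quote_via=quote encodes spaces as %20.
# The parameter values below are assumptions copied from the thread's example.
params = urlencode(
    {"q": q, "debugQuery": "on", "qf": "text myfield", "defType": "edismax"},
    quote_via=quote,
)
print(params)
```

Whether the escaped form matches still depends on the query parser in use, so it is worth checking the parsed query in the debug output.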
>>>>>
>>>>> Look at the differences between what you tried and the following
>>>>> examples, which should all do what you want:
>>>>>
>>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>>> OR
>>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>>> OR
>>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>>
>>>>> I hope this helps
>>>>>
>>>>> Tanguy
>>>>>
>>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <andrea.gazzar...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a field (among others) in my schema defined like this:
>>>>>>
>>>>>>   <fieldtype name="mytype" class="solr.TextField"
>>>>>>              positionIncrementGap="100">
>>>>>>     <analyzer>
>>>>>>       <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>       <filter class="solr.LowerCaseFilterFactory" />
>>>>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>>>>               generateWordParts="0"
>>>>>>               generateNumberParts="0"
>>>>>>               catenateWords="0"
>>>>>>               catenateNumbers="0"
>>>>>>               catenateAll="1"
>>>>>>               splitOnCaseChange="0" />
>>>>>>     </analyzer>
>>>>>>   </fieldtype>
>>>>>>
>>>>>>   <field name="myfield" type="mytype" indexed="true"/>
>>>>>>
>>>>>> Basically, both at index and query time the field value is normalized
>>>>>> like this:
>>>>>>
>>>>>>   Mag. 778 G 69 => mag778g69
>>>>>>
>>>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>>>
>>>>>>   <requestHandler ....>
>>>>>>     ...
>>>>>>     <str name="defType">dismax</str>
>>>>>>     ...
>>>>>>     <str name="mm">100%</str>
>>>>>>     <str name="qf">myfield^3000</str>
>>>>>>     <str name="pf">myfield^30000</str>
>>>>>>   </requestHandler>
>>>>>>
>>>>>> What I'm expecting is that if I index a document with the value
>>>>>> "Mag. 778 G 69" for my field, I will be able to get this document by
>>>>>> querying
>>>>>>
>>>>>> 1. Mag. 778 G 69
>>>>>> 2. mag 778 g69
>>>>>> 3. mag778g69
>>>>>>
>>>>>> But that doesn't work: I'm able to get the document if and only if I
>>>>>> use the "normalized" form: mag778g69
>>>>>>
>>>>>> After a little debugging, I see that, even though I used a
>>>>>> KeywordTokenizer in my field type declaration, Solr is doing something
>>>>>> like this:
>>>>>>
>>>>>>   +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>>>   DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>>>   DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>>>   DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>>>   DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>>
>>>>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>>>>> 69), and obviously querying the field for separate tokens doesn't match
>>>>>> anything (at least this is what I think).
>>>>>>
>>>>>> Could anybody please explain that to me?
>>>>>>
>>>>>> Thanks in advance
>>>>>> Andrea
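The behaviour discussed in this thread can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions, not Solr's actual code: `analyze_keyword_catenate` is a hypothetical stand-in for the KeywordTokenizer + LowerCaseFilter + WordDelimiterFilter (catenateAll="1") chain, and `split_like_query_parser` is a hypothetical stand-in for the whitespace split the query parser performs before analysis.

```python
import re
from typing import List


def analyze_keyword_catenate(text: str) -> str:
    """Toy analysis chain: treat the input as one token, lowercase it,
    then join all alphanumeric runs into a single term."""
    return "".join(re.findall(r"[a-z0-9]+", text.lower()))


def split_like_query_parser(query: str) -> List[str]:
    """Toy pre-split: break on whitespace unless it is backslash-escaped
    or inside double quotes."""
    chunks: List[str] = []
    current: List[str] = []
    in_quotes = False
    i = 0
    while i < len(query):
        ch = query[i]
        if ch == "\\" and i + 1 < len(query):
            current.append(query[i + 1])  # keep the escaped character
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch.isspace() and not in_quotes:
            if current:
                chunks.append("".join(current))
                current = []
        else:
            current.append(ch)
        i += 1
    if current:
        chunks.append("".join(current))
    return chunks


def query_terms(query: str) -> List[str]:
    """Analyze each pre-split chunk separately, as described in the thread."""
    return [analyze_keyword_catenate(c) for c in split_like_query_parser(query)]


print(query_terms("Mag. 778 G 69"))         # split first, then analyzed
print(query_terms("Mag.\\ 778\\ G\\ 69"))   # escaped spaces: one chunk
print(query_terms('"Mag. 778 G 69"'))       # quoted: one chunk
```

In this model the unescaped query yields four separate terms, none of which equals the indexed `mag778g69`, while the escaped and quoted forms reach the analysis step as a single chunk and catenate as expected.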