On 10/11/12 4:47 PM, Jack Krupansky wrote:
The ":" which normally separates a field name from a term (or quoted string or parenthesized sub-query) is "parsed" by the query parser before analysis gets called, and "*:*" is recognized before analysis as well. So, any attempt to recreate "*:*" in analysis will be too late to affect query parsing and other pre-analysis processing.
That's why I suspect a bug in Solr. The tokenizer shouldn't play any role here, but it is affecting the score calculation. I am seeing evidence that "*:*" is being passed to my tokenizer. I'm trying to find a way to work around this by reconstructing "*:*" in the analysis chain.

But, what is it you are really trying to do? What's the real problem? (This sounds like a proverbial "XY Problem".)

-- Jack Krupansky

-----Original Message----- From: T. Kuro Kurosaka
Sent: Thursday, October 11, 2012 7:35 PM
To: solr-user@lucene.apache.org
Subject: Any filter to map mutiple tokens into one ?

I am looking for a way to fold a particular sequence of tokens into one
token.
Concretely, I'd like to detect a three-token sequence of "*", ":" and
"*", and replace it with a token of the text "*:*".
I tried SynonymFilter but it seems it can only deal with a single input
token. The rule "* : * => *:*" seems to be interpreted
as one input token of 5 characters: "*", space, ":", space and "*".

I'm using Solr 3.5.

Background:
My tokenizer separates the three-character sequence "*:*" into 3 tokens
of one character each.
The edismax parser, when given the query "*:*", i.e. find every doc,
seems to pass the entire string "*:*" to the query analyzer (I suspect
a bug)
and feed the tokenized result to a DisjunctionMaxQuery object,
according to this debug output:

<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">+MatchAllDocsQuery(*:*)
DisjunctionMaxQuery((body:"* : *"~100^0.5 | title:"* :
*"~100^1.2)~0.01)</str>
<str name="parsedquery_toString">+*:* (body:"* : *"~100^0.5 | title:"* :
*"~100^1.2)~0.01</str>

Notice that there is a space between * and : in
DisjunctionMaxQuery((body:"* : *" ....)

Probably because of this, the hit score is as low as 0.109, while it is
1.000 if an analyzer that doesn't break "*:*" is used.
So I'd like to stitch together "*", ":", "*" into "*:*" again to make
DisjunctionMaxQuery happy.
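
A minimal sketch of that folding step, written as plain Java over a list of token strings rather than as a real Lucene TokenFilter. The class name TokenFolder and the method fold are hypothetical, and the sketch ignores position increments and offsets, which an actual filter in the analysis chain would have to maintain:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: illustrates the folding logic only, not the Lucene API.
public class TokenFolder {

    /** Replaces every occurrence of the token sequence `pattern` with the single token `merged`. */
    public static List<String> fold(List<String> tokens, List<String> pattern, String merged) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < tokens.size()) {
            int end = i + pattern.size();
            // Compare the window starting at i against the pattern.
            if (!pattern.isEmpty() && end <= tokens.size()
                    && tokens.subList(i, end).equals(pattern)) {
                out.add(merged);   // emit one merged token
                i = end;           // skip the tokens that were folded
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> pattern = Arrays.asList("*", ":", "*");
        // prints [*:*]
        System.out.println(fold(Arrays.asList("*", ":", "*"), pattern, "*:*"));
        // prints [a, *:*, b]
        System.out.println(fold(Arrays.asList("a", "*", ":", "*", "b"), pattern, "*:*"));
    }
}
```

A real filter would need to buffer up to three tokens before deciding whether to emit them unchanged or as one merged token.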


Thanks.


T. "Kuro" Kurosaka

