I am looking for a way to fold a particular sequence of tokens into one token. Concretely, I'd like to detect the three-token sequence "*", ":", "*" and replace it with a single token whose text is "*:*". I tried SynonymFilter, but it seems it can only deal with a single input token: the rule "* : * => *:*" seems to be interpreted as one input token of 5 characters ("*", space, ":", space, "*").

I'm using Solr 3.5.

Background:
My tokenizer separates the three-character sequence "*:*" into 3 tokens of one character each. The edismax parser, when given the query "*:*" (i.e. find every doc), seems to pass the entire string "*:*" to the query analyzer (I suspect a bug) and feed the tokenized result to a DisjunctionMaxQuery object, according to this debug output:

<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((body:"* : *"~100^0.5 | title:"* : *"~100^1.2)~0.01)</str>
<str name="parsedquery_toString">+*:* (body:"* : *"~100^0.5 | title:"* : *"~100^1.2)~0.01</str>

Notice that there is a space between * and : in DisjunctionMaxQuery((body:"* : *" ....)

Probably because of this, the hit score is as low as 0.109, while it is 1.000 if an analyzer that doesn't break "*:*" is used. So I'd like to stitch together "*", ":", "*" into "*:*" again to make DisjunctionMaxQuery happy.
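
To make the intended behavior concrete, here is a minimal sketch in plain Java of the stitching I am after. It works on a list of strings rather than on a Lucene TokenStream, and TokenFolder and fold are made-up names, not Solr or Lucene APIs. A real solution would presumably have to be a token filter that buffers tokens and also fixes up offsets and position increments, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Solr or Lucene.
public class TokenFolder {

    // Replaces every occurrence of the token sequence `pattern` in `tokens`
    // with the single token `replacement`, scanning left to right.
    static List<String> fold(List<String> tokens, List<String> pattern, String replacement) {
        List<String> out = new ArrayList<String>();
        int n = pattern.size();
        int i = 0;
        while (i < tokens.size()) {
            boolean matches = n > 0
                    && i + n <= tokens.size()
                    && tokens.subList(i, i + n).equals(pattern);
            if (matches) {
                out.add(replacement); // emit one folded token
                i += n;               // skip the tokens that were folded
            } else {
                out.add(tokens.get(i)); // pass the token through unchanged
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> pattern = Arrays.asList("*", ":", "*");
        // prints [*:*]
        System.out.println(fold(Arrays.asList("*", ":", "*"), pattern, "*:*"));
        // prints [a, *:*, b]
        System.out.println(fold(Arrays.asList("a", "*", ":", "*", "b"), pattern, "*:*"));
    }
}
```

What I am asking is whether something in Solr 3.5 already does the equivalent of fold() inside the analysis chain.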


Thanks.


T. "Kuro" Kurosaka

