You can build shingles and then apply the synonym filter. In that case you will also have to decide what to do with all the tokens you no longer need after the shingle filter.
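
To make the idea concrete, here is a minimal sketch in plain Python of what "shingle first, then map the shingle" amounts to. It is not Solr or Lucene code: the function names `shingles` and `fold`, the empty token separator, and the rule of keeping the longest match are all assumptions made for illustration. The second function also shows the cleanup step, i.e. dropping the single tokens that the matched shingle covers.

```python
def shingles(tokens, max_size=3, sep=""):
    """Return (start, size, text) for every run of 1..max_size adjacent tokens."""
    out = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size <= len(tokens):
                out.append((start, size, sep.join(tokens[start:start + size])))
    return out


def fold(tokens, targets=frozenset({"*:*"}), max_size=3):
    """Replace each token run whose shingle text is in `targets` by one token.

    Hypothetical stand-in for a shingle filter followed by a synonym-style
    mapping plus removal of the now-redundant constituent tokens.
    """
    # Remember the longest matching multi-token shingle per start position.
    by_start = {}
    for start, size, text in shingles(tokens, max_size, sep=""):
        if size > 1 and text in targets:
            by_start[start] = max(by_start.get(start, (0, "")), (size, text))

    result = []
    i = 0
    while i < len(tokens):
        if i in by_start:
            size, text = by_start[i]
            result.append(text)  # emit the folded token
            i += size            # skip the tokens it replaces
        else:
            result.append(tokens[i])
            i += 1
    return result
```

For example, `fold(["*", ":", "*"])` gives a single token, while token runs that do not form a target shingle pass through unchanged. In a real analyzer chain the equivalent cleanup would have to be done by whatever filter follows the shingle filter.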

On 12.10.2012 01:35, T. Kuro Kurosaka wrote:
I am looking for a way to fold a particular sequence of tokens into one token. Concretely, I'd like to detect a three-token sequence of "*", ":" and "*", and replace it with a single token with the text "*:*". I tried SynonymFilter, but it seems it can only deal with a single input token. The rule "* : * => *:*" seems to be interpreted as one input token of 5 characters: "*", space, ":", space and "*".

I'm using Solr 3.5.

Background:
My tokenizer separates the three-character sequence "*:*" into 3 tokens of one character each. The edismax parser, when given the query "*:*" (i.e. find every doc), seems to pass the entire string "*:*" to the query analyzer (I suspect a bug) and feed the tokenized result to a DisjunctionMaxQuery object, according to this debug output:

<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((body:"* : *"~100^0.5 | title:"* : *"~100^1.2)~0.01)</str>
<str name="parsedquery_toString">+*:* (body:"* : *"~100^0.5 | title:"* : *"~100^1.2)~0.01</str>

Notice that there is a space between * and : in DisjunctionMaxQuery((body:"* : *" ....)

Probably because of this, the hit score is as low as 0.109, while it is 1.000 if an analyzer that doesn't break "*:*" is used. So I'd like to stitch together "*", ":", "*" into "*:*" again to make DisjunctionMaxQuery happy.


Thanks.


T. "Kuro" Kurosaka


