You can build shingles and then apply the synonym filter. In that case
you will have to decide what to do with all the tokens you no longer
need after the shingle filter.
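
A minimal sketch of that idea in plain Python, not Lucene/Solr code. The
names shingles, SYNONYMS and apply_synonyms are hypothetical, and the
sketch assumes the shingle filter joins adjacent tokens with a single
space, so that the three tokens become the single token "* : *" that a
one-token synonym rule can match:

```python
def shingles(tokens, max_size=3, sep=" "):
    """Return every run of 1..max_size adjacent tokens, joined by sep."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out


# One-token-in, one-token-out rule, standing in for "* : * => *:*".
SYNONYMS = {"* : *": "*:*"}


def apply_synonyms(tokens):
    """Map each token through SYNONYMS, leaving unknown tokens unchanged."""
    return [SYNONYMS.get(t, t) for t in tokens]


shingled = shingles(["*", ":", "*"])
mapped = apply_synonyms(shingled)
# Drop everything except the folded token; this is the cleanup step.
folded = [t for t in mapped if t in SYNONYMS.values()]

print(shingled)
print(mapped)
print(folded)
```

Of the six tokens produced by the shingle step, only one is wanted, which
is why the leftover unigrams and bigrams need a separate filtering step.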
On 12.10.2012 01:35, T. Kuro Kurosaka wrote:
I am looking for a way to fold a particular sequence of tokens into
one token.
Concretely, I'd like to detect a three-token sequence of "*", ":" and
"*", and replace it with a token of the text "*:*".
I tried SynonymFilter but it seems it can only deal with a single
input token. "* : * => *:*" seems to be interpreted
as one input token of 5 characters "*", space, ":", space and "*".
I'm using Solr 3.5.
Background:
My tokenizer separates the three-character sequence "*:*" into 3 tokens
of one character each.
The edismax parser, when given the query "*:*" (i.e. find every doc),
seems to pass the entire string "*:*" to the query analyzer (I suspect
a bug) and feed the tokenized result to a DisjunctionMaxQuery object,
according to this debug output:
<lst name="debug">
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">+MatchAllDocsQuery(*:*) DisjunctionMaxQuery((body:"* : *"~100^0.5 | title:"* : *"~100^1.2)~0.01)</str>
<str name="parsedquery_toString">+*:* (body:"* : *"~100^0.5 | title:"* : *"~100^1.2)~0.01</str>
Notice that there is a space between * and : in
DisjunctionMaxQuery((body:"* : *" ....)
Probably because of this, the hit score is as low as 0.109, whereas it
is 1.000 when an analyzer that doesn't break up "*:*" is used.
So I'd like to stitch together "*", ":", "*" into "*:*" again to make
DisjunctionMaxQuery happy.
Thanks.
T. "Kuro" Kurosaka