Re: Searching w/explicit Multi-Word Synonym Expansion

Roman Chyla Wed, 17 Jul 2013 14:51:14 -0700

Hi Dave,



On Wed, Jul 17, 2013 at 2:03 PM, dmarini <david.marini...@gmail.com> wrote:

> Roman,
>
> As a developer, I understand where you are coming from. My issue is that I
> specialize in .NET, haven't done java dev in over 10 years. As an
> organization we're new to solr (coming from endeca) and we're looking to
> use
> it more across the organization, so for us, we are looking to do the
> classic
> time/payoff justification for most features that are causing a bit of
> friction. I have seen custom query parsers that are out there that seem
> like
> they will do what we're looking to do, but I worry that they might fix a
> custom case and not necessarily work for us.
>

I was in the same position 2 years back, that's why I developed the
ANTLR query parser (before that, I went through a phase of hacking
different query parsers, but it was always obvious to me that this cannot
work for anything but simple cases)


>
> Also, Roman, are you suggesting that I can have an indexed document titled
> "hubble telescope" and as long as I separate multi-word synonyms with the
> null character \0 in the synonyms.txt file the query expansion will just
> work? if so, that would suffice for our needs.. can you elaborate or will
> the query parser still foil the system. I ask because I've seen instances
>

First, a bit of explanation of how indexing/tokenization operates:

input text: "hubble space telescope is in the space"

let's say we are tokenizing on empty space and we use stopwords; this is
what gets indexed:

hubble
space
telescope
space

these tokens can have different positions, but let's ignore that for a
moment - the first three are adjacent
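
A minimal sketch of that tokenization step, in Python rather than Lucene's
real analyzers. The stopword list and the analyze() helper are assumptions
made up for illustration; whether real Lucene keeps the position gaps left
by removed stopwords depends on the analyzer configuration.

```python
# Toy whitespace tokenizer with stopword removal; NOT Lucene's actual analyzer.
STOPWORDS = {"is", "in", "the"}  # assumed stopword list for this example


def analyze(text):
    """Return (position, token) pairs, dropping stopwords but keeping their position gaps."""
    return [
        (pos, word)
        for pos, word in enumerate(text.lower().split())
        if word not in STOPWORDS
    ]


tokens = analyze("hubble space telescope is in the space")
print(tokens)
# [(0, 'hubble'), (1, 'space'), (2, 'telescope'), (6, 'space')]
```

The first three tokens sit at adjacent positions 0, 1, 2; the last "space"
lands further away because the stopwords before it were dropped.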


> where I can use the admin analysis tool against a custom field type to
> expand a multi-word synonym where it appears it's expanding the terms
> properly but when I run a search against it using the actual handler, it
> doesn't behave the same way and the debugQuery shows that indeed it split
> my
> term and did not expand it.
>

this is because the solr analysis tool is seeing the whole input as one
string "hubble space telescope", WHILST the standard query parser first
tokenizes, then builds the query *out of every token* - so the synonym
filter is seeing 3 separate tokens instead of 1 big token, and for the
example input above the parser builds the following query

field:hubble field:space field:telescope field:space
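
The difference can be sketched with a toy model in Python. The synonym rule,
expand(), analysis_tool() and standard_parser() are hypothetical names for
illustration only; they mimic the behaviour described here and are not Solr
APIs.

```python
# Hypothetical synonym rule: a 3-token key mapping to a single-token synonym.
SYNONYMS = {("hubble", "space", "telescope"): "HST"}


def expand(tokens):
    """Toy stand-in for a synonym filter: match multi-token keys within ONE token stream."""
    out, i = [], 0
    while i < len(tokens):
        for key, synonym in SYNONYMS.items():
            if tuple(tokens[i:i + len(key)]) == key:
                out.append(synonym)   # synonym found for the whole span
                out.extend(key)       # keep the original tokens too
                i += len(key)
                break
        else:
            # no synonym key starts at this token; pass it through unchanged
            out.append(tokens[i])
            i += 1
    return out


def analysis_tool(text):
    # The whole input goes through the filter as one stream.
    return expand(text.split())


def standard_parser(text):
    # The parser splits first, so the filter only ever sees one word at a time.
    terms = []
    for word in text.split():
        terms.extend(expand([word]))
    return terms


print(analysis_tool("hubble space telescope"))    # ['HST', 'hubble', 'space', 'telescope']
print(standard_parser("hubble space telescope"))  # ['hubble', 'space', 'telescope']
```

In the second call the filter never gets a chance to match the 3-token key,
which is why debugQuery shows the term split and not expanded.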

HOWEVER, when you send the phrase query, it arrives as one token - the
synonym filter will see it, it will recognize it as a multi-token synonym
and it will expand it

BUT, the standard behaviour is to insert the new token into the position of
the first token, so you will get a phrase query

"(hubble | HST) space telescope space"

So really, the problem of the multi-token synonym expansion is in essence a
problem of a query parser - it must know how to harvest tokens, expand
them, and how to build a proper query - in this case, the HST [one token]
spans over 3 original tokens, so the parser must be smart enough to build:

"hubble space telescope space" OR "HST in the space"
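
A rough Python sketch of the two behaviours, assuming the parser already
knows which span of tokens the synonym covers. naive_phrase() and
span_aware_query() are hypothetical helpers, not part of any Lucene/Solr
parser. The sketch works on the already-analyzed tokens (stopwords removed),
so its second phrase comes out as "HST space".

```python
def naive_phrase(tokens, span_start, synonym):
    """Stack the synonym on the first token of the span only (standard behaviour)."""
    parts = list(tokens)
    parts[span_start] = f"({tokens[span_start]} | {synonym})"
    return '"' + " ".join(parts) + '"'


def span_aware_query(tokens, span_start, span_len, synonym):
    """Emit one phrase per alternative, replacing the WHOLE span with the synonym."""
    original = " ".join(tokens)
    replaced = " ".join(
        tokens[:span_start] + [synonym] + tokens[span_start + span_len:]
    )
    return f'"{original}" OR "{replaced}"'


tokens = ["hubble", "space", "telescope", "space"]
print(naive_phrase(tokens, 0, "HST"))
# "(hubble | HST) space telescope space"
print(span_aware_query(tokens, 0, 3, "HST"))
# "hubble space telescope space" OR "HST space"
```

The naive form can only match "HST space telescope space", which no document
contains; the span-aware form yields two phrases that can each match.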

So, the synonym expansion part is standard FST, already in the Lucene/SOLR
core. The parser that can handle these cases (and not just them, but also
many others) is also inside Lucene - it is called 'flexible' and has been
contributed by IBM a few years back. But so far it has been a sleeping beauty.

I haven't seen the LucidWorks parser, but from the description it seems it does
a much better job than the standard parser (if, when you do a quoted phrase
search for "hubble space telescope in the space" and the result is: "hubble
space telescope space" OR "HST in the space", you can be reasonably sure it
does everything - well, to be 100% sure: "HST in the space" should also
produce the same query; but that's a much longer discussion about
index-time XOR query-time analysis)

roman



>
> Jack,
>
> Is there a link where I can read more about the LucidWorks search parser
> and
> how we can perchance tie into that so I can test to see if it yields better
> results?
>
> Thanks again for the help and suggestions. As an organization, we've
> learned
> much of solr since we started in 4.1 (especially with the cloud). The devs
> are doing phenomenal work and my query is really meant more as confirmation
> that I'm taking the correct approach than to beg for a specific feature :)
>
> --Dave
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078675.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
