Hi Dave,
On Wed, Jul 17, 2013 at 2:03 PM, dmarini <david.marini...@gmail.com> wrote:

> Roman,
>
> As a developer, I understand where you are coming from. My issue is that I
> specialize in .NET, haven't done java dev in over 10 years. As an
> organization we're new to solr (coming from endeca) and we're looking to use
> it more across the organization, so for us, we are looking to do the classic
> time/payoff justification for most features that are causing a bit of
> friction. I have seen custom query parsers that are out there that seem like
> they will do what we're looking to do, but I worry that they might fix a
> custom case and not necessarily work for us.
>

I was in the same position 2 years back; that's why I developed the ANTLR
query parser (before that, I went through a phase of hacking different query
parsers, but it was always obvious to me that this cannot work for anything
but simple cases)

>
> Also, Roman, are you suggesting that I can have an indexed document titled
> "hubble telescope" and as long as I separate multi-word synonyms with the
> null character \0 in the synonyms.txt file the query expansion will just
> work? if so, that would suffice for our needs.. can you elaborate or will
> the query parser still foil the system. I ask because I've seen instances
>

First, a bit of explanation of how indexing/tokenization operates:

input text: "hubble space telescope is in the space"

let's say we are tokenizing on empty space and we use stopwords; this is what
gets indexed:

hubble
space
telescope
space

these tokens can have different positions, but let's ignore that for a
moment - the first three are adjacent

> where I can use the admin analysis tool against a custom field type to
> expand a multi-word synonym where it appears it's expanding the terms
> properly but when I run a search against it using the actual handler, it
> doesn't behave the same way and the debugQuery shows that indeed it split my
> term and did not expand it.
>

this is because the Solr analysis tool is seeing the whole input as one
string, "hubble space telescope", WHILST the standard query parser first
tokenizes, then builds the query *out of every token* - so it is seeing 3
tokens instead of 1 big token, and for the example input above it builds the
following query

field:hubble field:space field:telescope field:space

HOWEVER, when you send a phrase query, it arrives as one token - the synonym
filter will see it, recognize it as a multi-token synonym, and expand it

BUT, the standard behaviour is to insert the new token into the position of
the first token, so you will get a phrase query

"(hubble | HST) space telescope space"

So really, the problem of multi-token synonym expansion is in essence a
problem of the query parser - it must know how to harvest tokens, expand
them, and build a proper query - in this case, HST [one token] spans 3
original tokens, so the parser must be smart enough to build:

"hubble space telescope space" OR "HST in the space"

So, the synonym expansion part is standard FST, already in the Lucene/Solr
core. The parser that can handle these cases (and not just them, but also
many others) is also inside Lucene - it is called 'flexible' and was
contributed by IBM a few years back. But so far it has been a sleeping
beauty.

I haven't seen the LucidWorks parser, but from the description it seems it
does a much better job than the standard parser (if, when you do a quoted
phrase search for "hubble space telescope in the space", the result is:
"hubble space telescope space" OR "HST in the space", you can be reasonably
sure it does everything - well, to be 100% sure: "HST in the space" should
also produce the same query; but that's a much longer discussion about
index-time XOR query-time analysis)

roman

>
> Jack,
>
> Is there a link where I can read more about the LucidWorks search parser and
> how we can perchance tie into that so I can test to see if it yields better
> results?
>
> Thanks again for the help and suggestions. As an organization, we've learned
> much of solr since we started in 4.1 (especially with the cloud). The devs
> are doing phenomenal work and my query is really meant more as confirmation
> that I'm taking the correct approach than to beg for a specific feature :)
>
> --Dave
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078675.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
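
PS: to make the three behaviours above concrete, here is a toy model in plain Python. It is only a sketch and is not Lucene/Solr code: the stopword list, the synonym table and all function names (analyze, per_token_query, naive_phrase, span_aware_phrase) are made up for illustration. It assumes whitespace tokenization with lowercasing, and it drops stopwords from every alternative, so the second phrase comes out as "HST space" rather than "HST in the space".

```python
# Toy model of the behaviours described in this thread; NOT Lucene/Solr code.
# STOPWORDS, SYNONYMS and every function name here are illustrative assumptions.

STOPWORDS = {"is", "in", "the"}

# Multi-token phrase -> single-token synonym.
SYNONYMS = {("hubble", "space", "telescope"): "HST"}


def analyze(text):
    """Lowercase, split on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def find_matches(tokens):
    """Yield (start, length, synonym) for each synonym phrase found in tokens."""
    for phrase, synonym in SYNONYMS.items():
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == phrase:
                yield i, n, synonym


def per_token_query(text, field="field"):
    """Parser that splits first: every token becomes its own clause,
    so the multi-token synonym is never recognized."""
    return " ".join(f"{field}:{t}" for t in analyze(text))


def naive_phrase(text):
    """Phrase query where the synonym is stacked on the first token only."""
    tokens = analyze(text)
    out = list(tokens)
    for start, _length, synonym in find_matches(tokens):
        out[start] = f"({tokens[start]} | {synonym})"
    return '"' + " ".join(out) + '"'


def span_aware_phrase(text):
    """Phrase query where the synonym replaces the whole span it covers."""
    tokens = analyze(text)
    alternatives = [tokens]
    for start, length, synonym in find_matches(tokens):
        alternatives.append(tokens[:start] + [synonym] + tokens[start + length:])
    return " OR ".join('"' + " ".join(alt) + '"' for alt in alternatives)


text = "hubble space telescope is in the space"
print(per_token_query(text))    # field:hubble field:space field:telescope field:space
print(naive_phrase(text))       # "(hubble | HST) space telescope space"
print(span_aware_phrase(text))  # "hubble space telescope space" OR "HST space"
```

The point of the last function is only that the parser has to know how many original tokens the synonym covers; how a real parser tracks that (positions, position lengths) is a separate matter and is not modelled here.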