Re: Index & search questions; special cases

Chris Hostetter Sun, 12 Nov 2006 20:37:44 -0800

: - Let's say I index "HIV-1" with <filter
: class="solr.WordDelimiterFilterFactory" generateWordParts="1"
: generateNumberParts="1" catenateWords="1" catenateNumbers="1"
: catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which
: after parsing by the above filter would yield HIV1 or HIV 1) also find
: documents which have HIV and the number "1" somewhere in the document,
: but not directly after HIV? If so, how should I fix this? I could boost
: score by proximity, but I'm doing a sort on date anyway, so I guess it
: would be pointless to do so.


A couple of things make your question really hard to answer ... first off,
you can specify different analyzer chains for index time and query time --
when dealing with the WordDelim filter (or the synonym filter) this is
frequently necessary -- so the answers to your questions really depend on
whether you use WordDelim at both index time and query time (or, if you do
use it in both cases, whether you configure it differently).
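
As a rough sketch of what separate chains look like in schema.xml: the
index-time filter attributes below are the ones from your question, while
the field type name "text_wd", the tokenizer, the lowercase filter and the
query-time attribute values are assumptions made up for illustration, not
a recommendation (depending on your Solr version the element may also be
spelled "fieldtype"):

```xml
<!-- Sketch only: "text_wd" and the query-time settings are assumptions -->
<fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
  <!-- Chain applied to document text when it is indexed -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Chain applied to the user's query text; here the catenate options are off -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

What tokens each chain actually produces for "HIV-1" is exactly the kind
of thing you should check for yourself rather than take on faith.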

Have you by any chance played with the "Analysis" page on your Solr index?
  
http://localhost:8983/solr/admin/analysis.jsp?name=&verbose=on&highlight=on&qverbose=on&;

...it makes it really easy to see exactly how your various fields will get
parsed at index time and query time.  I would also suggest you use the
"debugQuery=on" option when doing some searches -- even if there aren't
any documents in your index, that will help you see how your query is
getting parsed and what Query structure QueryParser is building based on
the tokens it gets from each of the Analyzers.

: - Somewhat related : Let's say I index "Polymyxin B". If I stopword
: single letters, would a phrase search ("Polymyxin B") still find the
: right documents (I don't think so, but still)? If not, I'll have to

Depends on what the "right documents" are ... if you strip stopwords out
both at index time and at query time then it will ultimately match exactly
the same thing as a query on "Polymyxin", which I guess must be the "right
documents", since no documents will contain the letter "B", so what else
could be right? :)
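
For illustration, "stripping stopwords at both index time and query time"
would mean a field type along these lines.  The name "text_stop", the
tokenizer choice, and a stopwords.txt file that lists the single letters
are all assumptions for the sake of the example:

```xml
<!-- Sketch only: "text_stop" and the contents of stopwords.txt are assumptions -->
<fieldType name="text_stop" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: single letters listed in stopwords.txt never reach the index -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
  <!-- Query time: the same letters are dropped from the user's query -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```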

: index single letters; how do I prevent the same problem as in the first
: question (i.e., a search on Polymyxin B yielding documents with
: Polymyxin and B, but not close to one another).
:
: My thought is to parse the user query and rephrase it to do phrase
: searches on nearby terms containing single letters / numbers. If a user
: searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
: ("1 hepatitis" AND hiv). Is it a sensible solution?

That's kind of a strange behavior for a search application to have ... you
might just want to trust that your users will be smart, and if they find
that 'HIV 1 hepatitis' is matching docs where "1" doesn't appear near
"HIV" or "hepatitis" then they will start entering '"HIV 1" hepatitis' (or
'HIV "1 hepatitis"' if that's what they meant.)




-Hoss
