: - Let's say I index "HIV-1" with <filter
: class="solr.WordDelimiterFilterFactory" generateWordParts="1"
: generateNumberParts="1" catenateWords="1" catenateNumbers="1"
: catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which
: after parsing by the above filter would yield HIV1 or HIV 1) also find
: documents which have HIV and the number "1" somewhere in the document,
: but not directly after HIV? If so, how should I fix this? I could boost
: score by proximity, but I'm doing a sort on date anyway, so I guess it
: would be pointless to do so.

A couple of things make your question really hard to answer ... first off, you can specify different analyzer chains for index time and query time -- when dealing with the WordDelim filter (or the synonym filter) this is frequently necessary -- so the answers to your questions really depend on whether you use WordDelim at both index time and query time (or, if you do use it in both cases, whether you configure it differently).

Have you by any chance played with the "Analysis" page on your Solr index?

http://localhost:8983/solr/admin/analysis.jsp?name=&verbose=on&highlight=on&qverbose=on&

...it makes it really easy to see exactly how your various fields will get parsed at index time and query time. I would also suggest you use the "debugQuery=on" option when doing some searches -- even if there aren't any documents in your index, that will help you see how your query is getting parsed and what Query structure QueryParser is building based on the tokens it gets from each of the Analyzers.

: - Somewhat related : Let's say I index "Polymyxin B". If I stopword
: single letters, would a phrase search ("Polymyxin B") still find the
: right documents (I don't think so, but still)? If not, I'll have to

depends on what the "right documents" are ... if you strip stopwords out both at index time and at query time then it will ultimately match exactly the same thing as a query on "Polymyxin", which i guess must be the "right documents", since no document will contain the term "B" in the index -- so what else could be right? :)

: index single letters; how do I prevent the same problem as in the first
: question (i.e., a search on Polymyxin B yielding documents with
: Polymyxin and B, but not close to one another).
:
: My thought is to parse the user query and rephrase it to do phrase
: searches on nearby terms containing single letters / numbers. If an user
: search for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
: ("1 hepatitis" AND hiv). Is it a sensible solution?

that's kind of a strange behavior for a search application to have ... you might just want to trust that your users will be smart, and if they find that 'HIV 1 hepatitis' is matching docs where "1" doesn't appear near "HIV" or "hepatitis" then they will start entering '"HIV 1" hepatitis' (or 'HIV "1 hepatitis"' if that's what they meant.)

-Hoss
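
For reference, the client-side rewrite described in the question could be sketched roughly as below. This is only an illustration in Python, not anything Solr provides: the names `is_short_token` and `rewrite_query` are hypothetical, the query is assumed to be plain whitespace-separated terms with no operators or quotes of its own, and "short" is assumed to mean a single letter or an all-digit token.

```python
import re


def is_short_token(token):
    """Assumption: 'short' means one letter or an all-digit token ("B", "1")."""
    return bool(re.fullmatch(r"[A-Za-z]|\d+", token))


def rewrite_query(query):
    """Hypothetical helper: glue each short token to a neighbour as a phrase."""
    tokens = query.split()
    alternatives = []
    for i, token in enumerate(tokens):
        if not is_short_token(token):
            continue
        # Pair the short token with its left neighbour, then its right one.
        for start in (i - 1, i):
            if start < 0 or start + 1 >= len(tokens):
                continue
            phrase = '"%s %s"' % (tokens[start], tokens[start + 1])
            rest = tokens[:start] + tokens[start + 2:]
            clause = " AND ".join([phrase] + rest)
            if clause not in alternatives:
                alternatives.append(clause)
    if not alternatives:
        # No short tokens: fall back to a plain conjunction of the terms.
        return " AND ".join(tokens)
    return " OR ".join("(%s)" % clause for clause in alternatives)


print(rewrite_query("HIV 1 hepatitis"))
print(rewrite_query("Polymyxin B"))
```

The number of alternatives grows with every short token in the query, and the rewritten string still goes through whatever query-time analyzer is configured, so it is worth checking the result with debugQuery=on before relying on it.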