Actually, I think I found the issue.  Some of the PDFs weren't OCR'ed very well,
and the word "examined" was read as "~8 mined".

Vincent Vu Nguyen
Division of Science Quality and Translation
Office of the Associate Director for Science
Centers for Disease Control and Prevention (CDC)
404-498-6154
Century Bldg 2400
Atlanta, GA 30329 


-----Original Message-----
From: Nguyen, Vincent (CDC/OSELS/PHITPO) (CTR) 
Sent: Wednesday, September 15, 2010 12:35 PM
To: solr-user@lucene.apache.org; yo...@lucidimagination.com
Subject: RE: Solr returning irrelevant results

Sorry about that. I made it uppercase for emphasis; the word was just
"examined".



-----Original Message-----
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Wednesday, September 15, 2010 11:40 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr returning irrelevant results

On Wed, Sep 15, 2010 at 11:29 AM, Nguyen, Vincent (CDC/OSELS/PHITPO)
(CTR) <v...@cdc.gov> wrote:
> I was running a query on the word "mining" and got results from
> documents that have nothing to do with mining.  I got results with a
> score of 0.2997284 and less.  It looks like Solr was querying the
> dsm.fulltext field for "mine" as well, which is ok except there were no
> "mine" words in the document.  However, I did find words like
> "exaMINEd".

Was the "MINE" in "exaMINEd" actually uppercase, or did you do that
for emphasis?

If it was actually uppercased, one could argue it is a relevant
document since someone was trying to get "MINE" to stand out for some
reason.

Anyway, if you don't want that behavior, turn off splitting on case change:
set splitOnCaseChange="0" on the WordDelimiterFilterFactory.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
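
For illustration, a minimal sketch of what that might look like in schema.xml.
The field type name "text_fulltext", the tokenizer, and the other filter
attributes are assumptions rather than the poster's actual configuration; only
splitOnCaseChange="0" comes from the advice above.

```xml
<!-- Hypothetical field type: the name and tokenizer are assumptions for illustration. -->
<fieldType name="text_fulltext" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splitOnCaseChange="0": a token such as "exaMINEd" is no longer
         split at the point where lowercase changes to uppercase. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If the filter is applied at index time, existing documents would likely need
to be reindexed before the change shows up in search results.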

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8



