Re: ExtractRequestHandler - not properly indexing office docs?

Grant Ingersoll Fri, 19 Jun 2009 19:01:16 -0700

Can you share your schema for the fields you are indexing, the configuration of the ExtractingRequestHandler, and what your requests look like? Also, can you share what the output of the extract-only stuff looks like?

Also, can you post .doc files to the example per http://wiki.apache.org/solr/ExtractingRequestHandler? I was able to do that and search for the doc that I entered, and it was able to handle both .doc and .docx.
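
For reference, the two kinds of request discussed in this thread (an extract-only check and a real indexing call with a default field) could be assembled roughly as below. This is only a sketch under assumptions: the host, port and /update/extract path are the stock example setup, the field name "content" is hypothetical, and the parameter spellings "ext.def.fl" and "ext.extract.only" are taken from (or guessed from) this thread, so check the wiki page for the names your build actually expects.

```python
from urllib.parse import urlencode

# Assumption: stock example server and handler path; adjust to your install.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"

# Assumption: parameter spellings as used in this thread ("ext.def.fl" is
# quoted by the poster; "extract.only" is assumed to abbreviate the name below).
DEFAULT_FIELD_PARAM = "ext.def.fl"
EXTRACT_ONLY_PARAM = "ext.extract.only"


def build_extract_url(default_field, extract_only=False, base_url=SOLR_EXTRACT_URL):
    """Build the URL to which the document would be POSTed.

    default_field: schema field meant to receive the extracted body text.
    extract_only:  if True, ask Solr to return the extracted content
                   instead of indexing it.
    """
    params = [(DEFAULT_FIELD_PARAM, default_field)]
    if extract_only:
        params.append((EXTRACT_ONLY_PARAM, "true"))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # "content" is a hypothetical field name, not one taken from the thread.
    print(build_extract_url("content"))
    print(build_extract_url("content", extract_only=True))
```

The file itself would go in the request body (for example as a multipart upload from curl). Comparing the two URLs side by side makes it easy to confirm that the indexing request names the same default field that the schema defines as indexed.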


-Grant

On Jun 19, 2009, at 7:20 PM, cloax wrote:

Hi there,
I've got a Solr instance running and am feeding it rich binary documents to index from a Django application. The setup works just fine with PDFs, etc., but no matter what type of MS Word document (doc and docx) I feed it, I can't get any results when searching for content-related queries.
I've curl'd with extract.only to verify that Solr (and Tika) could extract the contents, and it happily enough spits back the extracted XHTML to me. That content never seems to find its way into the ext.def.fl that I have specified.
When I go and search for terms specific to content in those documents, I get zero hits. However, I get hits on metadata-related queries (i.e., I store the username of who uploaded it, etc.).

Is there some magical bit I forgot to flip?

cheers,
joe
--
View this message in context: 
http://www.nabble.com/ExtractRequestHandler---not-properly-indexing-office-docs--tp24120125p24120125.html
Sent from the Solr - User mailing list archive at Nabble.com.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search
