Can you share your schema for the fields you are indexing, the
configuration of the ExtractingRequestHandler, and what your requests
look like? Also, can you share the output you get back from the
extract-only request?
Also, can you post .doc files to the example setup described at
http://wiki.apache.org/solr/ExtractingRequestHandler? I was able to do
that and then search for the document I had indexed; it handled both
.doc and .docx.
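
For reference, here is a rough Python sketch of the kind of request I mean. It only assumes the stock example running on localhost:8983. The ext.def.fl name comes from your message; ext.literal.id and ext.extract.only are my assumption about the full parameter names, and they may differ in your Solr build, so check them against the wiki page. The helper names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed location of the stock Solr example; adjust for your install.
SOLR_BASE = "http://localhost:8983/solr"


def build_extract_url(doc_id, default_field="text", extract_only=False):
    """Build an /update/extract URL (parameter names are assumptions, see above)."""
    params = [
        ("ext.literal.id", doc_id),     # assumed name: sets the id field literally
        ("ext.def.fl", default_field),  # field meant to receive the extracted body
    ]
    if extract_only:
        # Assumed name: ask for the extracted XHTML back without indexing.
        params.append(("ext.extract.only", "true"))
    else:
        # Commit right away so the document is searchable immediately.
        params.append(("commit", "true"))
    return SOLR_BASE + "/update/extract?" + urlencode(params)


def post_document(path, url, content_type="application/msword"):
    """Send the raw file as the request body; needs a running Solr instance."""
    with open(path, "rb") as handle:
        request = Request(url, data=handle.read(),
                          headers={"Content-Type": content_type})
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling post_document("sample.doc", build_extract_url("doc1")) and then searching the default field for a word from the file should show whether the body text was indexed. Running the same request with extract_only=True makes it easy to compare what Tika extracted with what ended up in the index.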
-Grant
On Jun 19, 2009, at 7:20 PM, cloax wrote:
Hi there,
I've got a Solr instance running and am feeding it rich binary documents to
index from a Django application. The setup works just fine with PDFs, etc.,
but no matter what type of MS Word document (doc or docx) I feed it, I
can't get any results when searching for content-related queries.
I've curl'd with extract.only to verify that Solr (and Tika) could extract
the contents, and it happily spits the extracted XHTML back to me.
That content never seems to find its way into the ext.def.fl that I have
specified.
When I search for terms specific to the content of those documents, I get
zero hits. However, I do get hits on metadata-related queries (i.e., I store
the username of whoever uploaded it, etc.).
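
To make the comparison concrete, here is a rough Python sketch of the two kinds of queries involved: one against the field that should hold the extracted body, one against a metadata field. The field names text and username, the search terms, and the localhost URL are placeholders for illustration, not the actual schema; fl=id,score assumes the schema has an id field.

```python
from urllib.parse import urlencode

# Assumed select URL of a local Solr instance; adjust as needed.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_field_query(field, term, rows=10):
    """Build a select URL that searches a single field for a single term."""
    params = [
        ("q", f"{field}:{term}"),  # field-qualified query, e.g. text:quarterly
        ("fl", "id,score"),        # return only the id and score of each hit
        ("rows", str(rows)),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


# Hypothetical content query: a word known to appear in the Word document body.
content_url = build_field_query("text", "quarterly")

# Hypothetical metadata query: the stored uploader name.
metadata_url = build_field_query("username", "joe")
```

If the metadata URL returns hits and the content URL returns none for the same document, the extracted body is most likely not reaching the field being searched.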
Is there some magical bit I forgot to flip?
cheers,
joe
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search