ExtractRequestHandler - not properly indexing office docs?
Hi there,

I've got a Solr instance running and am feeding it rich binary documents to index from a Django application. The setup works fine with PDFs and similar formats, but no matter what type of MS Word document (doc or docx) I feed it, I get no results for content-related queries. I've curl'd with extract.only to verify that Solr (and Tika) can extract the contents, and it happily spits the extracted XHTML back to me. That content never seems to find its way into the field I have specified with ext.def.fl: when I search for terms specific to the content of those documents, I get zero hits. However, I do get hits on metadata-related queries (e.g. I store the username of whoever uploaded the document).

Is there some magical bit I forgot to flip?

cheers,
joe

--
View this message in context: http://www.nabble.com/ExtractRequestHandler---not-properly-indexing-office-docs--tp24120125p24120125.html
Sent from the Solr - User mailing list archive at Nabble.com.
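One way to double-check the extract-only step described above, sketched in Python since the documents come from a Django application: strip the tags from the XHTML that Solr returns and confirm that the term you plan to search for really is in the extracted text. The class name, function name, and sample string are made up for illustration, and nothing here talks to Solr.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the character data of an XHTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extracted_text_contains(xhtml, term):
    """Return True if `term` occurs (case-insensitively) in the text of `xhtml`."""
    collector = TextCollector()
    collector.feed(xhtml)
    collector.close()  # flush any text still buffered by the parser
    text = " ".join(collector.chunks)
    return term.lower() in text.lower()


# Hypothetical stand-in for the XHTML that an extract-only request returns.
sample = "<html><body><div><p>mollit anim id est laborum</p></div></body></html>"
print(extracted_text_contains(sample, "laborum"))
```

If the term is present here but a search still returns nothing, the problem is more likely in how the extracted text is mapped to an indexed field than in the extraction itself.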
Re: ExtractRequestHandler - not properly indexing office docs?
Thanks for the quick response.

Here are the fields from the schema:

I use text as the default content field for the ERH.

Here's the config of the ERH:

last_modified true

Here's the output of a curl request w/ the file:

0650
[Content_Types].xml
_rels/.rels
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"><Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties" Target="docProps/app.xml"/><Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/><Relationship Id="rId2" Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/thumbnail" Target="docProps/thumbnail.jpeg"/><Relationship Id="rId3" Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" Target="docProps/core.xml"/></Relationships>
word/_rels/document.xml.rels
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"><Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/><Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/><Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/><Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings" Target="webSettings.xml"/><Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/></Relationships>
word/document.xml
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum
word/theme/theme1.xml
docProps/thumbnail.jpeg
word/settings.xml
word/fontTable.xml
word/webSettings.xml
docProps/core.xml
Joe Doe12009-06-17T20:29:00Z2009-06-17T20:41:00Z
word/styles.xml
docProps/app.xml
Normal.dotm1100Microsoft Macintosh Word011false10genfalse0falsefalse12.
myfileafetest.docxapplication/octet-streamapplication/zip38200

Query looks like:

INFO: [] webapp=/solr path=/select params={wt=standard&rows=10&start=0&explainOther=&hl.fl=&indent=on&q=text:laborum+AND+uploaded_by_user:joe&fl=*,score&qt=standard&version=2.2} hits=0 status=0 QTime=3

Please note that searching solely by "uploaded_by_user:joe" will properly return the document.

Thanks again.
-joe

Grant Ingersoll-6 wrote:
>
> Can you share your schema for the fields you are indexing, the
> configuration of the ExtractingRequestHandler and what your requests
> look like? Also, can you share what the output of the extract only
> stuff looks like?
>
> Also, can you post .doc files to the example per
> http://wiki.apache.org/solr/ExtractingRequestHandler ?
> I was able to do that and search for the doc that I entered and
> it was able to handle both .doc and .docx.
>
> -Grant
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com
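To narrow down which clause of the query above fails, it can help to issue each clause on its own and then the combined query. A minimal Python sketch that only builds the URLs: the host and port are assumed to be the local default, the helper name is hypothetical, and the field names are the ones from this message.

```python
from urllib.parse import urlencode

# Assumption: Solr is reachable at the default local address.
SOLR_SELECT = "http://localhost:8983/solr/select"


def select_url(query, rows=10):
    """Build a /select URL for `query`, returning all stored fields plus the score."""
    params = [("q", query), ("fl", "*,score"), ("rows", rows), ("wt", "standard")]
    return SOLR_SELECT + "?" + urlencode(params)


clauses = ["text:laborum", "uploaded_by_user:joe"]

# One URL per clause, then one for the combined query from the log line.
urls = [select_url(clause) for clause in clauses]
urls.append(select_url(" AND ".join(clauses)))

for url in urls:
    print(url)
```

If the metadata clause returns the document while the content clause returns nothing, the document was indexed but the content field is empty or not searchable.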
Re: ExtractRequestHandler - not properly indexing office docs?
Yep, I've tried both of those and still no joy. Here are both my curl statement and the resulting Solr log output.

curl http://localhost:8983/solr/update/extract?ext.def.fl=text\&ext.literal.id=1\&ext.map.div=text\&ext.capture=div -F "myfi...@dj_character.doc"

Curl's output:

status=0, QTime=317

Solr log:

Jun 22, 2009 12:21:42 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={ext.map.div=text&ext.def.fl=text&ext.capture=div&ext.literal.id=1} status=0 QTime=544
Jun 22, 2009 12:22:26 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[1]} 0 317
Jun 22, 2009 12:22:26 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={ext.map.div=text&ext.def.fl=text&ext.capture=div&ext.literal.id=1} status=0 QTime=317
Jun 22, 2009 12:22:37 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={wt=standard&rows=10&start=0&explainOther=&hl.fl=&indent=on&q=kondel&fl=*,score&qt=standard&version=2.2} hits=0 status=0 QTime=2

The submitted document has "kondel" in it numerous times, so Solr should have a hit, yet it returns nothing. I also made sure I committed, but that didn't help either.

Grant Ingersoll-6 wrote:
>
> Do you have a default field declared? &ext.default.fl=
> Either that, or you need to explicitly capture the fields you are
> interested in using &ext.capture=
>
> You could add this to your curl statement to try out.
>
> -Grant
>
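When the same request is issued from application code rather than the shell, building the query string with urlencode avoids the backslash-escaping of `&` seen in the curl command. A rough Python sketch that only assembles the URL: the helper name is hypothetical, and the ext.* parameter names are copied verbatim from the curl command in this message and may differ in other Solr versions.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl command in this message.
EXTRACT_ENDPOINT = "http://localhost:8983/solr/update/extract"


def extract_url(doc_id, content_field="text"):
    """Build the extract URL with the same parameters as the curl command."""
    params = [
        ("ext.def.fl", content_field),   # default field for extracted content
        ("ext.literal.id", doc_id),      # literal value for the id field
        ("ext.map.div", content_field),  # map captured div content to the field
        ("ext.capture", "div"),          # capture div elements
    ]
    return EXTRACT_ENDPOINT + "?" + urlencode(params)


url = extract_url("1")
print(url)
```

The file itself would still have to be sent as a multipart upload, as the `-F` option does in the curl command; that part is left out here.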
Re: ExtractRequestHandler - not properly indexing office docs?
I've tried 'text' (taken from the example config) and then tried creating a new field called doc_content and using that. Neither has worked.

Grant Ingersoll-6 wrote:
>
> What's your default search field?
>
> On Jun 22, 2009, at 12:29 PM, cloax wrote:
>
>> Yep, I've tried both of those and still no joy. Here's both my curl
>> statement and the resulting Solr log output.
>>
>> curl http://localhost:8983/solr/update/extract?ext.def.fl=text\&ext.literal.id=1\&ext.map.div=text\&ext.capture=div -F "myfi...@dj_character.doc"
>>
>> Curls output:
>>
>> status=0, QTime=317
>>
>> Solr log:
>> Jun 22, 2009 12:21:42 PM org.apache.solr.core.SolrCore execute
>> INFO: [] webapp=/solr path=/update/extract params={ext.map.div=text&ext.def.fl=text&ext.capture=div&ext.literal.id=1} status=0 QTime=544
>> Jun 22, 2009 12:22:26 PM org.apache.solr.update.processor.LogUpdateProcessor finish
>> INFO: {add=[1]} 0 317
>> Jun 22, 2009 12:22:26 PM org.apache.solr.core.SolrCore execute
>> INFO: [] webapp=/solr path=/update/extract params={ext.map.div=text&ext.def.fl=text&ext.capture=div&ext.literal.id=1} status=0 QTime=317
>> Jun 22, 2009 12:22:37 PM org.apache.solr.core.SolrCore execute
>> INFO: [] webapp=/solr path=/select params={wt=standard&rows=10&start=0&explainOther=&hl.fl=&indent=on&q=kondel&fl=*,score&qt=standard&version=2.2} hits=0 status=0 QTime=2
>>
>> The submitted document has "kondel" in it numerous times, so Solr should
>> have a hit. Yet it returns nothing. I also made sure I committed, but that
>> didn't seem to help either.
>>
>> Grant Ingersoll-6 wrote:
>>>
>>> Do you have a default field declared? &ext.default.fl=
>>> Either that, or you need to explicitly capture the fields you are
>>> interested in using &ext.capture=
>>>
>>> You could add this to your curl statement to try out.
>>>
>>> -Grant