
ExtractRequestHandler - not properly indexing office docs?

2009-06-19 Thread cloax


Hi there, 

I've got a Solr instance running and am feeding it rich binary documents to
index from a Django application. The setup works just fine with PDFs, etc.,
but no matter what type of MS Word document (doc or docx) I feed it, I
can't get any results when searching for content-related queries.

I've curl'd with extract.only to verify that Solr (and Tika) could extract
the contents, and it happily enough spits back the extracted XHTML to me.
That content never seems to find its way into the ext.def.fl that I have
specified.

When I go and search for terms specific to content in those documents, I get
zero hits. However, I get hits on metadata-related queries (i.e., I store the
username of who uploaded it, etc.).

Is there some magical bit I forgot to flip?

cheers,
joe
-- 
View this message in context: 
http://www.nabble.com/ExtractRequestHandler---not-properly-indexing-office-docs--tp24120125p24120125.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: ExtractRequestHandler - not properly indexing office docs?

2009-06-20 Thread cloax


Thanks for the quick response.

Here are the fields from the schema:

 
 
 
 
 
 


I use 'text' as the default field for content extracted by the ERH.

Here's the config of the ERH:


  
last_modified
true
  


Here's the output of a curl request w/ the file:



<lst name="responseHeader"><int name="status">0</int><int name="QTime">650</int></lst>
  
  
  </head>
  <body>
  <div class="package-entry">
<h1>[Content_Types].xml</h1>
<p
  xmlns="http://www.w3.org/1999/xhtml"/>

</div>
<div class="package-entry">
<h1>_rels/.rels</h1>
<p
  xmlns="http://www.w3.org/1999/xhtml"><?xml version="1.0"
encoding="UTF-8" standalone="yes"?>
<Relationships
xmlns="http://schemas.openxmlformats.org/package/2006/relationships"><Relationship
Id="rId4"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties"
Target="docProps/app.xml"/><Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
Target="word/document.xml"/><Relationship Id="rId2"
Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/thumbnail"
Target="docProps/thumbnail.jpeg"/><Relationship Id="rId3"
Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
Target="docProps/core.xml"/></Relationships></p>

</div>
<div class="package-entry">
<h1>word/_rels/document.xml.rels</h1>
<p
  xmlns="http://www.w3.org/1999/xhtml"><?xml version="1.0"
encoding="UTF-8" standalone="yes"?>
<Relationships
xmlns="http://schemas.openxmlformats.org/package/2006/relationships"><Relationship
Id="rId4"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable"
Target="fontTable.xml"/><Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
Target="styles.xml"/><Relationship Id="rId2"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings"
Target="settings.xml"/><Relationship Id="rId3"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings"
Target="webSettings.xml"/><Relationship Id="rId5"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme"
Target="theme/theme1.xml"/></Relationships></p>

</div>
<div class="package-entry">
<h1>word/document.xml</h1>
<p
  xmlns="http://www.w3.org/1999/xhtml">Lorem ipsum dolor sit
amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis
aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
in culpa qui officia deserunt mollit anim id est laborum</p>

</div>
<div class="package-entry">
<h1>word/theme/theme1.xml</h1>
<p
  xmlns="http://www.w3.org/1999/xhtml"/>

</div>
<div class="package-entry">
<h1>docProps/thumbnail.jpeg</h1>
</div>
<div class="package-entry">
<h1>word/settings.xml</h1>
<p
  xmlns="http://www.w3.org/1999/xhtml"/>

</div>
<div class="package-entry">
<h1>word/fontTable.xml</h1>
<p
  xmlns="http://www.w3.org/1999/xhtml"/>

</div>
<div class="package-entry">
<h1>word/webSettings.xml</h1>
<p
  xmlns="http://www.w3.org/1999/xhtml"/>

</div>
<div class="package-entry">
<h1>docProps/core.xml</h1>
<p
  xmlns="http://www.w3.org/1999/xhtml">Joe
Doe12009-06-17T20:29:00Z2009-06-17T20:41:00Z</p>

</div>
<div class="package-entry">
<h1>word/styles.xml</h1>
<p
  xmlns="http://www.w3.org/1999/xhtml"/>

</div>
<div class="package-entry">
<h1>docProps/app.xml</h1>
<p xmlns="http://www.w3.org/1999/xhtml">Normal.dotm1100Microsoft
Macintosh Word011false10genfalse0falsefalse12.</p>

</div>
</body>
</html>
myfileafetest.docxapplication/octet-streamapplication/zip38200
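
For what it's worth, a quick way to see which package entry in output like the above actually carries body text is to walk the XHTML and print the text length per entry. This is only a sketch: SAMPLE is a hand-trimmed stand-in for the real response (not the actual output), and entry_text is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed stand-in for the extract-only XHTML (an assumption,
# not the real response): three package entries, only one with body text.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="package-entry"><h1>word/document.xml</h1>
<p>Lorem ipsum dolor sit amet</p></div>
<div class="package-entry"><h1>word/styles.xml</h1><p/></div>
<div class="package-entry"><h1>docProps/thumbnail.jpeg</h1></div>
</body>
</html>"""

XHTML = "{http://www.w3.org/1999/xhtml}"


def entry_text(xhtml):
    """Map each package entry name (its <h1>) to the text of its <p> children."""
    root = ET.fromstring(xhtml)
    result = {}
    for div in root.iter(XHTML + "div"):
        if div.get("class") != "package-entry":
            continue
        name = div.findtext(XHTML + "h1", default="")
        paragraphs = div.findall(XHTML + "p")
        text = " ".join("".join(p.itertext()).strip() for p in paragraphs)
        result[name] = text.strip()
    return result


entries = entry_text(SAMPLE)
for name, text in entries.items():
    print(f"{name}: {len(text)} chars")
```

In the real output the text sits in the p element under the word/document.xml entry, so that is the entry to watch.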


Query looks like:

INFO: [] webapp=/solr path=/select
params={wt=standard&rows=10&start=0&explainOther=&hl.fl=&indent=on&q=text:laborum+AND+uploaded_by_user:joe&fl=*,score&qt=standard&version=2.2}
hits=0 status=0 QTime=3

Please note that searching solely by "uploaded_by_user:joe" will properly
return the document.

Thanks again.

-joe


Grant Ingersoll-6 wrote:
> 
> Can you share your schema for the fields you are indexing, the  
> configuration of the ExtractingRequestHandler and what your requests  
> look like?  Also, can you share what the output of the extract only  
> stuff looks like?
> 
> Also, can you post .doc files to the example per
> http://wiki.apache.org/solr/ExtractingRequestHandler ?
> I was able to do that and search for the doc that I entered and  
> it was able to handle both .doc and .docx.
> 
> -Grant
> 
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com

Re: ExtractRequestHandler - not properly indexing office docs?

2009-06-22 Thread cloax


Yep, I've tried both of those and still no joy. Here are both my curl
statement and the resulting Solr log output.

curl "http://localhost:8983/solr/update/extract?ext.def.fl=text&ext.literal.id=1&ext.map.div=text&ext.capture=div" \
  -F "myfi...@dj_character.doc"
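
Since the documents come from a Django application, the same request URL could also be assembled in Python rather than hand-escaped in the shell. A minimal sketch, assuming nothing beyond the base URL and the ext.* parameter names in the curl command; extract_url is a hypothetical helper, not part of Solr or Django:

```python
from urllib.parse import urlencode


def extract_url(base, doc_id, default_field="text", capture="div"):
    """Build the /update/extract URL using the parameters from the curl call."""
    params = [
        ("ext.def.fl", default_field),          # default field for extracted content
        ("ext.literal.id", doc_id),             # literal value for the id field
        ("ext.map." + capture, default_field),  # map the captured element to the field
        ("ext.capture", capture),               # XHTML element to capture
    ]
    return base + "?" + urlencode(params)


url = extract_url("http://localhost:8983/solr/update/extract", "1")
print(url)
```

Letting urlencode join the parameters avoids the backslash-escaped ampersands that the shell needs.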

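Curl's output:

<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">317</int></lst>
</response>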


Solr log:
Jun 22, 2009 12:21:42 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract
params={ext.map.div=text&ext.def.fl=text&ext.capture=div&ext.literal.id=1}
status=0 QTime=544 
Jun 22, 2009 12:22:26 PM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {add=[1]} 0 317
Jun 22, 2009 12:22:26 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract
params={ext.map.div=text&ext.def.fl=text&ext.capture=div&ext.literal.id=1}
status=0 QTime=317 
Jun 22, 2009 12:22:37 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select
params={wt=standard&rows=10&start=0&explainOther=&hl.fl=&indent=on&q=kondel&fl=*,score&qt=standard&version=2.2}
hits=0 status=0 QTime=2

The submitted document has "kondel" in it numerous times, so Solr should
have a hit. Yet it returns nothing. I also made sure I committed, but that
didn't seem to help either.
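
On the commit point, one way to rule it out is to send an explicit commit right after the extract request. A rough sketch, assuming the stock XML update handler is mounted at /solr/update (the thread only shows /update/extract and /select); commit_request is a hypothetical helper that builds the request without sending it:

```python
import urllib.request


def commit_request(update_url="http://localhost:8983/solr/update"):
    """Build (but do not send) a POST carrying Solr's XML <commit/> message."""
    return urllib.request.Request(
        update_url,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


req = commit_request()
print(req.get_method(), req.full_url, req.data)
# To actually send it against a running instance:
#   urllib.request.urlopen(req)
```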


Grant Ingersoll-6 wrote:
> 
> Do you have a default field declared?  &ext.default.fl=
> Either that, or you need to explicitly capture the fields you are  
> interested in using &ext.capture=
> 
> You could add this to your curl statement to try out.
> 
> -Grant
> 




Re: ExtractRequestHandler - not properly indexing office docs?

2009-06-22 Thread cloax


I've tried 'text' (taken from the example config) and then tried creating a
new field called doc_content and using that. Neither has worked.
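
Note that the logged query above is q=kondel with no field prefix, so it is matched against whatever the schema's default search field is. Naming the field in the query takes that setting out of the picture. A small sketch, assuming the same host and port as the curl command; fielded_query_url is a hypothetical helper:

```python
from urllib.parse import urlencode

# Host, port and path taken from the thread's curl command and Solr log.
SELECT_URL = "http://localhost:8983/solr/select"


def fielded_query_url(field, term, rows=10):
    """Return a /select URL whose q names the field explicitly."""
    params = {"q": f"{field}:{term}", "fl": "*,score", "rows": rows}
    return SELECT_URL + "?" + urlencode(params)


# Try both candidate content fields mentioned in the thread.
for field in ("text", "doc_content"):
    print(fielded_query_url(field, "kondel"))
```

If text:kondel returns hits while plain kondel does not, the content is indexed and only the default search field differs.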
 

Grant Ingersoll-6 wrote:
> 
> What's your default search field?
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
> 
> 
> 
