Thanks for the quick response.

Here are the fields from the schema:

 <field name="id" type="string" indexed="true" stored="true" required="true"/>
 <field name="original_name" type="text" indexed="true" stored="true"/>
 <field name="current" type="boolean" indexed="true" stored="true"/>
 <field name="file_association" type="sint" indexed="true" stored="true"/>
 <field name="uploaded_by_user" type="text" indexed="true" stored="true"/>
 <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>


I use text as the default content field for the ERH.

Here's the config of the ERH:

<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="ext.map.Last-Modified">last_modified</str>
    <bool name="ext.ignore.und.fl">true</bool>
  </lst>
</requestHandler>

Here's the output of a curl request with the file:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">650</int></lst><str name="afetest.docx">&lt;?xml version="1.0"
encoding="UTF-8"?&gt;
&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
  &lt;head&gt;
      &lt;title/&gt;
  &lt;/head&gt;
  &lt;body&gt;
      &lt;div class="package-entry"&gt;
&lt;h1&gt;[Content_Types].xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"/&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;_rels/.rels&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"&gt;&amp;lt;?xml version="1.0"
encoding="UTF-8" standalone="yes"?&amp;gt;&amp;#xd;
&amp;lt;Relationships
xmlns="http://schemas.openxmlformats.org/package/2006/relationships"&amp;gt;&amp;lt;Relationship
Id="rId4"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties"
Target="docProps/app.xml"/&amp;gt;&amp;lt;Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
Target="word/document.xml"/&amp;gt;&amp;lt;Relationship Id="rId2"
Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/thumbnail"
Target="docProps/thumbnail.jpeg"/&amp;gt;&amp;lt;Relationship Id="rId3"
Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
Target="docProps/core.xml"/&amp;gt;&amp;lt;/Relationships&amp;gt;&lt;/p&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/_rels/document.xml.rels&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"&gt;&amp;lt;?xml version="1.0"
encoding="UTF-8" standalone="yes"?&amp;gt;&amp;#xd;
&amp;lt;Relationships
xmlns="http://schemas.openxmlformats.org/package/2006/relationships"&amp;gt;&amp;lt;Relationship
Id="rId4"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable"
Target="fontTable.xml"/&amp;gt;&amp;lt;Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
Target="styles.xml"/&amp;gt;&amp;lt;Relationship Id="rId2"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings"
Target="settings.xml"/&amp;gt;&amp;lt;Relationship Id="rId3"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings"
Target="webSettings.xml"/&amp;gt;&amp;lt;Relationship Id="rId5"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme"
Target="theme/theme1.xml"/&amp;gt;&amp;lt;/Relationships&amp;gt;&lt;/p&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/document.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"&gt;Lorem ipsum dolor sit
amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis
aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
in culpa qui officia deserunt mollit anim id est laborum&lt;/p&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/theme/theme1.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"/&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;docProps/thumbnail.jpeg&lt;/h1&gt;
&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/settings.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"/&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/fontTable.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"/&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/webSettings.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"/&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;docProps/core.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"&gt;Joe
Doe12009-06-17T20:29:00Z2009-06-17T20:41:00Z&lt;/p&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;word/styles.xml&lt;/h1&gt;
&lt;p
          xmlns="http://www.w3.org/1999/xhtml"/&gt;

&lt;/div&gt;
&lt;div class="package-entry"&gt;
&lt;h1&gt;docProps/app.xml&lt;/h1&gt;
&lt;p xmlns="http://www.w3.org/1999/xhtml"&gt;Normal.dotm1100Microsoft
Macintosh Word011false10genfalse0falsefalse12.0000&lt;/p&gt;

&lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;
</str><lst name="afetest.docx_metadata"><arr
name="stream_source_info"><str>myfile</str></arr><arr
name="stream_name"><str>afetest.docx</str></arr><arr
name="stream_content_type"><str>application/octet-stream</str></arr><arr
name="Content-Type"><str>application/zip</str></arr><arr
name="stream_size"><str>38200</str></arr></lst>
</response>
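
To see which package entry the body text ends up in, a response of this shape could be unpacked with a short script. This is only a rough sketch, assuming the extracted XHTML sits escaped inside the <str> element named after the stream, as in the output above. The helper name package_entries and the shortened sample data are made up for illustration and are not part of Solr.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def package_entries(response_xml, stream_name):
    """Map each package entry name (the <h1> text) to its extracted text."""
    root = ET.fromstring(response_xml)
    # The extracted XHTML is stored, escaped, in <str name="<stream name>">.
    holder = root.find("./str[@name='%s']" % stream_name)
    xhtml = ET.fromstring(holder.text.strip().encode("utf-8"))
    entries = {}
    for div in xhtml.iter(XHTML_NS + "div"):
        if div.get("class") != "package-entry":
            continue
        title = div.find(XHTML_NS + "h1")
        body = div.find(XHTML_NS + "p")
        # Entries such as the thumbnail have no <p> at all.
        text = "".join(body.itertext()).strip() if body is not None else ""
        entries[title.text] = text
    return entries


# Shortened, hypothetical stand-in for the real response shown above.
inner = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>'
    '<div class="package-entry"><h1>word/document.xml</h1>'
    '<p xmlns="http://www.w3.org/1999/xhtml">Lorem ipsum dolor sit amet, '
    'anim id est laborum</p></div>'
    '<div class="package-entry"><h1>docProps/thumbnail.jpeg</h1></div>'
    '</body></html>'
)
sample = '<response><str name="afetest.docx">%s</str></response>' % escape(inner)

entries = package_entries(sample, "afetest.docx")
print("laborum" in entries["word/document.xml"])
```

Run against the full response, this would show that the word "laborum" is present in the extracted text of word/document.xml, so the extraction step itself does see the body.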

The query looks like this:

INFO: [] webapp=/solr path=/select
params={wt=standard&rows=10&start=0&explainOther=&hl.fl=&indent=on&q=text:laborum+AND+uploaded_by_user:joe&fl=*,score&qt=standard&version=2.2}
hits=0 status=0 QTime=3

Please note that searching solely by "uploaded_by_user:joe" will properly
return the document.
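
To narrow down which clause fails, each part of the query can be run on its own. Below is a minimal sketch of building the three request URLs. The base address http://localhost:8983/solr and the helper name select_url are assumptions for illustration; only the q, fl, rows and start parameters are taken from the log line above.

```python
from urllib.parse import urlencode

# Assumed address of the Solr instance; adjust to the real host and port.
SOLR_BASE = "http://localhost:8983/solr"


def select_url(query, rows=10):
    """Build a /select URL for the given query string."""
    params = {"q": query, "fl": "*,score", "rows": rows, "start": 0}
    return "%s/select?%s" % (SOLR_BASE, urlencode(params))


combined = select_url("text:laborum AND uploaded_by_user:joe")
user_only = select_url("uploaded_by_user:joe")
text_only = select_url("text:laborum")

for url in (combined, user_only, text_only):
    print(url)
```

If text_only also returns no hits while user_only does, the extracted content is probably not reaching the text field at index time.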

Thanks again.

-joe


Grant Ingersoll-6 wrote:
> 
> Can you share your schema for the fields you are indexing, the  
> configuration of the ExtractingRequestHandler and what your requests  
> look like?  Also, can you share what the output of the extract only  
> stuff looks like?
> 
> Also, can you post .doc files to the example per
> http://wiki.apache.org/solr/ExtractingRequestHandler 
>   ?  I was able to do that and search for the doc that I entered and  
> it was able to handle both .doc and .docx.
> 
> -Grant
> 
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/ExtractRequestHandler---not-properly-indexing-office-docs--tp24120125p24124928.html
Sent from the Solr - User mailing list archive at Nabble.com.
