I downloaded the latest Apache OpenOffice (3.4.1), and Solr does in fact fail to index the proper content for both .ODP and .ODT files.

If I do extractOnly=true&extractFormat=text, I see the extracted text clearly in addition to the metadata.
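
For anyone who wants to repeat that check from a script, here is a minimal sketch. It assumes Solr is at the same localhost:8983 address used in the original message below; the function names are made up for illustration and are not part of any Solr client library.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: same host, port, and handler path as in the original message.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_only_url(base_url=SOLR_EXTRACT_URL):
    """Return the extract handler URL that asks for plain text without indexing."""
    params = {"extractOnly": "true", "extractFormat": "text"}
    return base_url + "?" + urlencode(params)


def extract_text(path, content_type, base_url=SOLR_EXTRACT_URL):
    """POST the file to the extract handler and return the response body as a string."""
    with open(path, "rb") as f:
        data = f.read()
    request = Request(
        build_extract_only_url(base_url),
        data=data,
        headers={"Content-Type": content_type},
    )
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Hypothetical usage (needs a running Solr and a local test file):
# print(extract_text("test_ods.ods",
#                    "application/vnd.oasis.opendocument.spreadsheet"))
```

If the text shows up in that response but not in the index, the extraction step itself is working.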

I tested on Solr 4.3 and then on Solr 3.6.1, and both exhibited the problem. I just see spaces in both cases.

Whether the problem lies in Solr or in Tika is not apparent.

In any case, a Jira is warranted.

-- Jack Krupansky

-----Original Message-----
From: Sebastián Ramírez
Sent: Friday, May 10, 2013 11:24 AM
To: solr-user@lucene.apache.org
Subject: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1

Hello everyone,

I'm having a problem indexing content from "opendocument format" files,
that is, the files created with OpenOffice and LibreOffice (odt, ods...).

Tika is able to read the files, but Solr is not indexing the content.

It's not a problem of committing or anything like that. After I post a file
it is indexed, and all the metadata is indexed/stored, but the content isn't
there.


  - I modified the solrconfig.xml file to catch everything:

<requestHandler name="/update/extract"...

   <!-- here is the interesting part -->

   <!-- <str name="uprefix">ignored_</str> -->
   <str name="defaultField">all_txt</str>



  - Then I submitted the file to Solr:

curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
  -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
  --data-binary @test_ods.ods



  - Now when I do a search in Solr I get this result. There is something
  in the "content" field, but it's not the actual content of the original file:

<result name="response" numFound="1" start="0">
 <doc>
   <str name="id">newods</str>
   <arr name="all_txt">
     <str>1</str>
     <str>2013-05-03T10:02:10.58</str>
     <str>2013-05-03T10:02:50.54</str>
     <str>2013-05-03T10:02:50.54</str>
     <str>1</str>
     <str>2013-05-03T10:02:10.58</str>
     <str>1</str>
     <str>2013-05-03T10:02:50.54</str>
     <str>2013-05-03T10:02:50.54</str>
     <str>0</str>
     <str>P0D</str>
     <str>2013-05-03T10:02:10.58</str>
     <str>1</str>
     <str>0</str>
     <str>application/ods</str>
     <str>0</str>
     <str>7322</str>
     <str>LibreOffice/4.0.2.2$Windows_x86
LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
     <str>2013-05-03T10:02:50.54</str>
   </arr>
   <date name="last_modified">2013-05-03T10:02:50Z</date>
   <arr name="content_type">
     <str>application/vnd.oasis.opendocument.spreadsheet</str>
   </arr>
   <arr name="content">
     <str> ???  Page   ??? (???)  00/00/0000, 00:00:00  Page  /    </str>
   </arr>
<long name="_version_">1434658995848609792</long></doc></result></response>


  - I asked Solr to show me the content extracted by Tika with this request:

curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
  -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
  --data-binary @test_ods.ods



  - And I get the XHTML extracted by Tika. It includes the original file
  contents as well as the trailing part that Solr does index. So Tika is
  able to read the file, but Solr is not indexing the real content; it
  only indexes the rest:

<body>
<table>
<tr>
   <td>
       <p>test</p>
   </td>
</tr>
<tr>
   <td>
       <p>de</p>
   </td>
</tr>
<tr>
   <td>
       <p>ods</p>
   </td>
</tr>
</table>

<p xmlns="http://www.w3.org/1999/xhtml">???</p>
<p>Page</p>
<p>??? (???)</p>
<p>00/00/0000, 00:00:00</p>
<p>Page / </p>
</body>
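
To make the mismatch concrete, here is a rough sketch that collects the text nodes from the XHTML above and reports which of them do not appear in the indexed "content" value. The sample strings are copied from this message; the class and function names are only illustrative, and the substring comparison is a crude check, not how Solr itself matches text.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect non-empty text nodes, with whitespace collapsed."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


def text_chunks(xhtml):
    """Return the visible text nodes of an XHTML fragment, in document order."""
    parser = TextCollector()
    parser.feed(xhtml)
    parser.close()
    return parser.chunks


def missing_from_index(xhtml, indexed_content):
    """Return the text nodes that do not occur in the indexed content string."""
    normalized = " ".join(indexed_content.split())
    return [chunk for chunk in text_chunks(xhtml) if chunk not in normalized]


# Sample data copied from the extractOnly output and the search result above.
extracted = """<body>
<table>
<tr><td><p>test</p></td></tr>
<tr><td><p>de</p></td></tr>
<tr><td><p>ods</p></td></tr>
</table>
<p xmlns="http://www.w3.org/1999/xhtml">???</p>
<p>Page</p>
<p>??? (???)</p>
<p>00/00/0000, 00:00:00</p>
<p>Page / </p>
</body>"""
indexed = " ???  Page   ??? (???)  00/00/0000, 00:00:00  Page  /    "

print(missing_from_index(extracted, indexed))
```

With this sample, the three table cells are the only text nodes reported as missing, which matches what I see in the search result.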

Do any of you know how to fix or work around this problem?

Thanks!

Sebastián Ramírez
