Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1

Sebastián Ramírez Fri, 10 May 2013 08:25:39 -0700

Hello everyone,

I'm having a problem indexing content from OpenDocument Format files, i.e.
the files created with OpenOffice and LibreOffice (odt, ods...).


Tika is able to read the files, but Solr is not indexing the content.

It's not a commit problem or anything like that: after I post a file, the
document is indexed and all the metadata is indexed/stored, but the content
isn't there.


   - I modified the solrconfig.xml file to catch everything:

<requestHandler name="/update/extract"...

    <!-- here is the interesting part -->

    <!-- <str name="uprefix">ignored_</str> -->
    <str name="defaultField">all_txt</str>



   - Then I submitted the file to Solr:

curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
  -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
  --data-binary @test_ods.ods
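
For anyone who wants to reproduce this from a script rather than curl, here is a minimal Python sketch of the same request. It assumes the URL, parameters and Content-type shown in the curl command above; the function names build_extract_request and post_file are made up for illustration and are not part of Solr.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Values taken from the curl command above.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"
ODS_MIME = "application/vnd.oasis.opendocument.spreadsheet"


def build_extract_request(data: bytes, doc_id: str) -> Request:
    """Build (but do not send) the POST request for the extract handler."""
    query = urlencode({"commit": "true", "literal.id": doc_id})
    return Request(
        f"{SOLR_EXTRACT_URL}?{query}",
        data=data,  # a non-None body makes urllib issue a POST
        headers={"Content-type": ODS_MIME},
    )


def post_file(path: str, doc_id: str) -> str:
    """Hypothetical helper: send the file and return Solr's raw response."""
    with open(path, "rb") as fh:
        request = build_extract_request(fh.read(), doc_id)
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling post_file("test_ods.ods", "newods") against a running Solr should be equivalent to the curl command, as far as I can tell.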



   - Now when I search in Solr I get this result. There is something in
   the "content" field, but it is not the actual content of the original file:

<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">newods</str>
    <arr name="all_txt">
      <str>1</str>
      <str>2013-05-03T10:02:10.58</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>1</str>
      <str>2013-05-03T10:02:10.58</str>
      <str>1</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>2013-05-03T10:02:50.54</str>
      <str>0</str>
      <str>P0D</str>
      <str>2013-05-03T10:02:10.58</str>
      <str>1</str>
      <str>0</str>
      <str>application/ods</str>
      <str>0</str>
      <str>7322</str>
      <str>LibreOffice/4.0.2.2$Windows_x86
LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
      <str>2013-05-03T10:02:50.54</str>
    </arr>
    <date name="last_modified">2013-05-03T10:02:50Z</date>
    <arr name="content_type">
      <str>application/vnd.oasis.opendocument.spreadsheet</str>
    </arr>
    <arr name="content">
      <str> ???  Page   ??? (???)  00/00/0000, 00:00:00  Page  /    </str>
    </arr>
    <long name="_version_">1434658995848609792</long></doc></result></response>


   - I ask Solr to show me the content extracted by Tika with this command:

curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
  -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
  --data-binary @test_ods.ods



   - And I get the XHTML extracted by Tika. It includes the original file
   contents as well as the trailing part that Solr is indexing. So Tika can
   read the file, but Solr is not indexing the real content; it indexes only
   the rest:

<body>
<table>
<tr>
    <td>
        <p>test</p>
    </td>
</tr>
<tr>
    <td>
        <p>de</p>
    </td>
</tr>
<tr>
    <td>
        <p>ods</p>
    </td>
</tr>
</table>

<p xmlns="http://www.w3.org/1999/xhtml">???</p>
<p>Page</p>
<p>??? (???)</p>
<p>00/00/0000, 00:00:00</p>
<p>Page / </p>
</body>
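
To make the mismatch easy to check on other files, here is a rough Python sketch that lists which paragraphs of the extractOnly XHTML do not appear in the indexed "content" value. It is only a diagnostic aid under the assumption that both strings have already been fetched; the names ParagraphText, paragraphs and missing_from_index are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of every <p> element in an (X)HTML string."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def normalize(text):
    """Collapse runs of whitespace so spacing differences do not matter."""
    return " ".join(text.split())


def paragraphs(xhtml):
    parser = ParagraphText()
    parser.feed(xhtml)
    parser.close()
    return [normalize(p) for p in parser.paragraphs]


def missing_from_index(xhtml, indexed_content):
    """Return non-empty paragraphs of the XHTML absent from the indexed text."""
    indexed = normalize(indexed_content)
    return [p for p in paragraphs(xhtml) if p and p not in indexed]
```

The check is a plain substring match, so very short cell values can match by accident; it is meant for a quick look, not as a precise diff.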

Do any of you know how to fix/workaround this problem?

Thanks!

Sebastián Ramírez


Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1
