Many thanks, Jack, for your attention and effort in solving the problem.

Best,

Sebastián Ramírez


On Fri, May 10, 2013 at 5:23 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> I downloaded the latest Apache OpenOffice 3.4.1 and it does in fact fail
> to index the proper content, both for .ODP and .ODT files.
>
> If I do extractOnly=true&extractFormat=text, I see the extracted text
> clearly in addition to the metadata.
>
> I tested on Solr 4.3, and then on Solr 3.6.1, which exhibited the same
> problem. I just see spaces in both cases.
>
> But whether the problem is due to Solr or Tika is not apparent.
>
> In any case, a Jira is warranted.
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Sebastián Ramírez
> Sent: Friday, May 10, 2013 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: Tika not extracting content from ODT / ODS (open document /
> libreoffice) in Solr 4.2.1
>
> Hello everyone,
>
> I'm having a problem indexing content from "opendocument format" files, that
> is, files created with OpenOffice and LibreOffice (odt, ods...).
>
> Tika is able to read the files, but Solr is not indexing the content.
>
> It's not a problem of committing or something like that: after I post a file,
> it is indexed and all the metadata is indexed/stored, but the content isn't
> there.
>
>
>   - I modified the solrconfig.xml file to catch everything:
>
>
> <requestHandler name="/update/extract"...
>
>    <!-- here is the interesting part -->
>
>    <!-- <str name="uprefix">ignored_</str> -->
>    <str name="defaultField">all_txt</str>
>
>
>
>   - Then I submitted the file to Solr:
>
>
> curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>   --data-binary @test_ods.ods
>
>
>
>   - Now when I do a search in Solr I get this result. There is something
>   in the "content" field, but it's not the actual content of the original file:
>
> <result name="response" numFound="1" start="0">
>  <doc>
>    <str name="id">newods</str>
>    <arr name="all_txt">
>      <str>1</str>
>      <str>2013-05-03T10:02:10.58</str>
>      <str>2013-05-03T10:02:50.54</str>
>      <str>2013-05-03T10:02:50.54</str>
>      <str>1</str>
>      <str>2013-05-03T10:02:10.58</str>
>      <str>1</str>
>      <str>2013-05-03T10:02:50.54</str>
>      <str>2013-05-03T10:02:50.54</str>
>      <str>0</str>
>      <str>P0D</str>
>      <str>2013-05-03T10:02:10.58</str>
>      <str>1</str>
>      <str>0</str>
>      <str>application/ods</str>
>      <str>0</str>
>      <str>7322</str>
>      <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
>      <str>2013-05-03T10:02:50.54</str>
>    </arr>
>    <date name="last_modified">2013-05-03T10:02:50Z</date>
>    <arr name="content_type">
>      <str>application/vnd.oasis.opendocument.spreadsheet</str>
>    </arr>
>    <arr name="content">
>      <str> ???  Page   ??? (???)  00/00/0000, 00:00:00  Page  /    </str>
>    </arr>
>    <long name="_version_">1434658995848609792</long>
>  </doc>
> </result>
> </response>
>
>
>   - I ask Solr to show me the extracted content from Tika doing this:
>
>
> curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>   --data-binary @test_ods.ods
>
>
>
>   - And I get the XHTML extracted by Tika, including the original file
>   contents and the final part that Solr is actually indexing. So Tika is
>   able to read the file, but Solr is not indexing the real content; it
>   only indexes the rest:
>
> <body>
> <table>
> <tr>
>    <td>
>        <p>test</p>
>    </td>
> </tr>
> <tr>
>    <td>
>        <p>de</p>
>    </td>
> </tr>
> <tr>
>    <td>
>        <p>ods</p>
>    </td>
> </tr>
> </table>
>
> <p xmlns="http://www.w3.org/1999/xhtml">???</p>
> <p>Page</p>
> <p>??? (???)</p>
> <p>00/00/0000, 00:00:00</p>
> <p>Page / </p>
> </body>
>
> Do any of you know how to fix/workaround this problem?
>
> Thanks!
>
> Sebastián Ramírez
>
> --
> *----------------------------------------------------*
> *This e-mail transmission, including any attachments, is intended only for
> the named recipient(s) and may contain information that is privileged,
> confidential and/or exempt from disclosure under applicable law. If you
> have received this transmission in error, or are not the named
> recipient(s), please notify Senseta immediately by return e-mail and
> permanently delete this transmission, including any attachments.*
>
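
For anyone who wants to repeat the check from this thread outside of curl, here is a minimal Python sketch of the extractOnly=true&extractFormat=text request. It assumes the same local Solr URL and ODS content type used in the curl commands above. The function names (build_extract_only_url, extract_only) are made up for illustration and are not part of Solr or Tika.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# URL and MIME type copied from the curl commands in this thread.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"
ODS_MIME = "application/vnd.oasis.opendocument.spreadsheet"


def build_extract_only_url(base_url=SOLR_EXTRACT_URL, extract_format="text"):
    """Return the extract-handler URL for a dry run (nothing gets indexed)."""
    params = {"extractOnly": "true", "extractFormat": extract_format}
    return base_url + "?" + urlencode(params)


def extract_only(path, content_type=ODS_MIME):
    """POST a file to the extract handler and return the raw response body.

    Hypothetical helper: it needs a running Solr instance, so it is only
    defined here and never called automatically.
    """
    with open(path, "rb") as handle:
        request = Request(
            build_extract_only_url(),
            data=handle.read(),  # a request with a body is sent as POST
            headers={"Content-Type": content_type},
        )
    with urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(build_extract_only_url())
```

Calling extract_only("test_ods.ods") against a running instance should return the same response as the curl command, so the extracted text can be compared with what ends up in the "content" field.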

