libreoffice) in Solr 4.2.1

Walter Underwood Fri, 10 May 2013 11:38:59 -0700

The last time I looked at those formats, they were a zip archive with the 
content in an XML file. I think it had an obvious name, like "content.xml".


So you should be able to extract that and look at it. Opening XML in a browser 
can be helpful, because it will flag any parse errors.
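
If a browser is not handy, the same check can be scripted. This is only a minimal sketch in Python, assuming the file really is a plain zip archive and that the member is named "content.xml" as I remember it. The function names read_content_xml and check_well_formed are made up for illustration.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET


def read_content_xml(source, member="content.xml"):
    """Return the raw bytes of `member` from an OpenDocument (zip) file.

    `source` may be a file path or a binary file-like object.
    The member name is an assumption; list the archive if it differs.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.read(member)


def check_well_formed(xml_bytes):
    """Return (True, None) if the XML parses, else (False, error message)."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return False, str(exc)
    return True, None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_ods.py test_ods.ods
    ok, err = check_well_formed(read_content_xml(sys.argv[1]))
    print("well-formed" if ok else "parse error: " + err)
```

A parse error reported here would point at the file itself rather than at Solr. If it parses cleanly, the problem is more likely further along, in how the extracted content gets mapped to fields.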

wunder

On May 10, 2013, at 11:34 AM, Sebastián Ramírez wrote:

> Thanks for your reply Jack!
> 
> First: LOL
> 
> Second: I'm using the latest version of libreoffice, but with the
> "extractOnly" param in the Solr request it shows the content of the file, so
> Tika is able to read and extract the data but Solr isn't indexing
> that data.
> 
> Third: I already did that with no luck, I tried
> "application/vnd.oasis.opendocument.spreadsheet", "application/ods" and
> "application/octet-stream" but always got the same result.
> 
> Following the documentation for
> "ExtractingRequestHandler<http://wiki.apache.org/solr/ExtractingRequestHandler#Concepts>"
> I see that Tika reads the file and feeds it to a "SAX ContentHandler", and
> "Solr then reacts to Tika's SAX events and creates the fields to index". I
> think that the problem might be somewhere in that process of feeding the
> "SAX ContentHandler" or the reaction of Solr to those "SAX events".
> 
> Do you (or anyone else) know how one could configure / debug that "SAX
> ContentHandler"?
> 
> 
> Thanks,
> 
> Sebastián Ramírez
> 
> 
> 
> On Fri, May 10, 2013 at 10:57 AM, Jack Krupansky 
> <j...@basetechnology.com> wrote:
> 
>> Switching to Microsoft Office will probably solve your problem!
>> 
>> Sorry, I couldn't resist.
>> 
>> Are you using a really new or really old version of the ODT/ODS software?
>> I mean, maybe Tika doesn't have support for that version.
>> 
>> Check the mime type that Tika generates - maybe you just need to override
>> it to force Tika to use the proper format.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Sebastián Ramírez
>> Sent: Friday, May 10, 2013 11:24 AM
>> To: solr-user@lucene.apache.org
>> Subject: Tika not extracting content from ODT / ODS (open document /
>> libreoffice) in Solr 4.2.1
>> 
>> 
>> Hello everyone,
>> 
>> I'm having a problem indexing content from "opendocument format" files, i.e.
>> the files created with OpenOffice and LibreOffice (odt, ods...).
>> 
>> Tika is able to read the files but Solr is not indexing the content.
>> 
>> It's not a problem of committing or something like that; after I post a file
>> it is indexed and all the metadata is indexed/stored but the content isn't
>> there.
>> 
>> 
>>  - I modified the solrconfig.xml file to catch everything:
>> 
>> 
>> <requestHandler name="/update/extract"...
>> 
>>   <!-- here is the interesting part -->
>> 
>>   <!-- <str name="uprefix">ignored_</str> -->
>>   <str name="defaultField">all_txt</str>
>> 
>> 
>> 
>>  - Then I submitted the file to Solr:
>> 
>> 
>> curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
>>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>>   --data-binary @test_ods.ods
>> 
>> 
>> 
>>  - Now when I do a search in Solr I get this result. There is something
>>  in the "content", but that's not the actual content of the original file:
>> 
>> <result name="response" numFound="1" start="0">
>> <doc>
>>   <str name="id">newods</str>
>>   <arr name="all_txt">
>>     <str>1</str>
>>     <str>2013-05-03T10:02:10.58</str>
>>     <str>2013-05-03T10:02:50.54</str>
>>     <str>2013-05-03T10:02:50.54</str>
>>     <str>1</str>
>>     <str>2013-05-03T10:02:10.58</str>
>>     <str>1</str>
>>     <str>2013-05-03T10:02:50.54</str>
>>     <str>2013-05-03T10:02:50.54</str>
>>     <str>0</str>
>>     <str>P0D</str>
>>     <str>2013-05-03T10:02:10.58</str>
>>     <str>1</str>
>>     <str>0</str>
>>     <str>application/ods</str>
>>     <str>0</str>
>>     <str>7322</str>
>>     <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
>>     <str>2013-05-03T10:02:50.54</str>
>>   </arr>
>>   <date name="last_modified">2013-05-03T10:02:50Z</date>
>>   <arr name="content_type">
>>     <str>application/vnd.oasis.opendocument.spreadsheet</str>
>>   </arr>
>>   <arr name="content">
>>     <str> ???  Page   ??? (???)  00/00/0000, 00:00:00  Page  /    </str>
>>   </arr>
>>   <long name="_version_">1434658995848609792</long>
>> </doc></result></response>
>> 
>> 
>>  - I ask Solr to show me the extracted content from Tika doing this:
>> 
>> 
>> curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
>>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>>   --data-binary @test_ods.ods
>> 
>> 
>> 
>>  - And I get the XHTML extracted by Tika, including the original file
>>  contents and that final part that Solr is indeed indexing. So Tika is
>>  able to read the file but Solr is not indexing the real content; it
>>  only indexes the rest:
>> 
>> <body>
>> <table>
>> <tr>
>>   <td>
>>       <p>test</p>
>>   </td>
>> </tr>
>> <tr>
>>   <td>
>>       <p>de</p>
>>   </td>
>> </tr>
>> <tr>
>>   <td>
>>       <p>ods</p>
>>   </td>
>> </tr>
>> </table>
>> 
>> <p xmlns="http://www.w3.org/1999/xhtml">???</p>
>> <p>Page</p>
>> <p>??? (???)</p>
>> <p>00/00/0000, 00:00:00</p>
>> <p>Page / </p>
>> </body>
>> 
>> Do any of you know how to fix/workaround this problem?
>> 
>> Thanks!
>> 
>> Sebastián Ramírez
>> 
>> --
>> *----------------------------------------------------*
>> *This e-mail transmission, including any attachments, is intended only for
>> the named recipient(s) and may contain information that is privileged,
>> confidential and/or exempt from disclosure under applicable law. If you
>> have received this transmission in error, or are not the named
>> recipient(s), please notify Senseta immediately by return e-mail and
>> permanently delete this transmission, including any attachments.*
>> 

--
Walter Underwood
wun...@wunderwood.org
