The last time I looked at those formats, they were a zip archive with the content in an XML file. I think it had an obvious name, like "content.xml".
So you should be able to extract that and look at it. Opening XML in a browser can be helpful, because it will flag any parse errors.

wunder

On May 10, 2013, at 11:34 AM, Sebastián Ramírez wrote:

> Thanks for your reply Jack!
>
> First: LOL
>
> Second: I'm using the latest version of LibreOffice, but with the
> "extractOnly" param in the Solr request it shows the content of the file, so
> Tika is able to read and extract the data but Solr isn't indexing
> that data.
>
> Third: I already did that with no luck. I tried
> "application/vnd.oasis.opendocument.spreadsheet", "application/ods" and
> "application/octet-stream" but always got the same result.
>
> Following the documentation for "ExtractingRequestHandler"
> <http://wiki.apache.org/solr/ExtractingRequestHandler#Concepts>,
> I see that Tika reads the file and feeds it to a "SAX ContentHandler", and
> "Solr then reacts to Tika's SAX events and creates the fields to index". I
> think that the problem might be somewhere in that process of feeding the
> "SAX ContentHandler" or in Solr's reaction to those "SAX events".
>
> Do you (or anyone else) know how one could configure / debug that "SAX
> ContentHandler"?
>
> Thanks,
>
> Sebastián Ramírez
>
> On Fri, May 10, 2013 at 10:57 AM, Jack Krupansky
> <j...@basetechnology.com> wrote:
>
>> Switching to Microsoft Office will probably solve your problem!
>>
>> Sorry, I couldn't resist.
>>
>> Are you using a really new or really old version of the ODT/ODS software?
>> I mean, maybe Tika doesn't have support for that version.
>>
>> Check the mime type that Tika generates - maybe you just need to override
>> it to force Tika to use the proper format.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Sebastián Ramírez
>> Sent: Friday, May 10, 2013 11:24 AM
>> To: solr-user@lucene.apache.org
>> Subject: Tika not extracting content from ODT / ODS (open document /
>> libreoffice) in Solr 4.2.1
>>
>> Hello everyone,
>>
>> I'm having a problem indexing content from "opendocument format" files,
>> that is, the files created with OpenOffice and LibreOffice (odt, ods...).
>>
>> Tika is able to read the files but Solr is not indexing the content.
>>
>> It's not a problem of committing or something like that; after I post a file
>> it is indexed and all the metadata is indexed/stored, but the content isn't
>> there.
>>
>> - I modified the solrconfig.xml file to catch everything:
>>
>> <requestHandler name="/update/extract"...
>>
>> <!-- here is the interesting part -->
>>
>> <!-- <str name="uprefix">ignored_</str> -->
>> <str name="defaultField">all_txt</str>
>>
>> - Then I submitted the file to Solr:
>>
>> curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
>>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>>   --data-binary @test_ods.ods
>>
>> - Now when I do a search in Solr I get this result; there is something
>> in the "content", but that's not the actual content of the original file:
>>
>> <result name="response" numFound="1" start="0">
>>   <doc>
>>     <str name="id">newods</str>
>>     <arr name="all_txt">
>>       <str>1</str>
>>       <str>2013-05-03T10:02:10.58</str>
>>       <str>2013-05-03T10:02:50.54</str>
>>       <str>2013-05-03T10:02:50.54</str>
>>       <str>1</str>
>>       <str>2013-05-03T10:02:10.58</str>
>>       <str>1</str>
>>       <str>2013-05-03T10:02:50.54</str>
>>       <str>2013-05-03T10:02:50.54</str>
>>       <str>0</str>
>>       <str>P0D</str>
>>       <str>2013-05-03T10:02:10.58</str>
>>       <str>1</str>
>>       <str>0</str>
>>       <str>application/ods</str>
>>       <str>0</str>
>>       <str>7322</str>
>>       <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
>>       <str>2013-05-03T10:02:50.54</str>
>>     </arr>
>>     <date name="last_modified">2013-05-03T10:02:50Z</date>
>>     <arr name="content_type">
>>       <str>application/vnd.oasis.opendocument.spreadsheet</str>
>>     </arr>
>>     <arr name="content">
>>       <str> ??? Page ??? (???) 00/00/0000, 00:00:00 Page / </str>
>>     </arr>
>>     <long name="_version_">1434658995848609792</long>
>>   </doc>
>> </result>
>> </response>
>>
>> - I ask Solr to show me the extracted content from Tika doing this:
>>
>> curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
>>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>>   --data-binary @test_ods.ods
>>
>> - And I get the XHTML extracted from Tika, including the original file
>> contents and that final part that Solr is indeed indexing. So Tika is
>> able to read the file but Solr is not indexing the real content; it
>> only indexes the rest:
>>
>> <body>
>>   <table>
>>     <tr>
>>       <td>
>>         <p>test</p>
>>       </td>
>>     </tr>
>>     <tr>
>>       <td>
>>         <p>de</p>
>>       </td>
>>     </tr>
>>     <tr>
>>       <td>
>>         <p>ods</p>
>>       </td>
>>     </tr>
>>   </table>
>>
>>   <p xmlns="http://www.w3.org/1999/xhtml">???</p>
>>   <p>Page</p>
>>   <p>??? (???)</p>
>>   <p>00/00/0000, 00:00:00</p>
>>   <p>Page / </p>
>> </body>
>>
>> Do any of you know how to fix/workaround this problem?
>>
>> Thanks!
>>
>> Sebastián Ramírez
>>
>> --
>> *----------------------------------------------------*
>> *This e-mail transmission, including any attachments, is intended only for
>> the named recipient(s) and may contain information that is privileged,
>> confidential and/or exempt from disclosure under applicable law.
>> If you have received this transmission in error, or are not the named
>> recipient(s), please notify Senseta immediately by return e-mail and
>> permanently delete this transmission, including any attachments.*

--
Walter Underwood
wun...@wunderwood.org
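
For anyone following along, here is a minimal sketch in Python of the check suggested at the top of this message: treat the ODS/ODT file as a zip archive, pull out "content.xml", and parse it so that malformed XML is reported. The function name `extract_odf_text` is hypothetical, and the member name "content.xml" is the one recalled above rather than something verified against this particular file. Only the standard-library `zipfile` and `xml.etree.ElementTree` modules are used.

```python
import zipfile
import xml.etree.ElementTree as ET


def extract_odf_text(path, member="content.xml"):
    """Return the non-empty text nodes of an OpenDocument file's XML member.

    `path` may be a file name or a binary file-like object.
    Raises zipfile.BadZipFile if the input is not a zip archive,
    KeyError if `member` is missing from the archive, and
    ET.ParseError if the XML is not well-formed.
    """
    with zipfile.ZipFile(path) as archive:
        # Assumption: the document body lives in a member named content.xml.
        with archive.open(member) as stream:
            root = ET.parse(stream).getroot()
    # itertext() yields element text and tail text in document order.
    return [text.strip() for text in root.itertext() if text.strip()]
```

Calling `extract_odf_text("test_ods.ods")` on the spreadsheet from this thread should list the cell text if the file is intact. A parse error here would point at the file itself; a clean result would point back at the Tika/Solr side.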