Many thanks, Jack, for your attention and effort in solving the problem.

Best,
Sebastián Ramírez

On Fri, May 10, 2013 at 5:23 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> I downloaded the latest Apache OpenOffice 3.4.1 and it does in fact fail
> to index the proper content, both for .ODP and .ODT files.
>
> If I do extractOnly=true&extractFormat=text, I see the extracted text
> clearly in addition to the metadata.
>
> I tested on 4.3, and then tested on Solr 3.6.1 and it also exhibited the
> problem. I just see spaces in both cases.
>
> But whether the problem is due to Solr or Tika is not apparent.
>
> In any case, a Jira is warranted.
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Sebastián Ramírez
> Sent: Friday, May 10, 2013 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: Tika not extracting content from ODT / ODS (open document /
> libreoffice) in Solr 4.2.1
>
> Hello everyone,
>
> I'm having a problem indexing content from "opendocument format" files,
> that is, the files created with OpenOffice and LibreOffice (odt, ods...).
>
> Tika is able to read the files but Solr is not indexing the content.
>
> It's not a problem of committing or something like that: after I post a file
> it is indexed and all the metadata is indexed/stored, but the content isn't
> there.
>
>
> - I modified the solrconfig.xml file to catch everything:
>
>
> <requestHandler name="/update/extract"...
>
>   <!-- here is the interesting part -->
>
>   <!-- <str name="uprefix">ignored_</str> -->
>   <str name="defaultField">all_txt</str>
>
>
>
> - Then I submitted the file to Solr:
>
>
> curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>   --data-binary @test_ods.ods
>
>
>
> - Now when I do a search in Solr I get this result, there is something
>   in the "content", but that's not the actual content of the original file:
>
> <result name="response" numFound="1" start="0">
>   <doc>
>     <str name="id">newods</str>
>     <arr name="all_txt">
>       <str>1</str>
>       <str>2013-05-03T10:02:10.58</str>
>       <str>2013-05-03T10:02:50.54</str>
>       <str>2013-05-03T10:02:50.54</str>
>       <str>1</str>
>       <str>2013-05-03T10:02:10.58</str>
>       <str>1</str>
>       <str>2013-05-03T10:02:50.54</str>
>       <str>2013-05-03T10:02:50.54</str>
>       <str>0</str>
>       <str>P0D</str>
>       <str>2013-05-03T10:02:10.58</str>
>       <str>1</str>
>       <str>0</str>
>       <str>application/ods</str>
>       <str>0</str>
>       <str>7322</str>
>       <str>LibreOffice/4.0.2.2$Windows_x86
>       LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
>       <str>2013-05-03T10:02:50.54</str>
>     </arr>
>     <date name="last_modified">2013-05-03T10:02:50Z</date>
>     <arr name="content_type">
>       <str>application/vnd.oasis.opendocument.spreadsheet</str>
>     </arr>
>     <arr name="content">
>       <str> ??? Page ??? (???)
>       00/00/0000, 00:00:00 Page / </str>
>     </arr>
>     <long name="_version_">1434658995848609792</long></doc></result></response>
>
>
> - I ask Solr to show me the extracted content from Tika doing this:
>
>
> curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>   --data-binary @test_ods.ods
>
>
>
> - And I get the XHTML extracted from Tika, including the original file
>   contents and that final part that Solr is indeed indexing. So Tika is
>   able to read the file, but Solr is not indexing the real content; it
>   only indexes the rest:
>
> <body>
>   <table>
>     <tr>
>       <td>
>         <p>test</p>
>       </td>
>     </tr>
>     <tr>
>       <td>
>         <p>de</p>
>       </td>
>     </tr>
>     <tr>
>       <td>
>         <p>ods</p>
>       </td>
>     </tr>
>   </table>
>
>   <p xmlns="http://www.w3.org/1999/xhtml">???</p>
>   <p>Page</p>
>   <p>??? (???)</p>
>   <p>00/00/0000, 00:00:00</p>
>   <p>Page / </p>
> </body>
>
> Do any of you know how to fix or work around this problem?
>
> Thanks!
>
> Sebastián Ramírez
>
> --
> *----------------------------------------------------*
> *This e-mail transmission, including any attachments, is intended only for
> the named recipient(s) and may contain information that is privileged,
> confidential and/or exempt from disclosure under applicable law. If you
> have received this transmission in error, or are not the named
> recipient(s), please notify Senseta immediately by return e-mail and
> permanently delete this transmission, including any attachments.*
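
For anyone reproducing the symptom described in this thread, the comparison between what Tika extracts and what ends up in the indexed "content" field can be sketched in a few lines of Python. This is only an illustrative sketch, not something posted in the thread: the names BodyText, body_text, SAMPLE_XHTML and INDEXED_CONTENT are hypothetical, the sample strings are copied from the output quoted above, and the substring check is deliberately crude. In practice the XHTML would come from the extractOnly=true response and the indexed text from a normal query.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect the non-empty text nodes found inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_body and text:
            self.chunks.append(text)


def body_text(xhtml):
    """Return the list of text chunks inside the XHTML body."""
    parser = BodyText()
    parser.feed(xhtml)
    parser.close()
    return parser.chunks


# Body of the extractOnly=true response, as quoted in the thread.
SAMPLE_XHTML = """<body>
<table>
<tr><td><p>test</p></td></tr>
<tr><td><p>de</p></td></tr>
<tr><td><p>ods</p></td></tr>
</table>
<p xmlns="http://www.w3.org/1999/xhtml">???</p>
<p>Page</p>
<p>??? (???)</p>
<p>00/00/0000, 00:00:00</p>
<p>Page / </p>
</body>"""

# Value of the indexed "content" field, as quoted in the thread.
INDEXED_CONTENT = " ??? Page ??? (???) 00/00/0000, 00:00:00 Page / "

# Normalize whitespace, then list the chunks Tika extracted that are
# absent from the indexed field (simple substring test).
indexed = " ".join(INDEXED_CONTENT.split())
missing = [chunk for chunk in body_text(SAMPLE_XHTML) if chunk not in indexed]
print(missing)
```

With the sample data above, the chunks reported as missing are the three spreadsheet cell values, which matches the observation that only the trailing header/footer text reaches the index.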