OK Jack, I'll switch to MS Office... hahaha

Many thanks for your interest and help, and for the bug report in JIRA.

Best,

Sebastián Ramírez


On Fri, May 10, 2013 at 5:48 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> I filed SOLR-4809 - "OpenOffice document body is not indexed by
> SolrCell", including some test files.
>
> https://issues.apache.org/jira/browse/SOLR-4809
>
> Yeah, at this stage, switching to Microsoft Office seems like the best bet!
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Sebastián Ramírez
> Sent: Friday, May 10, 2013 6:33 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Tika not extracting content from ODT / ODS (open document /
> libreoffice) in Solr 4.2.1
>
> Many thanks Jack for your attention and effort on solving the problem.
>
> Best,
>
> Sebastián Ramírez
>
> On Fri, May 10, 2013 at 5:23 PM, Jack Krupansky <j...@basetechnology.com>
> wrote:
>
>> I downloaded the latest Apache OpenOffice 3.4.1 and it does in fact fail
>> to index the proper content, both for .ODP and .ODT files.
>>
>> If I do extractOnly=true&extractFormat=text, I see the extracted text
>> clearly in addition to the metadata.
>>
>> I tested on 4.3, and then tested on Solr 3.6.1 and it also exhibited the
>> problem. I just see spaces in both cases.
>>
>> But whether the problem is due to Solr or Tika is not apparent.
>>
>> In any case, a Jira is warranted.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Sebastián Ramírez
>> Sent: Friday, May 10, 2013 11:24 AM
>> To: solr-user@lucene.apache.org
>> Subject: Tika not extracting content from ODT / ODS (open document /
>> libreoffice) in Solr 4.2.1
>>
>> Hello everyone,
>>
>> I'm having a problem indexing content from "opendocument format" files,
>> the files created with OpenOffice and LibreOffice (odt, ods...).
>>
>> Tika is able to read the files but Solr is not indexing the content.
>>
>> It's not a problem of committing or something like that: after I post a
>> file it is indexed and all the metadata is indexed/stored, but the content
>> isn't there.
>>
>> - I modified the solrconfig.xml file to catch everything:
>>
>> <requestHandler name="/update/extract"...
>>
>> <!-- here is the interesting part -->
>>
>> <!-- <str name="uprefix">ignored_</str> -->
>> <str name="defaultField">all_txt</str>
>>
>> - Then I submitted the file to Solr:
>>
>> curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
>>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>>   --data-binary @test_ods.ods
>>
>> - Now when I do a search in Solr I get this result. There is something
>> in the "content", but that's not the actual content of the original file:
>>
>> <result name="response" numFound="1" start="0">
>>   <doc>
>>     <str name="id">newods</str>
>>     <arr name="all_txt">
>>       <str>1</str>
>>       <str>2013-05-03T10:02:10.58</str>
>>       <str>2013-05-03T10:02:50.54</str>
>>       <str>2013-05-03T10:02:50.54</str>
>>       <str>1</str>
>>       <str>2013-05-03T10:02:10.58</str>
>>       <str>1</str>
>>       <str>2013-05-03T10:02:50.54</str>
>>       <str>2013-05-03T10:02:50.54</str>
>>       <str>0</str>
>>       <str>P0D</str>
>>       <str>2013-05-03T10:02:10.58</str>
>>       <str>1</str>
>>       <str>0</str>
>>       <str>application/ods</str>
>>       <str>0</str>
>>       <str>7322</str>
>>       <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
>>       <str>2013-05-03T10:02:50.54</str>
>>     </arr>
>>     <date name="last_modified">2013-05-03T10:02:50Z</date>
>>     <arr name="content_type">
>>       <str>application/vnd.oasis.opendocument.spreadsheet</str>
>>     </arr>
>>     <arr name="content">
>>       <str> ??? Page ??? (???) 00/00/0000, 00:00:00 Page / </str>
>>     </arr>
>>     <long name="_version_">1434658995848609792</long>
>>   </doc>
>> </result>
>> </response>
>>
>> - I ask Solr to show me the extracted content from Tika doing this:
>>
>> curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
>>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>>   --data-binary @test_ods.ods
>>
>> - And I get the XHTML extracted from Tika, including the original file
>> contents and that final part that Solr is indeed indexing. So Tika is
>> able to read the file, but Solr is not indexing the real content; it
>> only indexes the rest:
>>
>> <body>
>>   <table>
>>     <tr>
>>       <td>
>>         <p>test</p>
>>       </td>
>>     </tr>
>>     <tr>
>>       <td>
>>         <p>de</p>
>>       </td>
>>     </tr>
>>     <tr>
>>       <td>
>>         <p>ods</p>
>>       </td>
>>     </tr>
>>   </table>
>>
>>   <p xmlns="http://www.w3.org/1999/xhtml">???</p>
>>   <p>Page</p>
>>   <p>??? (???)</p>
>>   <p>00/00/0000, 00:00:00</p>
>>   <p>Page / </p>
>> </body>
>>
>> Do any of you know how to fix/workaround this problem?
>>
>> Thanks!
>>
>> Sebastián Ramírez
>>
>> --
>> *----------------------------------------------------*
>>
>> *This e-mail transmission, including any attachments, is intended only for
>> the named recipient(s) and may contain information that is privileged,
>> confidential and/or exempt from disclosure under applicable law.
>> If you have received this transmission in error, or are not the named
>> recipient(s), please notify Senseta immediately by return e-mail and
>> permanently delete this transmission, including any attachments.*
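
A possible client-side workaround, sketched under assumptions that this thread does not confirm: since extractOnly=true does return the spreadsheet text, that text could be pulled out of the returned XHTML and indexed through an ordinary update request, instead of relying on /update/extract to fill the content field. The names BodyTextParser and body_text are hypothetical, and the sketch assumes the XHTML string has already been taken out of the Solr response. It covers only the text-extraction step, not the HTTP calls.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collect the text of every <p> element found inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "p" and self.in_body:
            # Start a new text chunk for this paragraph.
            self.in_p = True
            self.chunks.append("")

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the open chunk.
        if self.in_p:
            self.chunks[-1] += data


def body_text(xhtml):
    """Return the non-empty paragraph texts from Tika's XHTML output."""
    parser = BodyTextParser()
    parser.feed(xhtml)
    parser.close()
    return [chunk.strip() for chunk in parser.chunks if chunk.strip()]
```

The returned list could then be joined and sent as a regular field value (for example into all_txt). Note that it would still include the header/footer placeholders such as "Page", which may need filtering.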