If possible, I would try DIH with the flags from the JIRA issue I linked to, just in case.
Regards,
   Alex

On 10 May 2013 19:53, "Sebastián Ramírez" <sebastian.rami...@senseta.com> wrote:

> OK Jack, I'll switch to MS Office ...hahaha
>
> Many thanks for your interest and help... and the bug report in JIRA.
>
> Best,
>
> Sebastián Ramírez
>
>
> On Fri, May 10, 2013 at 5:48 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>
> > I filed SOLR-4809 - "OpenOffice document body is not indexed by
> > SolrCell", including some test files.
> >
> > https://issues.apache.org/jira/browse/SOLR-4809
> >
> > Yeah, at this stage, switching to Microsoft Office seems like the best bet!
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Sebastián Ramírez
> > Sent: Friday, May 10, 2013 6:33 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tika not extracting content from ODT / ODS (open document /
> > libreoffice) in Solr 4.2.1
> >
> > Many thanks Jack for your attention and effort on solving the problem.
> >
> > Best,
> >
> > Sebastián Ramírez
> >
> > On Fri, May 10, 2013 at 5:23 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> >
> >> I downloaded the latest Apache OpenOffice 3.4.1 and it does in fact fail
> >> to index the proper content, both for .ODP and .ODT files.
> >>
> >> If I do extractOnly=true&extractFormat=text, I see the extracted text
> >> clearly in addition to the metadata.
> >>
> >> I tested on 4.3, and then tested on Solr 3.6.1 and it also exhibited the
> >> problem. I just see spaces in both cases.
> >>
> >> But whether the problem is due to Solr or Tika, is not apparent.
> >>
> >> In any case, a Jira is warranted.
> >>
> >> -- Jack Krupansky
> >>
> >> -----Original Message----- From: Sebastián Ramírez
> >> Sent: Friday, May 10, 2013 11:24 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Tika not extracting content from ODT / ODS (open document /
> >> libreoffice) in Solr 4.2.1
> >>
> >> Hello everyone,
> >>
> >> I'm having a problem indexing content from "opendocument format" files,
> >> the files created with OpenOffice and LibreOffice (odt, ods...).
> >>
> >> Tika is able to read the files but Solr is not indexing the content.
> >>
> >> It's not a problem of committing or something like that: after I post a
> >> file it is indexed and all the metadata is indexed/stored, but the
> >> content isn't there.
> >>
> >> - I modified the solrconfig.xml file to catch everything:
> >>
> >>   <requestHandler name="/update/extract"...
> >>
> >>   <!-- here is the interesting part -->
> >>
> >>   <!-- <str name="uprefix">ignored_</str> -->
> >>   <str name="defaultField">all_txt</str>
> >>
> >> - Then I submitted the file to Solr:
> >>
> >>   curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
> >>     -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
> >>     --data-binary @test_ods.ods
> >>
> >> - Now when I do a search in Solr I get this result. There is something
> >>   in the "content", but that's not the actual content of the original
> >>   file:
> >>
> >>   <result name="response" numFound="1" start="0">
> >>   <doc>
> >>   <str name="id">newods</str>
> >>   <arr name="all_txt">
> >>   <str>1</str>
> >>   <str>2013-05-03T10:02:10.58</str>
> >>   <str>2013-05-03T10:02:50.54</str>
> >>   <str>2013-05-03T10:02:50.54</str>
> >>   <str>1</str>
> >>   <str>2013-05-03T10:02:10.58</str>
> >>   <str>1</str>
> >>   <str>2013-05-03T10:02:50.54</str>
> >>   <str>2013-05-03T10:02:50.54</str>
> >>   <str>0</str>
> >>   <str>P0D</str>
> >>   <str>2013-05-03T10:02:10.58</str>
> >>   <str>1</str>
> >>   <str>0</str>
> >>   <str>application/ods</str>
> >>   <str>0</str>
> >>   <str>7322</str>
> >>   <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
> >>   <str>2013-05-03T10:02:50.54</str>
> >>   </arr>
> >>   <date name="last_modified">2013-05-03T10:02:50Z</date>
> >>   <arr name="content_type">
> >>   <str>application/vnd.oasis.opendocument.spreadsheet</str>
> >>   </arr>
> >>   <arr name="content">
> >>   <str> ??? Page ??? (???) 00/00/0000, 00:00:00 Page / </str>
> >>   </arr>
> >>   <long name="_version_">1434658995848609792</long></doc></result></response>
> >>
> >> - I ask Solr to show me the extracted content from Tika doing this:
> >>
> >>   curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
> >>     -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
> >>     --data-binary @test_ods.ods
> >>
> >> - And I get the XHTML extracted from Tika, including the original file
> >>   contents and that final part that Solr is indeed indexing. So Tika is
> >>   able to read the file, but Solr is not indexing the real content; it
> >>   only indexes the rest:
> >>
> >>   <body>
> >>   <table>
> >>   <tr>
> >>   <td>
> >>   <p>test</p>
> >>   </td>
> >>   </tr>
> >>   <tr>
> >>   <td>
> >>   <p>de</p>
> >>   </td>
> >>   </tr>
> >>   <tr>
> >>   <td>
> >>   <p>ods</p>
> >>   </td>
> >>   </tr>
> >>   </table>
> >>
> >>   <p xmlns="http://www.w3.org/1999/xhtml">???</p>
> >>   <p>Page</p>
> >>   <p>??? (???)</p>
> >>   <p>00/00/0000, 00:00:00</p>
> >>   <p>Page / </p>
> >>   </body>
> >>
> >> Do any of you know how to fix/workaround this problem?
> >>
> >> Thanks!
> >>
> >> Sebastián Ramírez
> >>
> >> --
> >> *----------------------------------------------------*
> >> *This e-mail transmission, including any attachments, is intended only for
> >> the named recipient(s) and may contain information that is privileged,
> >> confidential and/or exempt from disclosure under applicable law. If you
> >> have received this transmission in error, or are not the named
> >> recipient(s), please notify Senseta immediately by return e-mail and
> >> permanently delete this transmission, including any attachments.*
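
For anyone repeating the comparison described in the thread (what Solr indexes versus what Tika extracts), the two requests could be scripted roughly as follows. This is only a sketch and not something posted in the thread: the handler URL, the `literal.id` value and the file name are the ones quoted above, while the helper names `extract_url` and `post_file` are made up for illustration. The second URL adds `extractFormat=text`, the variant Jack mentions, so the Tika output comes back as plain text instead of XHTML.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Handler URL and MIME type as quoted in the thread; adjust for your setup.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"
ODS_MIME = "application/vnd.oasis.opendocument.spreadsheet"


def extract_url(params, base=SOLR_EXTRACT_URL):
    """Return the extract handler URL with the query parameters appended.

    Hypothetical helper, not part of Solr or of the original thread.
    """
    return base + "?" + urlencode(params)


def post_file(url, path, content_type=ODS_MIME):
    """POST the raw file bytes (like curl --data-binary) and return the body.

    Hypothetical helper; needs a running Solr instance to be useful.
    """
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(url, data=body, headers={"Content-Type": content_type})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Request 1: index the file, same parameters as the first curl command.
index_url = extract_url({"commit": "true", "literal.id": "newods"})

# Request 2: only run Tika and return the extracted text, nothing is indexed.
extract_only_url = extract_url({"extractOnly": "true", "extractFormat": "text"})
```

With Solr running, `print(post_file(extract_only_url, "test_ods.ods"))` would show what Tika extracts, which can then be compared with the `content` field returned by a search after posting the same file to `index_url`.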