Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1

Alexandre Rafalovitch Fri, 10 May 2013 17:13:36 -0700

If possible, I would try DIH with the flags described in the JIRA issue I
linked to, just in case.


Regards,
    Alex
 On 10 May 2013 19:53, "Sebastián Ramírez" <sebastian.rami...@senseta.com>
wrote:

> OK Jack, I'll switch to MS Office ...hahaha
>
> Many thanks for your interest and help... and the bug report in JIRA.
>
> Best,
>
> Sebastián Ramírez
>
>
> On Fri, May 10, 2013 at 5:48 PM, Jack Krupansky <j...@basetechnology.com>
> wrote:
>
> > I filed SOLR-4809 - "OpenOffice document body is not indexed by
> > SolrCell", including some test files.
> >
> > https://issues.apache.org/jira/browse/SOLR-4809
> >
> > Yeah, at this stage, switching to Microsoft Office seems like the best
> > bet!
> >
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Sebastián Ramírez
> > Sent: Friday, May 10, 2013 6:33 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tika not extracting content from ODT / ODS (open document /
> > libreoffice) in Solr 4.2.1
> >
> >
> > Many thanks Jack for your attention and effort on solving the problem.
> >
> > Best,
> >
> > Sebastián Ramírez
> >
> >
> > On Fri, May 10, 2013 at 5:23 PM, Jack Krupansky <j...@basetechnology.com>
> > wrote:
> >
> >> I downloaded the latest Apache OpenOffice 3.4.1 and it does in fact fail
> >> to index the proper content, both for .ODP and .ODT files.
> >>
> >> If I do extractOnly=true&extractFormat=text, I see the extracted text
> >> clearly in addition to the metadata.
> >>
> >> I tested on 4.3, and then on Solr 3.6.1, which also exhibited the
> >> problem. I just see spaces in both cases.
> >>
> >> Whether the problem is due to Solr or Tika is not apparent.
> >>
> >> In any case, a Jira is warranted.
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> -----Original Message----- From: Sebastián Ramírez
> >> Sent: Friday, May 10, 2013 11:24 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Tika not extracting content from ODT / ODS (open document /
> >> libreoffice) in Solr 4.2.1
> >>
> >> Hello everyone,
> >>
> >> I'm having a problem indexing content from OpenDocument format files,
> >> i.e. the files created with OpenOffice and LibreOffice (odt, ods...).
> >>
> >> Tika is able to read the files, but Solr is not indexing the content.
> >>
> >> It's not a problem of committing or something like that: after I post a
> >> file it is indexed and all the metadata is indexed/stored, but the
> >> content isn't there.
> >>
> >>
> >>   - I modified the solrconfig.xml file to catch everything:
> >>
> >>
> >> <requestHandler name="/update/extract"...
> >>
> >>    <!-- here is the interesting part -->
> >>
> >>    <!-- <str name="uprefix">ignored_</str> -->
> >>    <str name="defaultField">all_txt</str>
> >>
> >>
> >>
> >>
> >>   - Then I submitted the file to Solr:
> >>
> >>
> >> curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
> >>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
> >>   --data-binary @test_ods.ods
> >>
> >>
> >>
> >>   - Now when I do a search in Solr I get this result. There is something
> >>   in the "content" field, but that's not the actual content of the
> >>   original file:
> >>
> >> <result name="response" numFound="1" start="0">
> >>  <doc>
> >>    <str name="id">newods</str>
> >>    <arr name="all_txt">
> >>      <str>1</str>
> >>      <str>2013-05-03T10:02:10.58</str>
> >>      <str>2013-05-03T10:02:50.54</str>
> >>      <str>2013-05-03T10:02:50.54</str>
> >>      <str>1</str>
> >>      <str>2013-05-03T10:02:10.58</str>
> >>      <str>1</str>
> >>      <str>2013-05-03T10:02:50.54</str>
> >>      <str>2013-05-03T10:02:50.54</str>
> >>      <str>0</str>
> >>      <str>P0D</str>
> >>      <str>2013-05-03T10:02:10.58</str>
> >>      <str>1</str>
> >>      <str>0</str>
> >>      <str>application/ods</str>
> >>      <str>0</str>
> >>      <str>7322</str>
> >>      <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
> >>      <str>2013-05-03T10:02:50.54</str>
> >>    </arr>
> >>    <date name="last_modified">2013-05-03T10:02:50Z</date>
> >>    <arr name="content_type">
> >>      <str>application/vnd.oasis.opendocument.spreadsheet</str>
> >>    </arr>
> >>    <arr name="content">
> >>      <str> ???  Page   ??? (???)  00/00/0000, 00:00:00  Page  /  </str>
> >>    </arr>
> >>    <long name="_version_">1434658995848609792</long>
> >>  </doc>
> >> </result>
> >> </response>
> >>
> >>
> >>   - I ask Solr to show me the extracted content from Tika doing this:
> >>
> >>
> >> curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
> >>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
> >>   --data-binary @test_ods.ods
> >>
> >>
> >>
> >>   - And I get the XHTML extracted by Tika, including the original file
> >>   contents and the final part that Solr is indeed indexing. So Tika is
> >>   able to read the file, but Solr is not indexing the real content; it
> >>   only indexes the rest:
> >>
> >> <body>
> >> <table>
> >> <tr>
> >>    <td>
> >>        <p>test</p>
> >>    </td>
> >> </tr>
> >> <tr>
> >>    <td>
> >>        <p>de</p>
> >>    </td>
> >> </tr>
> >> <tr>
> >>    <td>
> >>        <p>ods</p>
> >>    </td>
> >> </tr>
> >> </table>
> >>
> >> <p xmlns="http://www.w3.org/1999/xhtml">???</p>
> >> <p>Page</p>
> >> <p>??? (???)</p>
> >> <p>00/00/0000, 00:00:00</p>
> >> <p>Page / </p>
> >> </body>
> >>
> >> Do any of you know how to fix/workaround this problem?
> >>
> >> Thanks!
> >>
> >> Sebastián Ramírez
> >>
> >> --
> >> *----------------------------------------------------*
> >>
> >> *This e-mail transmission, including any attachments, is intended only
> >> for the named recipient(s) and may contain information that is privileged,
> >> confidential and/or exempt from disclosure under applicable law. If you
> >> have received this transmission in error, or are not the named
> >> recipient(s), please notify Senseta immediately by return e-mail and
> >> permanently delete this transmission, including any attachments.*
> >>
> >>
> >
>
>
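
The comparison described in the original report (text Tika returns with extractOnly=true versus what lands in the indexed "content" field) can be scripted. What follows is only a minimal sketch in Python, not anything posted in the thread: the helper names (extract_url, visible_text, missing_from_index) are hypothetical, the sample strings are copied from the quoted messages, and the substring check is a deliberately crude assumption that can give false matches on short words.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Handler URL as used in the curl commands quoted in the thread.
SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"

# Sample data copied from the thread: Tika's XHTML body and the indexed field.
TIKA_XHTML = """<body>
<table>
<tr><td><p>test</p></td></tr>
<tr><td><p>de</p></td></tr>
<tr><td><p>ods</p></td></tr>
</table>
<p xmlns="http://www.w3.org/1999/xhtml">???</p>
<p>Page</p>
<p>??? (???)</p>
<p>00/00/0000, 00:00:00</p>
<p>Page / </p>
</body>"""
INDEXED_CONTENT = " ???  Page   ??? (???)  00/00/0000, 00:00:00  Page  /  "


def extract_url(params):
    """Build an /update/extract URL from a dict of query parameters."""
    return SOLR_EXTRACT + "?" + urlencode(params)


def normalize(text):
    """Collapse runs of whitespace so spacing differences do not matter."""
    return " ".join(text.split())


class TextCollector(HTMLParser):
    """Collect the non-blank text nodes of an XHTML fragment, in order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = normalize(data)
        if text:
            self.chunks.append(text)


def visible_text(xhtml):
    """Return the list of text chunks found in the XHTML."""
    collector = TextCollector()
    collector.feed(xhtml)
    collector.close()
    return collector.chunks


def missing_from_index(xhtml, indexed_content):
    """Return extracted chunks that do not occur in the indexed field.

    Assumption: plain substring matching is good enough for a quick check.
    """
    indexed = normalize(indexed_content)
    return [chunk for chunk in visible_text(xhtml) if chunk not in indexed]


if __name__ == "__main__":
    print(extract_url({"extractOnly": "true", "extractFormat": "text"}))
    print(missing_from_index(TIKA_XHTML, INDEXED_CONTENT))
```

With the sample data from the thread, the chunks reported as missing are the three spreadsheet cells, while the header/footer placeholders are all found in the indexed field. That matches the symptom reported in SOLR-4809.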
