Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1

Augusto Camarotti Wed, 04 Dec 2013 12:15:18 -0800

Hello everybody,
 
First of all, sorry for my bad English.
 
Here is an update on this bug: I may have found a solution for it, and
I would like your opinions on it.
I found out that Tika, when reading .odt files, returns more than one
document: the first one for content.xml, which holds the actual content
of the file, and the second one for styles.xml.
To test this, try removing styles.xml from an .odt file; Solr should
then parse its contents normally.
When Solr receives the second document (styles.xml), it erases
everything it has read before. In general, styles.xml doesn't contain
any text, so Solr ends up with just a few spaces.
I modified the function in 'SolrContentHandler.java' that erases the
content of the first document. Instead of erasing the previous content,
it now just appends a space, so the content of every document Tika
returns to Solr is accumulated.
I think this change should not break the cases that already worked, but
I need your opinion about this.
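
The styles.xml experiment described above can be tried with a small Java sketch like the one below. It copies a ZIP container (an .odt file is one) entry by entry and leaves out a single named entry. The class name OdtStripSketch and its method names are made up for illustration. The sketch does not touch META-INF/manifest.xml, which will still list styles.xml, and whether that matters to Tika is not something this thread establishes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper, not part of Solr or Tika. */
public class OdtStripSketch {

    /** Copies a ZIP container entry by entry, leaving out the entry named skip. */
    public static void copyWithout(InputStream in, OutputStream out, String skip)
            throws IOException {
        try (ZipInputStream zin = new ZipInputStream(in);
             ZipOutputStream zout = new ZipOutputStream(out)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.getName().equals(skip)) {
                    continue; // getNextEntry() skips the unread data for us
                }
                byte[] data = zin.readAllBytes(); // reads to the end of this entry
                ZipEntry copy = new ZipEntry(entry.getName());
                if (entry.getMethod() == ZipEntry.STORED) {
                    // Keep uncompressed entries (such as "mimetype") uncompressed;
                    // STORED entries need size and CRC set up front.
                    CRC32 crc = new CRC32();
                    crc.update(data);
                    copy.setMethod(ZipEntry.STORED);
                    copy.setSize(data.length);
                    copy.setCompressedSize(data.length);
                    copy.setCrc(crc.getValue());
                }
                zout.putNextEntry(copy);
                zout.write(data);
                zout.closeEntry();
            }
        }
    }

    /** Builds a tiny stand-in archive in memory, strips styles.xml, lists what is left. */
    public static List<String> demoEntryNames() {
        try {
            ByteArrayOutputStream src = new ByteArrayOutputStream();
            try (ZipOutputStream z = new ZipOutputStream(src)) {
                for (String name : new String[] {"mimetype", "content.xml", "styles.xml"}) {
                    z.putNextEntry(new ZipEntry(name));
                    z.write(name.getBytes(StandardCharsets.UTF_8)); // dummy payload
                    z.closeEntry();
                }
            }
            ByteArrayOutputStream dst = new ByteArrayOutputStream();
            copyWithout(new ByteArrayInputStream(src.toByteArray()), dst, "styles.xml");
            List<String> names = new ArrayList<>();
            try (ZipInputStream z =
                     new ZipInputStream(new ByteArrayInputStream(dst.toByteArray()))) {
                ZipEntry ze;
                while ((ze = z.getNextEntry()) != null) {
                    names.add(ze.getName());
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Usage: java OdtStripSketch input.odt output.odt
            try (InputStream in = Files.newInputStream(Path.of(args[0]));
                 OutputStream out = Files.newOutputStream(Path.of(args[1]))) {
                copyWithout(in, out, "styles.xml");
            }
        } else {
            System.out.println(demoEntryNames());
        }
    }
}
```

Run with two arguments, it writes a copy of the input file without styles.xml; that copy can then be posted to /update/extract to compare against the original.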
 
Here is the only modification I made to 'SolrContentHandler.java':
 
  @Override
  public void startDocument() throws SAXException {
    document.clear();
    //catchAllBuilder.setLength(0);
    //Augusto Camarotti - 28-11-2013
    //As Tika may parse more than one document in one file, I have to
    //append every document Tika parses for me, so I only append a
    //whitespace here and wait for new content each time. Otherwise,
    //Solr would keep just the last document of the file.
    catchAllBuilder.append(' ');
    for (StringBuilder builder : fieldBuilders.values()) {
      builder.setLength(0);
    }
    bldrStack.clear();
    bldrStack.add(catchAllBuilder);
  }
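
To make the effect of that one-line change easier to see, here is a toy model of the catch-all buffer. It is not the real SolrContentHandler: the class CatchAllSketch, its resetOnStart switch, and the replay method are invented for illustration, and it simply assumes, as described in this message, that startDocument() is called once for content.xml and once more for styles.xml.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Toy model of a catch-all text buffer; NOT the real SolrContentHandler. */
public class CatchAllSketch extends DefaultHandler {
    private final StringBuilder catchAll = new StringBuilder();
    private final boolean resetOnStart; // hypothetical switch: true = old behavior

    public CatchAllSketch(boolean resetOnStart) {
        this.resetOnStart = resetOnStart;
    }

    @Override
    public void startDocument() throws SAXException {
        if (resetOnStart) {
            catchAll.setLength(0); // old behavior: forget everything read so far
        } else {
            catchAll.append(' ');  // proposed behavior: keep it, add a separator
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        catchAll.append(ch, start, length);
    }

    public String text() {
        return catchAll.toString();
    }

    /** Replays the assumed event sequence: content.xml first, then styles.xml. */
    public static String replay(boolean resetOnStart) {
        CatchAllSketch handler = new CatchAllSketch(resetOnStart);
        try {
            // "body text" stands in for content.xml, two spaces for styles.xml
            for (String part : new String[] {"body text", "  "}) {
                handler.startDocument();
                handler.characters(part.toCharArray(), 0, part.length());
                handler.endDocument();
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        // With the reset, only the whitespace from the second document survives.
        System.out.println("reset : [" + replay(true) + "]");
        // With the append, the body text from the first document is still there.
        System.out.println("append: [" + replay(false) + "]");
    }
}
```

Note that the modified method above still resets the per-field builders and clears the document, so this sketch only covers the catch-all buffer.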
 
 
Regards, 
 
Augusto Camarotti


>>> Alexandre Rafalovitch <arafa...@gmail.com> 10/05/2013 21:13 >>>
I would try DIH with the flags as in the JIRA issue I linked to, if
possible. Just in case.

Regards,
    Alex
On 10 May 2013 19:53, "Sebastián Ramírez" <sebastian.rami...@senseta.com> wrote:

> OK Jack, I'll switch to MS Office ...hahaha
>
> Many thanks for your interest and help... and the bug report in JIRA.
>
> Best,
>
> Sebastián Ramírez
>
>
> On Fri, May 10, 2013 at 5:48 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>
> > I filed SOLR-4809 - "OpenOffice document body is not indexed by
> > SolrCell", including some test files.
> >
> > https://issues.apache.org/jira/browse/SOLR-4809
> >
> > Yeah, at this stage, switching to Microsoft Office seems like the best bet!
> >
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Sebastián Ramírez
> > Sent: Friday, May 10, 2013 6:33 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tika not extracting content from ODT / ODS (open document /
> > libreoffice) in Solr 4.2.1
> >
> >
> > Many thanks Jack for your attention and effort on solving the problem.
> >
> > Best,
> >
> > Sebastián Ramírez
> >
> >
> > On Fri, May 10, 2013 at 5:23 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> >
> >> I downloaded the latest Apache OpenOffice 3.4.1 and it does in fact fail
> >> to index the proper content, both for .ODP and .ODT files.
> >>
> >> If I do extractOnly=true&extractFormat=text, I see the extracted text
> >> clearly in addition to the metadata.
> >>
> >> I tested on 4.3, and then tested on Solr 3.6.1 and it also exhibited the
> >> problem. I just see spaces in both cases.
> >>
> >> But whether the problem is due to Solr or Tika, is not apparent.
> >>
> >> In any case, a Jira is warranted.
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> -----Original Message----- From: Sebastián Ramírez
> >> Sent: Friday, May 10, 2013 11:24 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Tika not extracting content from ODT / ODS (open document /
> >> libreoffice) in Solr 4.2.1
> >>
> >> Hello everyone,
> >>
> >> I'm having a problem indexing content from "opendocument format" files,
> >> that is, the files created with OpenOffice and LibreOffice (odt, ods...).
> >>
> >> Tika is able to read the files but Solr is not indexing the content.
> >>
> >> It's not a problem of committing or something like that: after I post a
> >> file it is indexed and all the metadata is indexed/stored, but the
> >> content isn't there.
> >>
> >>
> >>   - I modified the solrconfig.xml file to catch everything:
> >>
> >>
> >> <requestHandler name="/update/extract"...
> >>
> >>    <!-- here is the interesting part -->
> >>
> >>    <!-- <str name="uprefix">ignored_</str> -->
> >>    <str name="defaultField">all_txt</str>
> >>
> >>
> >>
> >>
> >>   - Then I submitted the file to Solr:
> >>
> >>
> >> curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
> >>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
> >>   --data-binary @test_ods.ods
> >>
> >>
> >>
> >>   - Now when I do a search in Solr I get this result. There is something
> >>   in the "content", but that's not the actual content of the original
> >>   file:
> >>
> >> <result name="response" numFound="1" start="0">
> >>  <doc>
> >>    <str name="id">newods</str>
> >>    <arr name="all_txt">
> >>      <str>1</str>
> >>      <str>2013-05-03T10:02:10.58</str>
> >>      <str>2013-05-03T10:02:50.54</str>
> >>      <str>2013-05-03T10:02:50.54</str>
> >>      <str>1</str>
> >>      <str>2013-05-03T10:02:10.58</str>
> >>      <str>1</str>
> >>      <str>2013-05-03T10:02:50.54</str>
> >>      <str>2013-05-03T10:02:50.54</str>
> >>      <str>0</str>
> >>      <str>P0D</str>
> >>      <str>2013-05-03T10:02:10.58</str>
> >>      <str>1</str>
> >>      <str>0</str>
> >>      <str>application/ods</str>
> >>      <str>0</str>
> >>      <str>7322</str>
> >>      <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
> >>      <str>2013-05-03T10:02:50.54</str>
> >>    </arr>
> >>    <date name="last_modified">2013-05-03T10:02:50Z</date>
> >>    <arr name="content_type">
> >>      <str>application/vnd.oasis.opendocument.spreadsheet</str>
> >>    </arr>
> >>    <arr name="content">
> >>      <str> ???  Page   ??? (???)  00/00/0000, 00:00:00  Page  /  </str>
> >>    </arr>
> >>    <long name="_version_">1434658995848609792</long></doc></result></response>
> >>
> >>
> >>   - I ask Solr to show me the content extracted by Tika by doing this:
> >>
> >>
> >> curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
> >>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
> >>   --data-binary @test_ods.ods
> >>
> >>
> >>
> >>   - And I get the XHTML extracted by Tika, including the original file
> >>   contents and that final part that Solr is indeed indexing. So Tika is
> >>   able to read the file, but Solr is not indexing the real content; it
> >>   only indexes the rest:
> >>
> >> <body>
> >> <table>
> >> <tr>
> >>    <td>
> >>        <p>test</p>
> >>    </td>
> >> </tr>
> >> <tr>
> >>    <td>
> >>        <p>de</p>
> >>    </td>
> >> </tr>
> >> <tr>
> >>    <td>
> >>        <p>ods</p>
> >>    </td>
> >> </tr>
> >> </table>
> >>
> >> <p xmlns="http://www.w3.org/1999/xhtml">???</p>
> >> <p>Page</p>
> >> <p>??? (???)</p>
> >> <p>00/00/0000, 00:00:00</p>
> >> <p>Page / </p>
> >> </body>
> >>
> >> Do any of you know how to fix/workaround this problem?
> >>
> >> Thanks!
> >>
> >> Sebastián Ramírez
> >>
> >> --
> >> *----------------------------------------------------*
> >>
> >> *This e-mail transmission, including any attachments, is intended only for
> >> the named recipient(s) and may contain information that is privileged,
> >> confidential and/or exempt from disclosure under applicable law. If you
> >> have received this transmission in error, or are not the named
> >> recipient(s), please notify Senseta immediately by return e-mail and
> >> permanently delete this transmission, including any attachments.*
> >>
> >>
