OK Jack, I'll switch to MS Office ...hahaha

Many thanks for your interest and help... and the bug report in JIRA.

Best,

Sebastián Ramírez


On Fri, May 10, 2013 at 5:48 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> I filed SOLR-4809 - "OpenOffice document body is not indexed by
> SolrCell", including some test files.
>
> https://issues.apache.org/jira/browse/SOLR-4809
>
> Yeah, at this stage, switching to Microsoft Office seems like the best bet!
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Sebastián Ramírez
> Sent: Friday, May 10, 2013 6:33 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Tika not extracting content from ODT / ODS (open document /
> libreoffice) in Solr 4.2.1
>
>
> Many thanks Jack for your attention and effort on solving the problem.
>
> Best,
>
> Sebastián Ramírez
>
>
> On Fri, May 10, 2013 at 5:23 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> I downloaded the latest Apache OpenOffice 3.4.1 and it does in fact fail
>> to index the proper content, both for .ODP and .ODT files.
>>
>> If I do extractOnly=true&extractFormat=text, I see the extracted text
>> clearly in addition to the metadata.
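A minimal Python sketch of the check described above, assuming the local Solr endpoint that appears in the curl commands further down this thread. The function names are hypothetical; `extractOnly` and `extractFormat` are the request parameters named in the message. Building the URL needs no server, but the optional POST helper only does something useful against a running Solr.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the same local Solr endpoint used in the curl commands in this thread.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def extract_only_url(base=SOLR_EXTRACT_URL, fmt="text"):
    """Build a URL that asks Solr Cell to return Tika's output without indexing."""
    params = {"extractOnly": "true", "extractFormat": fmt}
    return base + "?" + urlencode(params)


def fetch_extraction(path, content_type, base=SOLR_EXTRACT_URL):
    """POST a local file and return the raw response body (requires a running Solr)."""
    with open(path, "rb") as fh:
        req = Request(
            extract_only_url(base),
            data=fh.read(),
            headers={"Content-Type": content_type},
        )
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")


print(extract_only_url())
```

With a server available, something like `fetch_extraction("test_ods.ods", "application/vnd.oasis.opendocument.spreadsheet")` would return the extraction response for manual inspection.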
>>
>> I tested on 4.3, and then tested on Solr 3.6.1 and it also exhibited the
>> problem. I just see spaces in both cases.
>>
>> But whether the problem is due to Solr or Tika is not apparent.
>>
>> In any case, a Jira is warranted.
>>
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Sebastián Ramírez
>> Sent: Friday, May 10, 2013 11:24 AM
>> To: solr-user@lucene.apache.org
>> Subject: Tika not extracting content from ODT / ODS (open document /
>> libreoffice) in Solr 4.2.1
>>
>> Hello everyone,
>>
>> I'm having a problem indexing content from OpenDocument format files, i.e.
>> the files created with OpenOffice and LibreOffice (odt, ods...).
>>
>> Tika is able to read the files, but Solr is not indexing the content.
>>
>> It's not a commit problem or anything like that: after I post a file it is
>> indexed and all the metadata is indexed/stored, but the content isn't
>> there.
>>
>>
>>   - I modified the solrconfig.xml file to catch everything:
>>
>>
>> <requestHandler name="/update/extract"...
>>
>>    <!-- here is the interesting part -->
>>
>>    <!-- <str name="uprefix">ignored_</str> -->
>>    <str name="defaultField">all_txt</str>
>>
>>
>>
>>
>>   - Then I submitted the file to Solr:
>>
>>
>> curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
>>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>>   --data-binary @test_ods.ods
>>
>>
>>
>>   - Now when I search in Solr I get this result. There is something in
>>   the "content" field, but it is not the actual content of the original
>>   file:
>>
>> <result name="response" numFound="1" start="0">
>>  <doc>
>>    <str name="id">newods</str>
>>    <arr name="all_txt">
>>      <str>1</str>
>>      <str>2013-05-03T10:02:10.58</str>
>>      <str>2013-05-03T10:02:50.54</str>
>>      <str>2013-05-03T10:02:50.54</str>
>>      <str>1</str>
>>      <str>2013-05-03T10:02:10.58</str>
>>      <str>1</str>
>>      <str>2013-05-03T10:02:50.54</str>
>>      <str>2013-05-03T10:02:50.54</str>
>>      <str>0</str>
>>      <str>P0D</str>
>>      <str>2013-05-03T10:02:10.58</str>
>>      <str>1</str>
>>      <str>0</str>
>>      <str>application/ods</str>
>>      <str>0</str>
>>      <str>7322</str>
>>      <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
>>      <str>2013-05-03T10:02:50.54</str>
>>    </arr>
>>    <date name="last_modified">2013-05-03T10:02:50Z</date>
>>    <arr name="content_type">
>>      <str>application/vnd.oasis.opendocument.spreadsheet</str>
>>    </arr>
>>    <arr name="content">
>>      <str> ???  Page   ??? (???)  00/00/0000, 00:00:00  Page  /    </str>
>>    </arr>
>>    <long name="_version_">1434658995848609792</long>
>>  </doc>
>> </result>
>> </response>
>>
>>
>>   - I ask Solr to show me the extracted content from Tika doing this:
>>
>>
>> curl 'http://localhost:8983/solr/update/extract?extractOnly=true' \
>>   -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
>>   --data-binary @test_ods.ods
>>
>>
>>
>>   - And I get the XHTML extracted by Tika, which includes the original
>>   file contents as well as the final part that Solr is actually indexing.
>>   So Tika is able to read the file, but Solr is not indexing the real
>>   content; it only indexes the rest:
>>
>> <body>
>> <table>
>> <tr>
>>    <td>
>>        <p>test</p>
>>    </td>
>> </tr>
>> <tr>
>>    <td>
>>        <p>de</p>
>>    </td>
>> </tr>
>> <tr>
>>    <td>
>>        <p>ods</p>
>>    </td>
>> </tr>
>> </table>
>>
>> <p xmlns="http://www.w3.org/1999/xhtml">???</p>
>> <p>Page</p>
>> <p>??? (???)</p>
>> <p>00/00/0000, 00:00:00</p>
>> <p>Page / </p>
>> </body>
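To make the mismatch concrete, here is a small Python sketch that separates the table-cell text from the loose paragraphs in the XHTML body quoted above. The sample string is a trimmed copy of that output and the helper names are hypothetical. It is only an illustration of the symptom, not a statement about where the bug lives in Solr or Tika.

```python
import xml.etree.ElementTree as ET

# Sample body copied (and compacted) from the extractOnly output quoted above.
SAMPLE_XHTML = """<body>
<table>
<tr><td><p>test</p></td></tr>
<tr><td><p>de</p></td></tr>
<tr><td><p>ods</p></td></tr>
</table>
<p xmlns="http://www.w3.org/1999/xhtml">???</p>
<p>Page</p>
<p>??? (???)</p>
<p>00/00/0000, 00:00:00</p>
<p>Page / </p>
</body>"""


def local_name(tag):
    """Drop any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def split_cell_and_loose_text(xhtml):
    """Return (text inside table cells, text of paragraphs outside tables)."""
    root = ET.fromstring(xhtml)
    cells, loose = [], []
    for child in root:
        name = local_name(child.tag)
        if name == "table":
            # Collect every non-empty <p> nested anywhere inside the table.
            for el in child.iter():
                if local_name(el.tag) == "p" and el.text and el.text.strip():
                    cells.append(el.text.strip())
        elif name == "p" and child.text and child.text.strip():
            loose.append(child.text.strip())
    return cells, loose


cells, loose = split_cell_and_loose_text(SAMPLE_XHTML)
print("table cells:", cells)
print("outside table:", loose)
```

On this sample the table cells hold the real spreadsheet values (`test`, `de`, `ods`), while the paragraphs outside the table match the placeholder text that ended up in the indexed "content" field.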
>>
>> Do any of you know how to fix or work around this problem?
>>
>> Thanks!
>>
>> Sebastián Ramírez
>>
>> --
>> *----------------------------------------------------*
>> *This e-mail transmission, including any attachments, is intended only for
>> the named recipient(s) and may contain information that is privileged,
>> confidential and/or exempt from disclosure under applicable law. If you
>> have received this transmission in error, or are not the named
>> recipient(s), please notify Senseta immediately by return e-mail and
>> permanently delete this transmission, including any attachments.*
>>
>>
