Thanks Walter and Alex,

You are right, Walter. In fact, if I'm not mistaken, Tika doesn't use an
external parser for those formats as it does for MS Office files or PDFs;
it uses the Java ZIP and XML libraries to parse those files directly. I
guess that would be my last resort. But I would much rather get Tika to
process my files than build a fairly complicated program to extract their
contents, when Tika may be able to do that for me.
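
For reference, here is a minimal sketch of what that last-resort approach
could look like, assuming it means reading the ODF package directly. An
odt/ods file is a ZIP archive whose body lives in content.xml. The function
name extract_odf_text is hypothetical, the sketch uses only the Python
standard library, and it is not how Tika itself is implemented.

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF text namespace: paragraphs are text:p, headings are text:h.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_odf_text(path):
    """Return paragraph and heading text from an ODF file, one per line.

    Hypothetical helper for illustration only. It ignores metadata and
    styles, and text inside nested paragraphs (for example text boxes)
    may appear more than once.
    """
    with zipfile.ZipFile(path) as archive:
        # content.xml holds the document body of the ODF package.
        root = ET.fromstring(archive.read("content.xml"))

    wanted = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}
    lines = []
    for element in root.iter():
        if element.tag in wanted:
            # itertext() also collects text from child spans and links.
            text = "".join(element.itertext()).strip()
            if text:
                lines.append(text)
    return "\n".join(lines)
```

The returned string could then be sent to Solr as a plain text field, which
sidesteps Solr Cell entirely at the cost of maintaining the extraction code.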

I think that could be closely related, Alex. I don't know exactly what the
"mapper" does, but what you describe seems quite similar. I am able to
generate the XHTML with Tika, and it contains the original document
content, but Solr doesn't index that content from the XHTML.

So maybe it's a bug in Solr Cell / ExtractingRequestHandler / Tika, right?
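
One way to narrow this down might be to ask the ExtractingRequestHandler to
return what it extracted instead of indexing it, using its extractOnly and
extractFormat parameters. The sketch below only builds the request URL. The
base URL http://localhost:8983/solr and the /update/extract handler path are
assumptions taken from the stock example configuration, and
build_extract_only_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_extract_only_url(solr_base_url, extract_format="xml"):
    """Build a Solr Cell URL that returns Tika's output without indexing.

    solr_base_url is an assumption (for example http://localhost:8983/solr);
    the handler is assumed to be registered at /update/extract.
    """
    params = {
        "extractOnly": "true",            # return the extraction, do not index
        "extractFormat": extract_format,  # "xml" for XHTML, "text" for plain text
        "wt": "json",
        "indent": "true",
    }
    return solr_base_url.rstrip("/") + "/update/extract?" + urlencode(params)
```

Posting the odt file to that URL (for instance with curl and
-F "myfile=@document.odt") and comparing the response with the XHTML that
Tika produces on its own should show whether the content is lost inside
Solr Cell or earlier.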

Thanks,

Sebastián Ramírez


On Fri, May 10, 2013 at 1:59 PM, Alexandre Rafalovitch
<arafa...@gmail.com> wrote:

> On Fri, May 10, 2013 at 11:24 AM, Sebastián Ramírez
> <sebastian.rami...@senseta.com> wrote:
> > Hello everyone,
> >
> > I'm having a problem indexing content from "opendocument format" files.
> > The files created with OpenOffice and LibreOffice (odt, ods...).
>
>
> I wonder if it is connected to
> https://issues.apache.org/jira/browse/SOLR-4530 where the default Tika
> mapper actually keeps very little of the XHTML it gets. I fixed it for
> DIH in 4.3, but haven't looked at the CELL yet.
>
> Regards,
>    Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
>

