Thanks Walter and Alex,

You are right, Walter. In fact, if I'm not wrong, Tika doesn't use an external parser for those formats as it does with MS Office files or PDFs; it uses Java's ZIP and XML libraries to parse those files directly. I guess doing the same myself would be my last resort. But I would much rather have Tika process my files than take on the overhead of building a fairly complicated program to extract their contents, when Tika could probably do that for me.
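For reference, that last resort could look roughly like the sketch below. It assumes the usual ODF layout (a ZIP archive with the document body in `content.xml`) and only collects `text:p` and `text:h` elements. The function name `extract_odf_text` is made up for illustration, and this is not how Tika itself does it; nested paragraphs (for example inside frames) would show up twice, so treat it as a starting point only.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the ODF "text" elements (text:p, text:h, text:span, ...).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_odf_text(source):
    """Return paragraph and heading text from an ODF file, one item per line.

    `source` is a file path or a binary file-like object (hypothetical helper).
    """
    # An ODF document is a ZIP archive; the body lives in content.xml.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    wanted = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}
    lines = []
    for element in root.iter():
        if element.tag in wanted:
            # itertext() also picks up text inside nested spans and links.
            text = "".join(element.itertext()).strip()
            if text:
                lines.append(text)
    return "\n".join(lines)
```

Calling `extract_odf_text("report.odt")` (the file name is just an example) would return the text as a single string, which could then be sent to Solr as an ordinary field.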
I think that could be closely related, Alex. I don't know exactly what the "mapper" does, but what you describe seems quite similar. I am able to generate the XHTML from Tika with the original document content, but Solr doesn't index that content from the XHTML. So maybe it's a bug in Solr Cell / ExtractingRequestHandler / Tika, right?

Thanks,

Sebastián Ramírez

On Fri, May 10, 2013 at 1:59 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> On Fri, May 10, 2013 at 11:24 AM, Sebastián Ramírez
> <sebastian.rami...@senseta.com> wrote:
> > Hello everyone,
> >
> > I'm having a problem indexing content from "opendocument format" files.
> > The files created with OpenOffice and LibreOffice (odt, ods...).
>
> I wonder if it is connected to
> https://issues.apache.org/jira/browse/SOLR-4530 where the default Tika
> mapper actually keeps very little of the XHTML it gets. I fixed it for
> DIH in 4.3, but haven't looked at the CELL yet.
>
> Regards,
> Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working. (Anonymous - via GTD
> book)