Hello everyone, I'm having a problem indexing content from OpenDocument Format files, i.e. the files created with OpenOffice and LibreOffice (odt, ods...).
Tika is able to read the files, but Solr is not indexing the content. It's not a commit problem or anything like that: after I post a file it is indexed and all the metadata is indexed/stored, but the content isn't there.

- I modified the solrconfig.xml file to catch everything:

    <requestHandler name="/update/extract"...
      <!-- here is the interesting part -->
      <!-- <str name="uprefix">ignored_</str> -->
      <str name="defaultField">all_txt</str>

- Then I submitted the file to Solr:

    curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' --data-binary @test_ods.ods

- Now when I do a search in Solr I get this result. There is something in the "content" field, but that's not the actual content of the original file:

    <result name="response" numFound="1" start="0">
      <doc>
        <str name="id">newods</str>
        <arr name="all_txt">
          <str>1</str>
          <str>2013-05-03T10:02:10.58</str>
          <str>2013-05-03T10:02:50.54</str>
          <str>2013-05-03T10:02:50.54</str>
          <str>1</str>
          <str>2013-05-03T10:02:10.58</str>
          <str>1</str>
          <str>2013-05-03T10:02:50.54</str>
          <str>2013-05-03T10:02:50.54</str>
          <str>0</str>
          <str>P0D</str>
          <str>2013-05-03T10:02:10.58</str>
          <str>1</str>
          <str>0</str>
          <str>application/ods</str>
          <str>0</str>
          <str>7322</str>
          <str>LibreOffice/4.0.2.2$Windows_x86 LibreOffice_project/4c82dcdd6efcd48b1d8bba66bfe1989deee49c3</str>
          <str>2013-05-03T10:02:50.54</str>
        </arr>
        <date name="last_modified">2013-05-03T10:02:50Z</date>
        <arr name="content_type">
          <str>application/vnd.oasis.opendocument.spreadsheet</str>
        </arr>
        <arr name="content">
          <str> ??? Page ??? (???)
00/00/0000, 00:00:00 Page / </str>
        </arr>
        <long name="_version_">1434658995848609792</long></doc></result></response>

- I asked Solr to show me the content extracted by Tika:

    curl 'http://localhost:8983/solr/update/extract?extractOnly=true' -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' --data-binary @test_ods.ods

- And I get the XHTML extracted by Tika. It includes the original file contents (the table cells) as well as the trailing part that Solr is actually indexing. So Tika is able to read the file, but Solr is not indexing the real content, only the rest:

    <body>
      <table>
        <tr> <td> <p>test</p> </td> </tr>
        <tr> <td> <p>de</p> </td> </tr>
        <tr> <td> <p>ods</p> </td> </tr>
      </table>
      <p xmlns="http://www.w3.org/1999/xhtml">???</p>
      <p>Page</p>
      <p>??? (???)</p>
      <p>00/00/0000, 00:00:00</p>
      <p>Page / </p>
    </body>

Do any of you know how to fix or work around this problem?

Thanks!

Sebastián Ramírez
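P.S. To make the mismatch easier to see, here is a minimal Python sketch that compares the paragraphs in the extractOnly XHTML against the text that ended up in the "content" field. Both inputs are pasted from the outputs above. The function names (`local_name`, `paragraph_texts`, `missing_from_index`) are made up for this illustration, and the substring comparison is only a rough heuristic for diagnosis, not a description of what Solr does internally.

```python
import xml.etree.ElementTree as ET

# XHTML body copied from the extractOnly response quoted above.
EXTRACTED_BODY = """<body>
<table>
<tr> <td> <p>test</p> </td> </tr>
<tr> <td> <p>de</p> </td> </tr>
<tr> <td> <p>ods</p> </td> </tr>
</table>
<p xmlns="http://www.w3.org/1999/xhtml">???</p>
<p>Page</p> <p>??? (???)</p> <p>00/00/0000, 00:00:00</p> <p>Page / </p>
</body>"""

# Text of the "content" field as returned by the Solr query above.
INDEXED_CONTENT = " ??? Page ??? (???) 00/00/0000, 00:00:00 Page / "


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree adds to namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def normalize(text):
    """Collapse runs of whitespace so the comparison ignores layout."""
    return " ".join(text.split())


def paragraph_texts(extracted_body):
    """Return the non-empty text of every <p> element, in document order."""
    root = ET.fromstring(extracted_body)
    texts = []
    for element in root.iter():
        if local_name(element.tag) == "p":
            text = normalize(element.text or "")
            if text:
                texts.append(text)
    return texts


def missing_from_index(extracted_body, indexed_content):
    """List extracted paragraphs that do not appear in the indexed text."""
    indexed = normalize(indexed_content)
    return [t for t in paragraph_texts(extracted_body) if t not in indexed]


if __name__ == "__main__":
    print(missing_from_index(EXTRACTED_BODY, INDEXED_CONTENT))
    # prints ['test', 'de', 'ods']
```

The three paragraphs it reports are exactly the ones inside the `<table>` element, while everything outside the table made it into the index.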