As I said, it is not a problem in the Tika library ;) I have tried with Tika 1.5 jars and it gives the same results.

Guido Medina <guido.med...@temetra.com> wrote on 23/04/2014 16:15:11:

> From: Guido Medina <guido.med...@temetra.com>
> To: solr-user@lucene.apache.org
> Date: 23/04/2014 16:15
> Subject: Re: Problem indexing email attachments
>
> We massage solr.war ourselves and put in our own updated jars, maybe
> this helps:
>
> http://www.apache.org/dist/tika/CHANGES-1.5.txt
>
> We are using Tika 1.5 inside Solr with POI 3.10-Final, etc...
>
> Guido.
>
> On 23/04/14 14:38, olivier.mass...@real.lu wrote:
> > Hello,
> >
> > I'm trying to index email files with Solr (4.7.2).
> >
> > The files have the extension .eml (message/rfc822).
> >
> > The mail body is correctly indexed, but attachments are not indexed if they
> > are not .txt files.
> >
> > If the attachments are .txt files it works, but if the attachments are .pdf or
> > .docx files they are not indexed.
> >
> > I checked the extracted text by calling:
> >
> > curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true&extractOnly=true&extractFormat=text" -F "myfile=@Test1.eml"
> >
> > The returned extracted text does not contain the content of the
> > attachments if they are not .txt files.
> >
> > It is not a problem with the Apache Tika library not being able to process
> > attachments, because running the standalone Apache Tika app by calling:
> >
> > java -jar tika-app-1.4.jar -t Test1.eml
> >
> > on my eml files correctly displays the attachments' text.
> >
> > Maybe it is a problem with how Tika is called by Solr?
> >
> > Is there something to modify in the default configuration?
> >
> > Thanks for any help ;)
> >
> > Olivier
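
For anyone reproducing this: the extract-only request in the quoted message has a long query string that is easy to break when it is copied out of a mail. Below is a minimal sketch, assuming Python 3 is available, that rebuilds the same URL and curl arguments from their parts. The helper names `build_extract_url` and `build_curl_command` are hypothetical (they are not part of Solr or Tika); the host, port, parameters and file name are simply the ones from the quoted call.

```python
from urllib.parse import urlencode

# Assumption: Solr is reachable at the base URL used in the quoted curl call.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, base_url=SOLR_EXTRACT_URL):
    """Return the extract-only URL with the parameters from the quoted call."""
    params = {
        "literal.id": doc_id,     # id assigned to the document
        "commit": "true",
        "extractOnly": "true",    # return the extracted text instead of indexing
        "extractFormat": "text",  # plain text rather than XML
    }
    return base_url + "?" + urlencode(params)


def build_curl_command(doc_id, eml_path):
    """Return the curl argument list (hypothetical helper, nothing is executed)."""
    return ["curl", build_extract_url(doc_id), "-F", "myfile=@" + eml_path]


if __name__ == "__main__":
    # Prints the arguments; quote the URL when pasting it into a shell,
    # because it contains '&'.
    print(build_curl_command("doc1", "Test1.eml"))
```

The list returned by `build_curl_command` can be passed to `subprocess.run` unchanged, which avoids shell quoting problems with the `&` characters in the URL.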