In particular, we massage solr.war and put in our own updated jars; maybe this helps:

http://www.apache.org/dist/tika/CHANGES-1.5.txt

We are using Tika 1.5 inside Solr with POI 3.10-Final, etc.

Guido.

On 23/04/14 14:38, olivier.mass...@real.lu wrote:
Hello,

I'm trying to index email files with Solr (4.7.2)

The files have the extension .eml (message/rfc822)

The mail body is correctly indexed but attachments are not indexed if they
are not .txt files.

If the attachments are .txt files it works, but if the attachments are .pdf or
.docx files they are not indexed.
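To confirm which attachments a given .eml file actually carries and what content type each one declares, a small script using only Python's standard email package can list them. This is a diagnostic sketch, not part of Solr or Tika; the function name list_attachments is made up for illustration, and it only inspects top-level attachments.

```python
from email import message_from_bytes, policy


def list_attachments(raw: bytes):
    """Return (filename, content type) for each top-level attachment.

    Hypothetical helper for illustration; it parses the message with the
    standard library only and does not extract any attachment text.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    return [
        (part.get_filename(), part.get_content_type())
        for part in msg.iter_attachments()
    ]


if __name__ == "__main__":
    import sys

    # Usage: python list_attachments.py Test1.eml
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as fh:
            for name, ctype in list_attachments(fh.read()):
                print(name, ctype)
```

If the .pdf and .docx parts show up here with the expected content types, the message itself is well formed and the difference lies in how the extraction is performed.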



I checked the extracted text by calling:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true&extractOnly=true&extractFormat=text" -F "myfile=@Test1.eml"
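Because a long URL like this is easily broken by line wrapping or shell quoting, the same request URL can be assembled programmatically. The following is a minimal sketch using only Python's standard library; the function name build_extract_url is hypothetical, and the host, port and parameter values are simply those from the curl call above.

```python
from urllib.parse import urlencode


def build_extract_url(base="http://localhost:8983/solr", doc_id="doc1"):
    """Build the /update/extract URL used in the curl call above.

    Hypothetical helper; base and doc_id defaults mirror the values in
    this message and would need adjusting for another setup.
    """
    params = {
        "literal.id": doc_id,
        "commit": "true",
        "extractOnly": "true",    # return the extracted text, do not index
        "extractFormat": "text",  # plain text instead of XHTML
    }
    return f"{base}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    print(build_extract_url())
```

The printed URL can then be passed to curl together with -F "myfile=@Test1.eml" as before.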

The returned extracted text does not contain the content of the
attachments if they are not .txt files.


It is not a problem with the Apache Tika library not being able to process
attachments, because running the standalone Apache Tika app by calling:


java -jar tika-app-1.4.jar -t Test1.eml


on my eml files correctly displays the attachments' text.



Maybe it is a problem with how Tika is called by Solr?

Is there something to modify in the default configuration?


Thanks for any help ;)
Olivier
