Using Solr 6.6.0, I am able to index EML files just fine. The trick is, when indexing .eml files, to add "-filetypes eml" to the command line (note the plural "filetypes").
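On preferring the text/plain part: I don't know of a Tika setting for that, so what follows is only a possible workaround outside of Solr and Tika, not something I have run against your data. The idea is to pre-process each EML file with Python's standard `email` package, pick the text/plain alternative when one exists, and index that text instead of the raw file. The function name `plain_text_body` and the sample message are made up for illustration.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def plain_text_body(raw_eml: bytes) -> str:
    """Return the text/plain body of an EML message, falling back to HTML.

    Hypothetical helper: it is not part of Solr or Tika.
    """
    msg = BytesParser(policy=policy.default).parsebytes(raw_eml)
    # Ask for the plain part first; the HTML part is used only if no plain part exists.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    return body.get_content()


def build_sample_eml() -> bytes:
    """Build a small multipart/alternative message for demonstration only."""
    msg = EmailMessage()
    msg["Subject"] = "Sample"
    msg["From"] = "sender@example.com"
    msg["To"] = "recipient@example.com"
    msg.set_content("Meeting moved to Friday.")
    msg.add_alternative(
        '<p style="FONT-SIZE: 9pt; FONT-FAMILY: arial">Meeting moved to Friday.</p>',
        subtype="html",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    print(plain_text_body(build_sample_eml()))
```

The returned string could then be sent to Solr as an ordinary text field, so the inline style attributes in the HTML alternative never reach the index. Messages that carry only an HTML part would still need their markup stripped separately.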
Terry Steichen

On 1/13/19 10:18 PM, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> I am using Solr 7.5.0 with Tika 1.18.
>
> Currently I am facing a situation during the indexing of EML files, whereby
> the content is being extracted from the Content-type=text/html part instead
> of the Content-type=text/plain part.
>
> The problem with Content-type=text/html is that it contains a lot of words
> like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and all of
> these get indexed in Solr as well, which makes the content very cluttered.
> It also affects the search: when we search for words like "font", all
> the contents get returned because of this.
>
> Would like to enquire on the following:
> 1. Why didn't Tika get the text part (text/plain)? Is there any way to
> configure Tika in Solr to change the priority so that it gets the text part
> (text/plain) instead of the html part (text/html)?
> 2. If that is not possible, as you can see, the content is not clean, which
> is not right. How can we get this to be clean when Tika is extracting text?
>
> Regards,
> Edwin