
Terry Steichen Mon, 14 Jan 2019 06:05:21 -0800

Using 6.6.0, I am able to index EML files just fine.  The trick is, when
indexing files with the .eml extension, add "-filetypes eml" to the command
line (note the plural: filetypes).


Terry Steichen

On 1/13/19 10:18 PM, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> I am using Solr 7.5.0 with Tika 1.18.
>
> Currently I am facing a situation during the indexing of EML files, whereby
> the content is being extracted from the Content-type=text/html instead of
> Content-type=text/plain.
>
> The problem with Content-type=text/html is that it contains a lot of words
> like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and all of
> these get indexed in Solr as well, which makes the content very cluttered.
> It also affects search: when we search for words like "font", all
> the content gets returned because of this.
>
> I would like to enquire about the following:
> 1. Why didn't Tika get the text part (text/plain)? Is there any way to
> configure Tika in Solr to prioritize the text part (text/plain) over the
> HTML part (text/html)?
> 2. If that is not possible, the content is, as you can see, not clean.
> How can we get clean content when Tika is extracting text?
>
> Regards,
> Edwin
>
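
On the text/plain question quoted above: one possible workaround, outside of
Tika and Solr entirely, is to pull the preferred body part out of each EML
file before indexing. The sketch below uses only Python's standard email
package. The function name preferred_body_text and the sample message are
made up for illustration, and this does not change how Tika itself picks a
part; it only shows the "prefer text/plain, fall back to text/html" idea.

```python
from email import message_from_string, policy


def preferred_body_text(raw_eml: str) -> str:
    """Return the text/plain body if there is one, else text/html, else ''."""
    # policy.default makes the parser return EmailMessage objects,
    # which provide get_body() and get_content().
    msg = message_from_string(raw_eml, policy=policy.default)
    # Ask for the plain part first; only fall back to HTML if it is missing.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    return part.get_content()


# Hypothetical sample: a multipart/alternative mail with both body variants.
SAMPLE_EML = "\n".join([
    "From: sender@example.com",
    "To: rcpt@example.com",
    "Subject: demo",
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary="BOUNDARY"',
    "",
    "--BOUNDARY",
    'Content-Type: text/plain; charset="utf-8"',
    "",
    "Hello from the plain part.",
    "--BOUNDARY",
    'Content-Type: text/html; charset="utf-8"',
    "",
    '<p style="FONT-SIZE: 9pt; FONT-FAMILY: arial">Hello from the HTML part.</p>',
    "--BOUNDARY--",
    "",
])

if __name__ == "__main__":
    print(preferred_body_text(SAMPLE_EML))
```

The extracted text could then be sent to Solr as an ordinary field, so the
inline style clutter from the HTML part never reaches the index. Mails that
carry only an HTML part would still need separate clean-up.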

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain
