Perhaps https://royvanrijn.com/blog/2016/03/java-mail-message-as-download/ may be helpful? Though I see the date on it and am now unsure.

-- H
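In case it helps to check the text/html versus text/plain theory raised further down the thread, here is a minimal sketch, not tied to Solr or Tika, that uses Python's standard `email` module to list the MIME parts of a message and pull out only the text/plain alternative. The sample message, the addresses, and the helper names (`build_sample`, `list_parts`, `plain_text_body`) are made up for illustration; a real EML file would be read from disk as bytes instead.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import List, Optional


def build_sample() -> bytes:
    """Build a small multipart/alternative message (illustrative data only)."""
    msg = EmailMessage()
    msg["Subject"] = "Sample"
    msg["From"] = "sender@example.com"
    msg["To"] = "recipient@example.com"
    # The plain-text alternative carries no markup.
    msg.set_content("Hello from the plain part.")
    # The HTML alternative carries inline CSS like the text seen in the index.
    msg.add_alternative(
        "<html><body>\n"
        '<p style="FONT-SIZE: 9pt; FONT-FAMILY: arial">\n'
        "Hello from the HTML part.</p>\n"
        "</body></html>\n",
        subtype="html",
    )
    return msg.as_bytes()


def list_parts(raw: bytes) -> List[str]:
    """Return the content type of every MIME part, container first."""
    msg = message_from_bytes(raw, policy=policy.default)
    return [part.get_content_type() for part in msg.walk()]


def plain_text_body(raw: bytes) -> Optional[str]:
    """Return the text/plain body, or None if the message has none."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return body.get_content() if body is not None else None


if __name__ == "__main__":
    raw = build_sample()
    print(list_parts(raw))
    print(plain_text_body(raw))
```

If the plain part turns out to be clean, one option is to extract it like this before sending documents to Solr, in line with the advice in the thread to do the text manipulation before indexing. Whether Tika can be configured to prefer the plain part is something I have not verified.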
On Mon, 31 Dec 2018 at 17:51, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi Alex,
>
> I have tried with a file that is HTML formatted, with tags like
> <html>, <head>, <body>, etc., and those get removed during indexing.
>
> For tags like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", I found that in the
> EML file there are two different content types, text/html and text/plain.
> Could it be due to Tika getting the content from text/html instead of
> text/plain?
>
> Regards,
> Edwin
>
> On Mon, 31 Dec 2018 at 23:52, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
> > EML is for emails, so there are probably some HTML-formatted emails
> > that you are getting. Probably with the alternative text-part. Outlook
> > would render HTML and/or use text part. I think you can just open EML
> > in an editor to check it out.
> >
> > As to URP, are you absolutely sure it is being used? It is not
> > declared as default, so you need to call it explicitly. Try setting a
> > field in there or some other clear flag that a record has been
> > processed.
> >
> > Regards,
> > Alex.
> >
> > On Sun, 30 Dec 2018 at 22:46, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > wrote:
> >
> > > These texts are likely from the original EML file data, but they are
> > > not visible in the content when the EML file is opened in Microsoft
> > > Outlook.
> > >
> > > I have already applied the HTMLStripFieldUpdateProcessorFactory in
> > > solrconfig.xml, but these texts are still showing up in the index.
> > > Below is my configuration.
> > >
> > > <updateRequestProcessorChain name="html-strip-content">
> > >   <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
> > >     <str name="fieldName">content_tcs</str>
> > >   </processor>
> > >   <processor class="solr.LogUpdateProcessorFactory" />
> > >   <processor class="solr.RunUpdateProcessorFactory" />
> > > </updateRequestProcessorChain>
> > >
> > > Regards,
> > > Edwin
> > >
> > > On Mon, 31 Dec 2018 at 11:29, Alexandre Rafalovitch <arafa...@gmail.com>
> > > wrote:
> > >
> > > > Specifically, a custom Update Request Processor chain can be used
> > > > before indexing. Probably with HTMLStripFieldUpdateProcessorFactory.
> > > >
> > > > Regards,
> > > > Alex
> > > >
> > > > On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore <v.dam...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I think this kind of text manipulation should be done before
> > > > > indexing. If you have font-size and font-family in your text, very
> > > > > likely you're indexing HTML with CSS.
> > > > > If I'm right, you're heading into a hell of words that should be
> > > > > removed from your text.
> > > > >
> > > > > On the other hand, if you have to do this at index time, a quick and
> > > > > dirty solution is using the pattern-replace filter.
> > > > >
> > > > > https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
> > > > >
> > > > > Ciao,
> > > > > Vincenzo
> > > > >
> > > > > --
> > > > > mobile: 3498513251
> > > > > skype: free.dev
> > > > >
> > > > > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo
> > > > > > <edwinye...@gmail.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I noticed that during the indexing of EML files, there are words
> > > > > > like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" that are being indexed
> > > > > > into the content as well.
> > > > > >
> > > > > > I would like to check: how can we remove those words during
> > > > > > indexing?
> > > > > >
> > > > > > I am using Solr 7.5.0
> > > > > >
> > > > > > Regards,
> > > > > > Edwin

--
OpenPGP: https://sks-keyservers.net/pks/lookup?op=get&search=0xFEBAD7FFD041BBA1
If you wish to request my time, please do so using bit.ly/hd1AppointmentRequest <http://bit.ly/hd1AppointmentRequest>.
Sent from my mobile device