These texts are likely from the original EML file data, but they are not visible in the content when the EML file is opened in Microsoft Outlook.
I have already applied the HTMLStripFieldUpdateProcessorFactory in solrconfig.xml, but these texts are still showing up in the index. Below is my configuration.

<updateRequestProcessorChain name="html-strip-content">
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">content_tcs</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Regards,
Edwin

On Mon, 31 Dec 2018 at 11:29, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> Specifically, a custom Update Request Processor chain can be used before
> indexing. Probably with HTMLStripFieldUpdateProcessorFactory.
>
> Regards,
> Alex
>
> On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore <v.dam...@gmail.com> wrote:
>
> > Hi,
> >
> > I think this kind of text manipulation should be done before indexing. If
> > you have font-size and font-family in your text, very likely you’re
> > indexing HTML with CSS.
> > If I’m right, you’re entering a hell of words that should be removed
> > from your text.
> >
> > On the other hand, if you have to do this at index time, a quick and
> > dirty solution is using the pattern-replace filter.
> >
> > https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
> >
> > Ciao,
> > Vincenzo
> >
> > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > wrote:
> > >
> > > Hi,
> > >
> > > I noticed that during the indexing of EML files, there are words like
> > > "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" that are being indexed into the
> > > content as well.
> > >
> > > Would like to check, how are we able to remove those words during the
> > > indexing?
> > >
> > > I am using Solr 7.5.0
> > >
> > > Regards,
> > > Edwin
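
For illustration, the pattern-replace idea quoted above can be tried outside Solr first. The following is only a minimal Python sketch, not Solr configuration: the regular expression and the function name strip_css_declarations are assumptions made for this example, it covers only the two CSS properties reported in this thread, and it assumes each value is a single token (so a multi-word font name would not be fully removed).

```python
import re

# Hypothetical pattern for inline CSS declarations such as
# "FONT-SIZE: 9pt;" or "FONT-FAMILY: arial". The value is assumed to be a
# single token with no whitespace; extend the alternation for other properties.
CSS_DECLARATION = re.compile(
    r"\b(?:font-size|font-family)\s*:\s*[^\s;]+;?",
    re.IGNORECASE,
)


def strip_css_declarations(text: str) -> str:
    """Remove matching CSS declarations, then collapse leftover whitespace."""
    cleaned = CSS_DECLARATION.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    sample = "Hello FONT-SIZE: 9pt; FONT-FAMILY: arial world"
    print(strip_css_declarations(sample))
```

A regex checked this way on a few sample strings taken from the extracted content could then be adapted to a pattern-replace step at index time, subject to how the field is actually analyzed.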