Thanks for your reply. What I have found is that the EML file contains two Content-Type parts: one is text/html, and the other is text/plain.
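For anyone who wants to confirm this on their own files, here is a minimal sketch using Python's standard-library email package. It lists the leaf parts of a message and pulls out only the text/plain body. The sample message, the addresses, and the helper names (list_part_types, plain_text_body) are made up for illustration. This runs outside Solr, so it only shows the pre-processing idea and is not a Solr setting.

```python
from email import message_from_string, policy

# Hypothetical two-part message standing in for a real EML file.
SAMPLE_EML = """\
From: sender@example.com
To: recipient@example.com
Subject: Test
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="utf-8"

Hello, this is the clean body.
--BOUNDARY
Content-Type: text/html; charset="utf-8"

<html><body><span style="FONT-SIZE: 9pt; FONT-FAMILY: arial">Hello, this is the clean body.</span></body></html>
--BOUNDARY--
"""


def list_part_types(raw):
    """Return the content type of every non-container part, in order."""
    msg = message_from_string(raw, policy=policy.default)
    return [p.get_content_type() for p in msg.walk() if not p.is_multipart()]


def plain_text_body(raw):
    """Return the text/plain body as a string, or None if there is none."""
    msg = message_from_string(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else None


part_types = list_part_types(SAMPLE_EML)
plain_body = plain_text_body(SAMPLE_EML)
print(part_types)
print(plain_body)
```

The string returned by plain_text_body could then be sent to Solr as the content field in place of the raw EML, which would keep the CSS words from the text/html part out of the index.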
The text/html part has words like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in its content, but the text/plain part has no such words, and its content is clean (just what is in the email). As such, I believe that the indexing is done on the text/html part.

Is there any way that we can change the settings so that the indexing is done on the text/plain part?

Regards,
Edwin

On Wed, 2 Jan 2019 at 03:27, Gus Heck <gus.h...@gmail.com> wrote:

> Although Vincenzo and Alexandre's suggestions may be helpful in the right
> circumstances, there is a continuum of answers to the original question
> here. This continuum is mostly relevant if indexing and querying is likely
> to happen simultaneously or the data volume is large enough relative to the
> server to make you wish indexing would finish faster. Otherwise
> maintainability, local talent and time investment concerns probably
> dominate, with the caveat that in many cases, initial success may lead to a
> future with large data volumes or where querying and indexing do become
> simultaneous.
>
> 1) Vincenzo's answer would be suitable for a single or a few small fields
> with a very narrow set of possible HTML-like tags. If the number of
> patterns that need to be matched is high or the length of the text for
> matching is long I would expect this solution to begin to negatively impact
> performance.
>
> 2) Alexandre's suggestion is much better in the case where there is a
> moderate amount of text and the input could be generalized HTML, but as the
> amount of text that needs to have HTML stripped grows the performance of
> the server will also degrade faster than necessary with increased indexing
> load.
>
> 3) If the Solr Cloud you are indexing into will simultaneously need to
> provide good response times for queries, and you are not able to supply
> it with an overabundance of hardware relative to the query/indexing load,
> then you should consider pre-processing the documents in an external
> ingestion system such as JesterJ, Fusion, or a variety of other solutions
> out there. As the indexing and query load goes up, the best practice is to
> move as much pre-processing work out of Solr as possible so that Solr can
> continue to do what it does well and return queries quickly.
>
> In the end, like most engineering decisions, it's a cost trade-off
> consideration. What costs more: investing in setting up external processing
> or investing in server hardware? If it's a small amount of data loaded
> batch style prior to querying, you are in a good place and any of these
> will work. Just do whatever is fastest/easiest to implement. If you need to
> support a high volume of data being loaded into Solr in a timely manner or
> you require minimal impact to query latency due to indexing, you want some
> variation of 3.
>
> -Gus
>
> On Sun, Dec 30, 2018 at 10:29 PM Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
> > Specifically, a custom Update Request Processor chain can be used before
> > indexing. Probably with HTMLStripFieldUpdateProcessorFactory
> > Regards,
> > Alex
> >
> > On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I think this kind of text manipulation should be done before indexing. If
> > > you have font-size font-family in your text, very likely you’re indexing
> > > an HTML with CSS.
> > > If I’m right, you’re just entering in a hell of words that should be
> > > removed from your text.
> > >
> > > On the other hand, if you have to do this at index time, a quick and dirty
> > > solution is using the pattern-replace filter.
> > >
> > > https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
> > >
> > > Ciao,
> > > Vincenzo
> > >
> > > --
> > > mobile: 3498513251
> > > skype: free.dev
> > >
> > > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > I noticed that during the indexing of EML files, there are words like
> > > > "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" that are being indexed into the
> > > > content as well.
> > > >
> > > > Would like to check, how are we able to remove those words during the
> > > > indexing?
> > > >
> > > > I am using Solr 7.5.0
> > > >
> > > > Regards,
> > > > Edwin
>
> --
> http://www.the111shift.com