Although Vincenzo's and Alexandre's suggestions may be helpful in the right circumstances, there is a continuum of answers to the original question here. This continuum is mostly relevant if indexing and querying are likely to happen simultaneously, or if the data volume is large enough relative to the server to make you wish indexing would finish faster. Otherwise maintainability, local talent, and time-investment concerns probably dominate, with the caveat that in many cases initial success leads to a future where data volumes are large or where querying and indexing do become simultaneous.

1) Vincenzo's answer is suitable for one or a few small fields with a very narrow set of possible HTML-like tags. If the number of patterns that need to be matched is high, or the text to be matched is long, I would expect this solution to begin to negatively impact performance.

2) Alexandre's suggestion is much better when there is a moderate amount of text and the input could be general HTML, but as the amount of text that needs HTML stripped grows, the performance of the server will degrade faster than necessary with increased indexing load.

3) If the SolrCloud cluster you are indexing into must simultaneously provide good response times for queries, and you are not able to supply it with an overabundance of hardware relative to the query/indexing load, then you should consider pre-processing the documents in an external ingestion system such as JesterJ, Fusion, or one of the variety of other solutions out there. As the indexing and query load goes up, the best practice is to move as much pre-processing work out of Solr as possible so that Solr can continue to do what it does well and return query results quickly.

In the end, like most engineering decisions, it's a cost trade-off: which costs more, investing in setting up external processing or investing in server hardware?

If it's a small amount of data loaded batch-style prior to querying, you are in a good place and any of these will work. Just do whatever is fastest/easiest to implement. If you need to support a high volume of data being loaded into Solr in a timely manner, or you require minimal impact on query latency from indexing, you want some variation of 3.

-Gus

On Sun, Dec 30, 2018 at 10:29 PM Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> Specifically, a custom Update Request Processor chain can be used before
> indexing. Probably with HTMLStripFieldUpdateProcessorFactory
>
> Regards,
>     Alex
>
> On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore <v.dam...@gmail.com wrote:
>
> > Hi,
> >
> > I think this kind of text manipulation should be done before indexing,
> > if you have font-size font-family in your text, very likely you’re
> > indexing an html with css.
> > If I’m right, you’re just entering in a hell of words that should be
> > removed from your text.
> >
> > On the other hand, if you have to do this at index time, a quick and
> > dirty solution is using the pattern-replace filter.
> >
> > https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
> >
> > Ciao,
> > Vincenzo
> >
> > --
> > mobile: 3498513251
> > skype: free.dev
> >
> > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I noticed that during the indexing of EML files, there are words like
> > > "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" that are being indexed into the
> > > content as well.
> > >
> > > Would like to check, how are we able to remove those words during the
> > > indexing?
> > >
> > > I am using Solr 7.5.0
> > >
> > > Regards,
> > > Edwin

--
http://www.the111shift.com
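
P.S. For anyone wondering what "some variation of 3" might look like at its very simplest, here is a minimal sketch of stripping markup and CSS outside of Solr, before the document is ever sent for indexing. It uses only the Python standard library. The names TextExtractor, strip_html and to_solr_doc, and the field names "id" and "content", are made up for illustration; this is not how JesterJ or Fusion do it, and a real pipeline would also need to parse the EML container, handle attachments, batch the updates, and so on.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <style> and <script>."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Attributes (including inline style="FONT-SIZE: 9pt") are simply
        # ignored, so they never reach the extracted text.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent block elements do not run
        # together, then collapse runs of whitespace. The trade-off is that
        # a word split by inline markup (wor<b>ld</b>) becomes two tokens.
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup: str) -> str:
    """Return the visible text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


def to_solr_doc(doc_id: str, raw_html: str) -> dict:
    """Build a document ready to send to Solr (field names are hypothetical)."""
    return {"id": doc_id, "content": strip_html(raw_html)}
```

Because this runs on the ingestion side, the cost of parsing is paid by whatever machine does the feeding rather than by the Solr nodes that are also answering queries, which is the whole point of option 3.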