Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Steve Rowe
Cool, glad I was able to help.

On Apr 3, 2013, at 4:18 PM, Ashok wrote:
> Hi Steve,
>
> Fabulous suggestion! Yup, that is it! Using the HTMLStripTransformer twice
> did the trick. I am using Solr 4.1.
>
> Thank you very much!
>
> - ashok
>
> --
> View this message in context:
> http:/

Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Ashok
Hi Steve, Fabulous suggestion! Yup, that is it! Using the HTMLStripTransformer twice did the trick. I am using Solr 4.1. Thank you very much! - ashok -- View this message in context: http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582p4053609.h

Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Steve Rowe
Hi Ashok, HTMLStripTransformer uses HTMLStripCharFilter under the hood, and HTMLStripCharFilter converts all HTML entities to their corresponding characters. What version of Solr are you using? My guess is that it only appears that nothing is happening, since when they are presented in a brow
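A minimal sketch of why a second pass can help, assuming the field holds tags that were stored as entities (for example `&lt;b&gt;`). This is plain Python, not Solr code: `strip_html_once` is a hypothetical stand-in for one strip pass that removes literal tags and decodes entities, and the sample string is invented for illustration.

```python
import html
import re

# Hypothetical, simplified tag matcher; real HTML stripping is more involved.
TAG_RE = re.compile(r"<[^>]+>")

def strip_html_once(text: str) -> str:
    """One pass: drop literal tags, then decode entities to characters."""
    without_tags = TAG_RE.sub("", text)
    return html.unescape(without_tags)

# Assumed sample: markup stored in the database as escaped entities.
raw = "&lt;b&gt;Solr&lt;/b&gt; &amp; Lucene"

first = strip_html_once(raw)     # entities decoded, so tags now appear literally
second = strip_html_once(first)  # the now-literal tags are removed

print(first)
print(second)
```

The first pass finds no literal tags to remove and only decodes the entities, which is why the output can still look like markup. The second pass sees those decoded tags and strips them.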

Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Alexandre Rafalovitch
Then, I would say, you have a bigger problem. However, you can probably run a RegEx filter and replace those known escapes with real characters before you run your HTMLStrip filter. Or run HTMLStrip, RegEx, and HTMLStrip again. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ Lin
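The regex-first ordering suggested above could look roughly like this. It is a plain Python sketch rather than a DIH configuration: the `KNOWN_ESCAPES` table, the helper names, and the sample string are all assumptions made for illustration, and the tag pattern is deliberately simplistic.

```python
import re

# Assumed set of escapes known to occur in the data; extend as needed.
KNOWN_ESCAPES = {"&lt;": "<", "&gt;": ">", "&quot;": '"', "&amp;": "&"}
ESCAPE_RE = re.compile("|".join(re.escape(key) for key in KNOWN_ESCAPES))
TAG_RE = re.compile(r"<[^>]+>")

def replace_known_escapes(text: str) -> str:
    """Replace each known escape with its character in a single scan."""
    return ESCAPE_RE.sub(lambda match: KNOWN_ESCAPES[match.group(0)], text)

def strip_tags(text: str) -> str:
    """Remove anything that looks like an HTML tag."""
    return TAG_RE.sub("", text)

def clean(text: str) -> str:
    """Regex replacement first, then tag stripping."""
    return strip_tags(replace_known_escapes(text))

# Assumed sample mixing escaped tags with literal tags.
print(clean("&lt;p&gt;Price &amp; terms&lt;/p&gt; <i>draft</i>"))
```

Because the replacement runs as one scan, a doubly escaped value such as `&amp;lt;` is decoded only one level, to `&lt;`, and would need another round to be removed.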

Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Ashok
Well, the database field has text, sometimes with HTML entities and at other times with html tags. I have no control over the process that populates the database tables with info. -- View this message in context: http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTra

Re: HTML entities being missed by DIH HTMLStripTransformer

2013-04-03 Thread Gora Mohanty
On 4 April 2013 00:30, Ashok wrote:
[...]
> Two questions.
>
> (1) Is this the expected behavior of DIH HTMLStripTransformer?

Yes, I believe so.

> (2) If yes, is there another transformer that I can employ first to turn
> these html entities into their usual symbols that can then be removed b