Hi Ashok,

HTMLStripTransformer uses HTMLStripCharFilter under the hood, and 
HTMLStripCharFilter converts all HTML entities to their corresponding 
characters.

What version of Solr are you using?

My guess is that it only appears that nothing is happening, since entities 
presented in a browser show up as the characters they represent.

I think (never done this myself) that if you apply the HTMLStripTransformer 
twice, it will first convert the entities to characters, and then on the second 
pass, remove the HTML constructs.
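
To make the two-pass idea concrete, here is a rough Python sketch. It is not Solr code: strip_pass is a hypothetical stand-in for one run of the filter, and it assumes a single pass drops the tags it sees and decodes entities without re-parsing the decoded output. The tag pattern is deliberately simplified.

```python
import html
import re

# Simplified tag pattern (an assumption for illustration only):
# "<" or "</" followed by a letter, up to the next ">".
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_pass(text: str) -> str:
    """One simulated pass: drop literal tags, then decode entities."""
    without_tags = TAG_RE.sub("", text)
    return html.unescape(without_tags)


escaped = "&lt;b&gt;bold&lt;/b&gt; text"
first = strip_pass(escaped)   # entities decoded, so the tags are now literal
second = strip_pass(first)    # the literal tags are removed on this pass
```

If the real filter behaves like this stand-in, a field holding entity-escaped markup would need the second pass before the tags disappear.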

From <http://wiki.apache.org/solr/DataImportHandler#Transformer>:

-----
The entity transformer attribute can consist of a comma separated list of 
transformers (say transformer="foo.X,foo.Y"). The transformers are chained in 
this case and they are applied one after the other in the order in which they 
are specified. What this means is that after the fields are fetched from the 
datasource, the list of entity columns are processed one at a time in the order 
listed inside the entity tag and scanned by the first transformer to see if any 
of that transformer's attributes are present. If so, the transformer does its 
thing! When all of the listed entity columns have been scanned the process is 
repeated using the next transformer in the list.
-----
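
The chaining described in that excerpt can be sketched in plain Python. This only illustrates the ordering semantics and is not the DataImportHandler API: apply_chain, decode_entities, strip_tags and the "body" column are all made-up names, and the tag pattern is simplified.

```python
import html
import re

# Simplified tag pattern, assumed for illustration only.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def decode_entities(row: dict) -> dict:
    """Hypothetical transformer: turn entities in 'body' into characters."""
    return {**row, "body": html.unescape(row["body"])}


def strip_tags(row: dict) -> dict:
    """Hypothetical transformer: remove literal tags from 'body'."""
    return {**row, "body": TAG_RE.sub("", row["body"])}


def apply_chain(row: dict, transformers: list) -> dict:
    """Apply each transformer in the listed order, feeding each result to the next."""
    for transform in transformers:
        row = transform(row)
    return row


row = {"body": "&lt;p&gt;hello&lt;/p&gt;"}
in_order = apply_chain(row, [decode_entities, strip_tags])
reversed_order = apply_chain(row, [strip_tags, decode_entities])
```

With the transformers listed as decode then strip, the markup is gone. With the order reversed, the tags only become literal after the strip step has already run, so they survive.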

Steve

On Apr 3, 2013, at 3:30 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> Then, I would say, you have a bigger problem....
> 
> However, you can probably run a RegEx filter and replace those known escapes
> with real characters before you run your HTMLStrip filter. Or run
> HTMLStrip, RegEx, and HTMLStrip again.
> 
> Regards,
>   Alex.
> 
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
> 
> 
> On Wed, Apr 3, 2013 at 3:19 PM, Ashok <ash...@qualcomm.com> wrote:
> 
>> Well, the database field has text, sometimes with HTML entities and at other
>> times with HTML tags. I have no control over the process that populates the
>> database tables with info.

