The intention was to remove script content.
I developed HTMLStripReader by just looking at a bunch of real-world HTML.
I hadn't run across script in uppercase, so I didn't do a case
insensitive check.

The code is currently:
    if (name.equals("script") || name.equals("style")) {

Should be easy enough to change unless there is a good reason not to.
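
A minimal sketch of what the case-insensitive version of that check could look like, assuming `name` holds the tag name exactly as it was read from the input. The class and method names below are hypothetical, for illustration only, and are not part of HTMLStripReader:

```java
// Hypothetical helper, not the actual HTMLStripReader code.
final class TagNameCheck {
    private TagNameCheck() {}

    // Returns true when the tag's body should be dropped along with the tag,
    // regardless of how the tag name is capitalized in the source HTML.
    static boolean isSkippedTag(String name) {
        // Calling equalsIgnoreCase on the literal keeps this safe for a null name.
        return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
    }
}
```

With a check like this, `SCRIPT`, `Script`, and `script` would all be treated the same way, which covers the uppercase case from the report below.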

-Yonik

On Thu, Apr 10, 2008 at 5:05 AM, Walter Ferrara <[EMAIL PROTECTED]> wrote:
> I've noticed that passing HTML to a field that uses
> HTMLStripWhitespaceTokenizerFactory ends up indexing some JavaScript too.
>  For example, using an analyzer like:
>  <fieldType name="HTMLStripper2" class="solr.TextField" >
>          <analyzer>
>            <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>          </analyzer>
>  </fieldType>
>
>  with a text such as:
>  <html>
>  <head><title>title</title></head>
>  <body>
>  pre
>  <SCRIPT LANGUAGE="JavaScript">
>      var time = new Date();
>      ordval= (time.getTime());
>  </SCRIPT>
>  post <!-- comment -->
>  </body>
>  </html>
>
>  Analysis.jsp produces these tokens:
>  title
>  pre
>  var
>  time
>  =
>  new
>  Date();
>  ordval=
>  (time.getTime());
>  post
>
>  If the script in the page is commented out, however, everything works fine.
>  Is this a design choice? Shouldn't scripts be removed in both cases?
>  (Solr Implementation Version: 2008-03-24_09-57-01 ${svnversion} - hudson -
> 2008-03-24 09:59:40)
>
>  Walter
>
>
