I've just committed a change to ignore case when comparing tag names.
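
A minimal sketch of what such a case-insensitive check could look like, not the committed code itself (the actual patch isn't shown in this thread). The class name TagNameCheck and the helper isStrippedBlock are hypothetical, made up for illustration; the only assumption is that the tag name arrives as a plain String, as in the quoted condition below.

```java
public class TagNameCheck {

    // Hypothetical helper: returns true for tags whose whole content
    // should be dropped, regardless of how the tag name is cased.
    static boolean isStrippedBlock(String name) {
        // equalsIgnoreCase matches "script", "SCRIPT", "Script", ...
        return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
    }

    public static void main(String[] args) {
        String[] samples = {"script", "SCRIPT", "Style", "body"};
        for (String tag : samples) {
            System.out.println(tag + " -> " + isStrippedBlock(tag));
        }
    }
}
```

Putting the literal on the left of equalsIgnoreCase also keeps the check safe if the name is ever null.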
-Yonik

On Thu, Apr 10, 2008 at 9:03 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> The intention was to remove script content.
>  I developed HTMLStripReader by just looking at a bunch of real-world HTML.
>  I hadn't run across script in uppercase, so I didn't do a
>  case-insensitive check.
>
>  The code is currently:
>     if (name.equals("script") || name.equals("style")) {
>
>  Should be easy enough to change unless there is a good reason not to.
>
>  -Yonik
>
>
>
>  On Thu, Apr 10, 2008 at 5:05 AM, Walter Ferrara <[EMAIL PROTECTED]> wrote:
>  > I've noticed that passing HTML to a field using
>  > HTMLStripWhitespaceTokenizerFactory ends up indexing some JavaScript too.
>  >  For example, using an analyzer like:
>  >  <fieldType name="HTMLStripper2" class="solr.TextField" >
>  >          <analyzer>
>  >            <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>  >          </analyzer>
>  >  </fieldType>
>  >
>  >  with a text such as:
>  >  <html>
>  >  <head><title>title</title></head>
>  >  <body>
>  >  pre
>  >  <SCRIPT LANGUAGE="JavaScript">
>  >      var time = new Date();
>  >      ordval= (time.getTime());
>  >  </SCRIPT>
>  >  post <!-- comment -->
>  >  </body>
>  >  </html>
>  >
>  >  Analysis.jsp outputs these tokens:
>  >  title
>  >  pre
>  >  var
>  >  time
>  >  =
>  >  new
>  >  Date();
>  >  ordval=
>  >  (time.getTime());
>  >  post
>  >
>  >  If the script in the page is commented out, everything works fine.
>  >  Is this by design? Shouldn't scripts be removed in both cases?
>  >  (Solr Implementation Version: 2008-03-24_09-57-01 ${svnversion} - hudson -
>  >  2008-03-24 09:59:40)
>  >
>  >  Walter
>  >
>  >
>
