The intention was to remove scripts. I developed HTMLStripReader by just looking at a bunch of real-world HTML. I hadn't run across an uppercase SCRIPT tag, so I didn't do a case-insensitive check.
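A minimal sketch of what a case-insensitive tag check could look like, assuming the tag name is available as a String. The class and method names here are hypothetical and are not taken from HTMLStripReader:

```java
/** Hypothetical helper for illustration; not part of HTMLStripReader. */
final class SkippedTags {
    private SkippedTags() {}

    /**
     * Returns true if the content of the named tag should be dropped.
     * The comparison ignores case, so "SCRIPT", "Script" and "script" all match.
     */
    static boolean isSkippedTag(String name) {
        // Calling equalsIgnoreCase on the literal keeps this null-safe.
        return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
    }
}
```

With a check like this, `<SCRIPT LANGUAGE="JavaScript">` would be treated the same as `<script>`.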
The code is currently:
  if (name.equals("script") || name.equals("style")) {
Should be easy enough to change unless there is a good reason not to.

-Yonik

On Thu, Apr 10, 2008 at 5:05 AM, Walter Ferrara <[EMAIL PROTECTED]> wrote:
> I've noticed that passing html to a field using
> HTMLStripWhitespaceTokenizerFactory, ends up in having some javascripts too.
> For example, using a analyzer like:
> <fieldType name="HTMLStripper2" class="solr.TextField" >
>   <analyzer>
>     <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
>
> with a text such as:
> <html>
> <head><title>title</title></head>
> <body>
> pre
> <SCRIPT LANGUAGE="JavaScript">
> var time = new Date();
> ordval= (time.getTime());
> </SCRIPT>
> post <!-- comment -->
> </body>
> </html>
>
> Analysis.jsp turns out those tokens:
> title
> pre
> var
> time
> =
> new
> Date();
> ordval=
> (time.getTime());
> post
>
> While if the script in the page is commented, everything works fine.
> Is this due to design choice? Shouldn't scripts be removed in both cases?
> (Solr Implementation Version: 2008-03-24_09-57-01 ${svnversion} - hudson -
> 2008-03-24 09:59:40)
>
> Walter