I've just committed a change to ignore case when comparing tag names.

-Yonik

On Thu, Apr 10, 2008 at 9:03 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> It was the intention to remove script.
> I developed HTMLStripReader by just looking at a bunch of real-world HTML.
> I hadn't run across script in uppercase, so I didn't do a
> case-insensitive check.
>
> The code is currently:
>   if (name.equals("script") || name.equals("style")) {
>
> Should be easy enough to change unless there is a good reason not to.
>
> -Yonik
>
> On Thu, Apr 10, 2008 at 5:05 AM, Walter Ferrara <[EMAIL PROTECTED]> wrote:
> > I've noticed that passing HTML to a field that uses
> > HTMLStripWhitespaceTokenizerFactory ends up keeping some JavaScript too.
> > For example, using an analyzer like:
> > <fieldType name="HTMLStripper2" class="solr.TextField">
> >   <analyzer>
> >     <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > with a text such as:
> > <html>
> > <head><title>title</title></head>
> > <body>
> > pre
> > <SCRIPT LANGUAGE="JavaScript">
> > var time = new Date();
> > ordval= (time.getTime());
> > </SCRIPT>
> > post <!-- comment -->
> > </body>
> > </html>
> >
> > Analysis.jsp produces these tokens:
> > title
> > pre
> > var
> > time
> > =
> > new
> > Date();
> > ordval=
> > (time.getTime());
> > post
> >
> > If the script in the page is commented out, everything works fine.
> > Is this by design? Shouldn't scripts be removed in both cases?
> > (Solr Implementation Version: 2008-03-24_09-57-01 ${svnversion} - hudson -
> > 2008-03-24 09:59:40)
> >
> > Walter
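
For anyone reading this thread later: a minimal sketch of what a case-insensitive version of the check quoted above might look like. This is not the committed change itself. The class name `TagNameCheck` and the method `isSkippedTag` are hypothetical and exist only for illustration; the only thing taken from the thread is the pair of tag names, "script" and "style".

```java
// Illustrative only: a hypothetical helper, not the actual HTMLStripReader source.
final class TagNameCheck {
    private TagNameCheck() {}

    /** Returns true when the named element's content should be dropped. */
    static boolean isSkippedTag(String name) {
        // String.equalsIgnoreCase does not depend on the default locale and
        // returns false for a null argument, so no extra null check is needed.
        return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
    }

    public static void main(String[] args) {
        // Print the decision for a few tag names in mixed case.
        for (String tag : new String[] {"script", "SCRIPT", "Style", "body"}) {
            System.out.println(tag + " -> " + isSkippedTag(tag));
        }
    }
}
```

Putting the literal on the left-hand side keeps the check safe if the parsed tag name is ever null, and `equalsIgnoreCase` avoids the locale pitfalls of calling `toLowerCase()` before comparing.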