I've noticed that when I pass HTML to a field that uses HTMLStripWhitespaceTokenizerFactory, some JavaScript ends up in the token stream.
For example, using an analyzer like:
<fieldType name="HTMLStripper2" class="solr.TextField" >
         <analyzer>
           <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
         </analyzer>
</fieldType>

with input text such as:
<html>
<head><title>title</title></head>
<body>
pre
<SCRIPT LANGUAGE="JavaScript">
     var time = new Date();
     ordval= (time.getTime());
</SCRIPT>
post <!-- comment -->
</body>
</html>

Analysis.jsp returns these tokens:
title
pre
var
time
=
new
Date();
ordval=
(time.getTime());
post

If the script body is wrapped in an HTML comment (<!-- ... -->), however, everything works fine and no script tokens appear.
Is this a design choice? Shouldn't scripts be removed in both cases?
(Solr Implementation Version: 2008-03-24_09-57-01 ${svnversion} - hudson - 2008-03-24 09:59:40)
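
As a stopgap, I'm considering stripping the script elements myself before the text reaches Solr. Below is only a rough sketch of that idea: ScriptStripper and stripScripts are names I made up (they are not Solr classes), and the regex assumes each <SCRIPT ...> has a matching </SCRIPT> and that no attribute value contains a ">".

```java
import java.util.regex.Pattern;

// Hypothetical client-side helper, not part of Solr.
final class ScriptStripper {
    // Matches <script ...>...</script> in any letter case, across line breaks.
    // The reluctant ".*?" stops at the first closing tag.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Replaces every script element with a single space so that the
    // surrounding words are not glued together.
    static String stripScripts(String html) {
        return SCRIPT_BLOCK.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        String html = "<body>\npre\n"
                + "<SCRIPT LANGUAGE=\"JavaScript\">\n"
                + "     var time = new Date();\n"
                + "     ordval= (time.getTime());\n"
                + "</SCRIPT>\n"
                + "post <!-- comment -->\n</body>";
        // The remaining markup is left for the HTML-stripping tokenizer.
        System.out.println(stripScripts(html));
    }
}
```

With something like this in front of the indexing call, only "pre" and "post" should be left in the body for the tokenizer to see, but I'd still prefer to have the tokenizer handle it.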

Walter
