I've noticed that passing HTML to a field that uses
HTMLStripWhitespaceTokenizerFactory leaves some JavaScript in the token stream.
For example, using an analyzer like:
<fieldType name="HTMLStripper2" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
with text such as:
<html>
<head><title>title</title></head>
<body>
pre
<SCRIPT LANGUAGE="JavaScript">
var time = new Date();
ordval= (time.getTime());
</SCRIPT>
post <!-- comment -->
</body>
</html>
Analysis.jsp returns these tokens:
title
pre
var
time
=
new
Date();
ordval=
(time.getTime());
post
However, if the script in the page is commented out, everything works fine.
Is this a design choice? Shouldn't scripts be removed in both cases?
(Solr Implementation Version: 2008-03-24_09-57-01 ${svnversion} - hudson
- 2008-03-24 09:59:40)
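
One possible workaround until this is settled would be to drop script elements on the client side before the document is sent to Solr. Below is a minimal sketch in Python, not anything Solr itself does: the function name drop_script_blocks and the regular expression are illustrative assumptions, and a regex is only a rough stand-in for a real HTML parser (it will not cope with every malformed page).

```python
import re

# Assumption: a script element is everything from an opening <script ...> tag
# to the next closing </script> tag, matched case-insensitively across lines.
SCRIPT_BLOCK = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def drop_script_blocks(html: str) -> str:
    """Return html with each script element (tags and body) replaced by a space.

    Hypothetical helper for pre-processing text before indexing.
    """
    return SCRIPT_BLOCK.sub(" ", html)


# The sample page from this message.
sample = """<html>
<head><title>title</title></head>
<body>
pre
<SCRIPT LANGUAGE="JavaScript">
var time = new Date();
ordval= (time.getTime());
</SCRIPT>
post <!-- comment -->
</body>
</html>"""

cleaned = drop_script_blocks(sample)
print(cleaned)
```

The remaining markup (title, body text, the HTML comment) is left untouched, so the field's HTML-stripping tokenizer still has to handle it.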
Walter