How does HTMLStripWhitespaceTokenizerFactory work?

Thierry Collogne Fri, 08 Jun 2007 03:33:32 -0700

Hello,

I am trying to use solr.HTMLStripWhitespaceTokenizerFactory in an analyzer,
with no luck.


I have a field, content, that contains the following:

<field name="content"><![CDATA[test      <a href="test">link</a>
                                post]]></field>

When I do a search, I get the following:

<result name="response" numFound="1" start="0">
<doc>
 <str name="content">test      &lt;a href="test"&gt;link&lt;/a&gt;
                             post</str>

 <str name="id">po_1_NL</str>
 <str name="keywords">post</str>
 <str name="titlesearch">This is a test</str>
</doc>
</result>


Is this normal? Shouldn't the HTML markup and the extra whitespace be removed
from the field?
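
To make the expectation in the question concrete, here is a rough sketch of
"strip the markup, then split on whitespace" applied to the sample value. It
is written in Python purely for illustration and is not Solr's code: the
helper name strip_html_and_split is hypothetical, and the regex is a crude
stand-in for a real HTML parser.

```python
import re
from html import unescape


def strip_html_and_split(text):
    """Approximate 'remove markup, then split on whitespace'.

    Illustration only: a regex is not a real HTML parser, and this is
    not how Solr implements its tokenizer.
    """
    # Replace every <...> tag with a space so neighbouring words stay separate.
    without_tags = re.sub(r"<[^>]*>", " ", text)
    # Decode entities such as &amp;, then split on any run of whitespace.
    return unescape(without_tags).split()


# The sample field value from this message (inner content of the CDATA block).
sample = 'test      <a href="test">link</a>\n                                post'
tokens = strip_html_and_split(sample)
print(tokens)
```

For the sample value, this sketch produces the three tokens test, link and
post, which is the kind of output the question assumes the tokenizer produces.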

This is my config in schema.xml:

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
     </analyzer>
</fieldType>

<field name="content" type="text_ws" indexed="true" stored="true"
omitNorms="false"/>

Can someone help me with this?
