It is simply a question of whether you want the raw HTML stored in the field 
so that it can be returned to the application for display. If you want the 
HTML to go away as soon as possible, use “stripHTML”; in that case there is no 
need to use the factory on the field in the Solr schema. If you do want to 
preserve the HTML for later output, don’t use “stripHTML”, but do use the 
factory in the Solr schema: the indexed terms should not contain the HTML, 
even though the “stored” field value will retain the full HTML.
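
As a rough sketch of the two setups, something like the following could work. 
The entity name, the query, and the tokenizer choice are assumptions made for 
illustration and are not taken from the original question. Note that DIH only 
applies stripHTML when HTMLStripTransformer is listed on the entity, and that 
an analyzer built from factories normally needs a tokenizer as well.

```xml
<!-- Option A: discard the HTML at import time (data-import-config.xml).
     Entity name and query are hypothetical. -->
<entity name="articles" transformer="HTMLStripTransformer"
        query="SELECT body FROM articles">
  <field column="body" name="article" stripHTML="true"/>
</entity>

<!-- Option B: keep the HTML in the stored value, strip it only for indexing.
     data-import-config.xml: plain field mapping, no stripHTML. -->
<field column="body" name="article"/>

<!-- schema.xml: the tokenizer below is an assumed choice. -->
<fieldType name="textNoHtml" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="article" type="textNoHtml" indexed="true" stored="true"/>
```

With option A the stored value is already plain text; with option B a search 
result returns the original markup while the indexed terms are HTML-free.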

-- Jack Krupansky

From: Sergio Martín Cantero 
Sent: Friday, May 18, 2012 10:53 AM
To: solr-user@lucene.apache.org 
Subject: StripHTML and HTMLStripCharFilterFactory

Hello.
Could you tell me the difference between these two?

1) Having a DIH with a field in data-import-config.xml like this:
<field column="body" name="article" stripHTML="true"/>

2) Having the schema.xml with a field like this:
    <fieldType name="textNoHtml" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <charFilter class="solr.HTMLStripCharFilterFactory"/>
        </analyzer>
    </fieldType>

    <field name="article" type="textNoHtml" indexed="true" stored="true" />

I assume that when I call the DIH, it first removes the HTML, and then, when 
indexing, the HTML would be removed again, even though it was already removed 
by stripHTML in data-import-config.
So, does it make sense to declare a field as stripHTML="true" when that field 
will be stored in a field with an HTMLStripCharFilterFactory?

Thanks for your help.


  
Sergio Martín Cantero
playence Spain SL
Office (ES) +34 91 733 73 97
sergio.mar...@playence.com
Calle Vicente Gaceo 19
28029 Madrid - España
