It is simply a question of whether you want the raw HTML stored in the field so that it can be returned to the application for display. If you simply want the HTML to go away as soon as possible, use “stripHTML” in the DIH config; in that case there is no need to use HTMLStripCharFilterFactory on the field in the Solr schema. But if you do want to preserve the HTML for later output, don’t use “stripHTML”, but do use HTMLStripCharFilterFactory in the Solr schema, since the index should not contain the HTML even though the “stored” field value will retain the full HTML.
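
As a rough sketch of the second case (HTML kept in the stored value but kept out of the index), the two pieces of configuration could look something like the following. The field and type names are taken from the question; the tokenizer line is my own assumption, since an analyzer normally needs one, and StandardTokenizerFactory is only one possible choice.

```xml
<!-- data-import-config.xml: no stripHTML here, so the raw HTML reaches Solr -->
<field column="body" name="article"/>

<!-- schema.xml: the HTML is stripped only during index-time analysis -->
<fieldType name="textNoHtml" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- assumption: tokenizer added for illustration, not in the original question -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="article" type="textNoHtml" indexed="true" stored="true"/>
```

With a setup along these lines, the indexed terms come from the text content only, while the stored value returned in search results still contains the original markup.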

-- Jack Krupansky

From: Sergio Martín Cantero
Sent: Friday, May 18, 2012 10:53 AM
To: solr-user@lucene.apache.org
Subject: StripHTML and HTMLStripCharFilterFactory

Hello.

Could you tell me the difference between these two?

1) Having a DIH with a field in data-import-config.xml like this:

    <field column="body" name="article" stripHTML="true"/>

2) Having the Schema.xml with a field like this:

    <fieldType name="textNoHtml" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="article" type="textNoHtml" indexed="true" stored="true" />

I assume that when I call the DIH, it first removes the HTML, and then, when indexing, the HTML would be removed again, even though it was already removed by stripHTML in data-import-config. So, does it make sense to declare a field as stripHTML="true" when that field will be stored in a field with an HTMLStripCharFilterFactory?

Thanks for your help.

Sergio Martín Cantero
Office (ES) +34 91 733 73 97
playence Spain SL
sergio.mar...@playence.com
Calle Vicente Gaceo 19
28029 Madrid - España