best practice handling html content

Markus.Rietzler Mon, 19 Apr 2010 08:25:58 -0700

hello,

we want to index and search our intranet documents.
the field "body" contains html tags.


in our schema.xml we have a fieldType text_de (see the end of this mail)
which uses the charFilter solr.HTMLStripCharFilterFactory in its index analyzer.
this part works fine: the text goes into the index without any html. i can
search over this field, html entities like &auml; for a german umlaut
(ä) work, &nbsp; is filtered out correctly, german language support works, etc.

now comes the problem. the field body is defined like this:

<field name="body" type="text_de" indexed="true" stored="true" />

so we index it and also store the content. on the result page, when we
print body or the highlighting on body, all the html tags are back. that sounds
correct, as the html filter only works on the indexed tokens, not on the stored value...

so my question is: what is the best way to handle this case? should we strip
out all html before adding the document to the index?
or let solr do the html filtering at index time and then apply some additional
filtering in the GUI frontend when printing the search result?

or have i misunderstood something?
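
to make the first option concrete, here is a minimal sketch of stripping html on the client before the document is sent to solr. it is only an illustration, not something from our setup: it uses python's standard html.parser, and the names strip_html, BLOCK_TAGS and the sample document are made up for this example. the same helper could just as well be applied in the frontend to the stored value (second option).

```python
from html.parser import HTMLParser

# Assumption: tags that should act as word boundaries when removed.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, drops tags, and skips script/style content."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &auml; and &nbsp;
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # str.split() also treats the decoded &nbsp; (U+00A0) as whitespace
        return " ".join("".join(self._chunks).split())


def strip_html(markup):
    """Return the plain text of an html fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


# Hypothetical document as it might be built before posting it to solr.
raw_body = "<p>Gr&uuml;&szlig;e&nbsp;aus <b>M&uuml;nchen</b></p><script>x=1</script>"
doc = {"id": "intranet-1", "body": strip_html(raw_body)}
```

with this approach the stored value and the highlighting snippets contain no markup, at the cost of losing the original html in the index.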

thank you

markus


---- schema.xml ----

    <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory" />
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms_de.txt"
                ignoreCase="true"
                expand="false"
                />
        -->
        <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords_de.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"
                splitOnCaseChange="1"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"
                language="German"
                protected="protwords_de.txt"
                />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms_de.txt"
                ignoreCase="true"
                expand="true"
                />
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords_de.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                splitOnCaseChange="1"
                />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"
                language="German"
                protected="protwords_de.txt"
                />
      </analyzer>
    </fieldType>
