At 9:32 PM +1000 10/5/07, Adrian Sutton wrote:
>From what people are suggesting though you'd be better off converting to plain 
>text before indexing it with Solr. Something like JTidy (http://jtidy.sf.net) 
>can parse most HTML that's around and you can iterate over the DOM to extract 
>the text from there.

It depends entirely on the use-case.  You can fire HTML or XML at a Solr field 
(possibly wrapping it in a CDATA block, as just suggested by Pieter Berkel) and 
have it stored verbatim; what happens at index time then depends entirely on 
the Analyzer chain, which can treat tags and attributes as if they were text, 
remove them entirely, etc.  Alternatively, you can strip the markup before 
sending the data and so store and/or index just the text content.  You can also 
use XSLT or other means to extract data to be indexed in specific fields.  And, 
as Benoit Pauwels just wrote, a combination of these techniques might be the 
most appropriate for a particular application, e.g. field-specific search 
yielding marked-up documents.
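
To make the "store it verbatim" option concrete, here is a minimal Python 
sketch that builds a Solr XML <add> message carrying a marked-up document, 
either wrapped in CDATA or entity-escaped.  The field names "id" and "content" 
and the helper name solr_add_doc are assumptions for illustration only; use 
whatever your schema defines.

```python
from xml.sax.saxutils import escape


def solr_add_doc(doc_id, markup, use_cdata=True):
    """Build a Solr XML <add> message that carries markup verbatim.

    The field names "id" and "content" are hypothetical placeholders.
    """
    if use_cdata:
        # A CDATA section cannot contain "]]>", so split any occurrence
        # across two adjacent CDATA sections.
        body = "<![CDATA[" + markup.replace("]]>", "]]]]><![CDATA[>") + "]]>"
    else:
        # Plain entity escaping is equivalent once the XML is parsed.
        body = escape(markup)
    return (
        '<add><doc><field name="id">%s</field>'
        '<field name="content">%s</field></doc></add>'
        % (escape(str(doc_id)), body)
    )


payload = solr_add_doc("doc-1", '<topic type="location">Paris</topic>')
```

Either form should reach the field as the same string once the update message 
is parsed, so the choice is mostly about readability of the message.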

The HTMLStripXXX tokenizers appear to do a fine job of entity conversion and 
tag stripping, so if highlighting is not a consideration they make markup 
stripping very convenient: the document can be stored with its markup while 
only the text content is indexed.
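
For comparison, the "strip before sending" approach can be sketched on the 
client side with nothing but the Python standard library.  This is not how the 
HTMLStripXXX tokenizers are implemented; it only illustrates the same two 
steps (tag stripping and entity conversion), and the class and function names 
are made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content only; tags and attributes are dropped."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


stored = '<topic type="location">Paris</topic> &amp; <b>Lyon</b>'
indexed_text = strip_markup(stored)  # "Paris & Lyon"
```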

The primary issue with HTMLStripXXX arises when one wants to return the stored 
HTML/XML content with highlighting markup inserted around the text content 
while preserving the original markup.  For example, to have
    <topic type="location">Paris</topic>
highlighted as
    <topic type="location"><span class="highlighted">Paris</span></topic>

For that, the original marked-up version (rather than the stripped one) must be 
stored, a markup-stripped version should probably (but not necessarily) be 
indexed, and the offsets of the indexed tokens must point to the locations of 
those tokens in the stored version.  The HTMLStripXXX tokenizers do not account 
for the stripped content when computing offsets (tags and attributes, and also 
entities converted to characters), so the token /paris/ in the example above is 
given the offset of the opening <, and the highlighting falls within (and thus 
destroys) the <topic> tag.  The PatternTokenizer workaround posted to SOLR-42 
will fulfill this use-case.
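
The offset requirement is easier to see in code.  The following Python sketch 
is not the SOLR-42 patch and not Solr's highlighter; it only assumes a 
simplistic regex notion of "tag" and "entity" to show tokens carrying offsets 
into the original markup, so that highlighting can be spliced in without 
touching the tags.  All names are made up for the example.

```python
import re

# Simplistic patterns: a tag is "<...>", an entity is "&name;" or "&#123;".
# Attribute values containing ">" would defeat this; it is only a sketch.
MARKUP = re.compile(r"<[^>]*>|&#?\w+;")
WORD = re.compile(r"\w+")


def tokens_with_offsets(markup):
    """Return (term, start, end) with offsets into the ORIGINAL markup."""
    tokens = []
    pos = 0
    for skipped in MARKUP.finditer(markup):
        # Tokenize only the text between the previous markup and this one.
        for m in WORD.finditer(markup, pos, skipped.start()):
            tokens.append((m.group().lower(), m.start(), m.end()))
        pos = skipped.end()
    for m in WORD.finditer(markup, pos):
        tokens.append((m.group().lower(), m.start(), m.end()))
    return tokens


def highlight(markup, term,
              open_tag='<span class="highlighted">', close_tag="</span>"):
    """Wrap each occurrence of term in the text content, leaving tags intact."""
    out, last = [], 0
    for tok, start, end in tokens_with_offsets(markup):
        if tok == term.lower():
            out.append(markup[last:start])
            out.append(open_tag + markup[start:end] + close_tag)
            last = end
    out.append(markup[last:])
    return "".join(out)


stored = '<topic type="location">Paris</topic>'
result = highlight(stored, "paris")
```

Because the token offsets refer to the stored string rather than to the 
stripped text, the inserted span lands around "Paris" and the <topic> tag 
survives.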

But a different use-case might be for the highlighting to encompass the markup 
rather than just the text, e.g.
    <span class="highlighted"><topic type="location">Paris</topic></span>
which would have to be accomplished some other way.
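
One such "other way", purely as an assumption on my part, would be to 
post-process the stored document outside the highlighter.  The Python sketch 
below requires the stored content to be well-formed XML, treats an element as 
a hit only when its entire text equals the term, and uses a made-up function 
name; it is an illustration rather than anything Solr provides.

```python
import xml.etree.ElementTree as ET


def wrap_matching_elements(xml_text, term):
    """Wrap every non-root element whose text equals term in a highlight span."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for elem in list(root.iter()):
        if elem is root:
            continue
        if (elem.text or "").strip().lower() == term.lower():
            parent = parent_of[elem]
            index = list(parent).index(elem)
            span = ET.Element("span", {"class": "highlighted"})
            # Text following the element must now follow the wrapper instead.
            span.tail, elem.tail = elem.tail, None
            parent.remove(elem)
            span.append(elem)
            parent.insert(index, span)
    return ET.tostring(root, encoding="unicode")


doc = '<doc>Visit <topic type="location">Paris</topic> soon.</doc>'
wrapped = wrap_matching_elements(doc, "paris")
```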

- J.J.
