At 9:32 PM +1000 10/5/07, Adrian Sutton wrote:
>From what people are suggesting though you'd be better off converting to plain
>text before indexing it with Solr. Something like JTidy (http://jtidy.sf.net)
>can parse most HTML that's around and you can iterate over the DOM to extract
>the text from there.

It depends entirely on the use-case.

You can fire HTML or XML at a Solr field (possibly wrapping it in a CDATA block, as just suggested by Pieter Berkel) and have it stored verbatim. What happens at index time is then entirely dependent on the Analyzer chain: treat tags and attributes as if they were text, remove them entirely, etc.

You can strip the markup before sending the data, and so store and/or index just the text content.

You can use XSLT or other means to extract data to be indexed in specific fields.

And, as Benoit Pauwels just wrote, a combination of these techniques might be the most appropriate for a particular application, e.g. field-specific search yielding marked-up documents.

The HTMLStripXXX tokenizers appear to do a fine job of entity conversion and tag stripping. If highlighting is not a consideration, they make markup stripping very convenient: the document can be stored with its markup while only the text content is indexed.

The primary issue with HTMLStripXXX arises when one wants to return the stored HTML/XML content with highlighting markup inserted around the text content while preserving the original markup. For example, to have

    <topic type="location">Paris</topic>

highlighted as

    <topic type="location"><span class="highlighted">Paris</span></topic>

For that, the original marked-up version (rather than the stripped one) must be stored, a markup-stripped version should probably (but not necessarily) be indexed, and the offsets of the indexed tokens must point to the locations of those tokens in the stored version.

The HTMLStripXXX tokenizers ignore the offset of the stripped content (both tags and attributes, and also the change in length when entities are converted to characters), so the token /paris/ in the example above is given the offset of the opening <, and the highlighting falls within (and thus destroys) the <topic> tag.

The PatternTokenizer workaround posted to SOLR-42 will fulfill this use-case.
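
To make the offset requirement concrete, here is a minimal standalone sketch in Python. It is not the SOLR-42 patch and does not use any Solr or Lucene API; the function names (tokenize_with_offsets, highlight) are made up for illustration. It assumes simple, well-formed tags and deliberately ignores entities, comments and CDATA. The point is only that each token keeps its start/end position in the original marked-up string, so the highlight lands around the text and leaves the tag intact.

```python
import re

# Matches either a whole tag (which is skipped) or a run of word
# characters (which becomes a token). Simplifying assumption: no
# entities, comments, CDATA, or '>' inside attribute values.
TAG_OR_WORD = re.compile(r"<[^>]*>|(\w+)")


def tokenize_with_offsets(markup):
    """Return (token, start, end) tuples; offsets index into `markup`."""
    tokens = []
    for m in TAG_OR_WORD.finditer(markup):
        if m.group(1) is not None:  # a word, not a tag
            tokens.append((m.group(1).lower(), m.start(1), m.end(1)))
    return tokens


def highlight(markup, query,
              open_tag='<span class="highlighted">', close_tag="</span>"):
    """Wrap every token equal to `query` using its original offsets."""
    query = query.lower()
    out = []
    pos = 0
    for token, start, end in tokenize_with_offsets(markup):
        if token == query:
            out.append(markup[pos:start])  # untouched markup before the hit
            out.append(open_tag + markup[start:end] + close_tag)
            pos = end
    out.append(markup[pos:])  # remainder after the last hit
    return "".join(out)


doc = '<topic type="location">Paris</topic>'
print(tokenize_with_offsets(doc))
print(highlight(doc, "paris"))
```

Here the token /paris/ gets the offsets of the text "Paris" in the stored string rather than the offset of the opening <, which is the property the highlighter needs.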
But a different use-case might be for the highlighting to encompass the markup rather than just the text, e.g.

    <span class="highlighted"><topic type="location">Paris</topic></span>

which would have to be accomplished some other way.

- J.J.