Great, thank you for the input. My understanding of HTMLStripCharFilter is
that it strips HTML tags, which is not what I want; is this correct? I want
to keep the HTML tags intact.

On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> If by "extracting HTML content via cURL" you mean using SolrCell to parse
> html files, this seems to make sense. The sequence is that regardless of
> the file type, each file extraction "parser" will strip off all formatting
> and produce a raw text stream. Office, PDF, and HTML files are all treated
> the same in that way. Then, the unformatted text stream is sent through the
> field type analyzers to be tokenized into terms that Lucene can index. The
> input string to the field type analyzer is what gets stored for the field,
> but this occurs after the extraction file parser has already removed
> formatting.
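>
> To make that concrete, here is a rough sketch of what happens to a small
> HTML fragment along the way (illustrative values, not actual Tika output):
>
>   original file:     <p>Hello <b>world</b></p>
>   after extraction:  Hello world        (the parser has stripped the markup)
>   after analysis:    [hello] [world]    (the terms Lucene indexes)
>   stored value:      Hello world        (the markup was already gone)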
>
> No way for the formatting to be preserved in that case, other than to go
> back to the original input document before extraction parsing.
>
> If you really do want to preserve full HTML formatted text, you would need
> to define a field whose field type uses the HTMLStripCharFilter, and then
> add documents directly (bypassing the extraction parser) so that the raw
> HTML goes into that field.
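>
> For example, a minimal field type along these lines (the names here are
> just placeholders; pick whatever tokenizer and filters suit your search):
>
>   <fieldType name="html_text" class="solr.TextField">
>     <analyzer>
>       <!-- strips tags from the token stream only; the stored value
>            keeps the original markup -->
>       <charFilter class="solr.HTMLStripCharFilterFactory"/>
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>
>   <field name="html" type="html_text" indexed="true" stored="true"/>
>
> Because the stored value is captured before analysis, searches run against
> the stripped text while the stored field still returns the full HTML.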
>
> There may be some other way to hook into the update processing chain, but
> that may be too much effort compared to the HTML strip filter.
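>
> If you did explore that route, the hook point would be an update request
> processor chain in solrconfig.xml, roughly like this (the com.example
> factory is hypothetical; you would have to write it yourself):
>
>   <updateRequestProcessorChain name="keep-html">
>     <!-- hypothetical processor that copies the raw HTML into a stored
>          field before the standard indexing steps run -->
>     <processor class="com.example.KeepRawHtmlProcessorFactory"/>
>     <processor class="solr.LogUpdateProcessorFactory"/>
>     <processor class="solr.RunUpdateProcessorFactory"/>
>   </updateRequestProcessorChain>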
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: okayndc
> Sent: Monday, April 30, 2012 10:07 AM
> To: solr-user@lucene.apache.org
> Subject: Solr: extracting/indexing HTML via cURL
>
>
> Hello,
>
> Over the weekend I experimented with extracting HTML content via cURL, and
> I am wondering why the extraction/indexing process does not include the
> HTML tags. It seems as though the HTML tags are either being ignored or
> stripped somewhere in the pipeline. If this is the case, is it possible to
> include the HTML tags, as I would like to keep the formatted HTML intact?
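>
> For reference, the command I used looks roughly like this (the core URL,
> document id, and file name are placeholders):
>
>   curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
>     -F "myfile=@page.html"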
>
> Any help is greatly appreciated.
>
