Great, thank you for the input. My understanding of HTMLStripCharFilter is that it strips HTML tags, which is not what I want; is this correct? I want to keep the HTML tags intact.
On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> If by "extracting HTML content via cURL" you mean using SolrCell to parse
> html files, this seems to make sense. The sequence is that regardless of
> the file type, each file extraction "parser" will strip off all formatting
> and produce a raw text stream. Office, PDF, and HTML files are all treated
> the same in that way. Then, the unformatted text stream is sent through the
> field type analyzers to be tokenized into terms that Lucene can index. The
> input string to the field type analyzer is what gets stored for the field,
> but this occurs after the extraction file parser has already removed
> formatting.
>
> No way for the formatting to be preserved in that case, other than to go
> back to the original input document before extraction parsing.
>
> If you really do want to preserve full HTML formatted text, you would need
> to define a field whose field type uses the HTMLStripCharFilter and then
> directly add documents that direct the raw HTML to that field.
>
> There may be some other way to hook into the update processing chain, but
> that may be too much effort compared to the HTML strip filter.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: okayndc
> Sent: Monday, April 30, 2012 10:07 AM
> To: solr-user@lucene.apache.org
> Subject: Solr: extracting/indexing HTML via cURL
>
> Hello,
>
> Over the weekend I experimented with extracting HTML content via cURL and
> was just wondering why the extraction/indexing process does not include
> the HTML tags. It seems as though the HTML tags are either being ignored
> or stripped somewhere in the pipeline. If this is the case, is it possible
> to include the HTML tags, as I would like to keep the formatted HTML
> intact?
>
> Any help is greatly appreciated.
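[Editor's note for readers of the archive: to make Jack's suggestion concrete, here is a rough sketch of what such a setup might look like in a Solr 3.x-style schema.xml. The type and field names (html_text, body_html) are illustrative, not from the thread. Note that HTMLStripCharFilter only strips tags from the *indexed* token stream; the *stored* value keeps the raw HTML exactly as submitted, which is what makes this approach work.

    <fieldType name="html_text" class="solr.TextField">
      <analyzer>
        <!-- tags are removed from the tokens that get indexed;
             the stored value is untouched -->
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="body_html" type="html_text" indexed="true" stored="true"/>

You would then bypass SolrCell entirely and post the raw HTML to /update yourself, XML-escaping the markup so it survives the update message, along these lines:

    curl 'http://localhost:8983/solr/update?commit=true' \
      -H 'Content-Type: text/xml' --data-binary \
      '<add><doc>
         <field name="id">doc1</field>
         <field name="body_html">&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</field>
       </doc></add>'

Queries match against the tag-stripped tokens, while the stored value returned in search results is the original formatted HTML.]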