You can have two fields: one which is stripped, and another which
stores the original data. You can use copyField directives and make
the "stripped" field indexed but not stored, and the original field
stored but not indexed. You only have to upload the file once, and
only store the text once.
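
For illustration, here is a rough sketch of how such a two-field setup might look in schema.xml. The field names (content_raw, content_stripped) and the field type name (text_html_stripped) are made up for this example, and it assumes a "string" field type is already defined as in the stock example schema. Adjust to your own schema; this is not a drop-in configuration.

```xml
<!-- Hypothetical field type: strips HTML markup before tokenizing. -->
<fieldType name="text_html_stripped" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Original text with tags intact: stored for retrieval, not searchable. -->
<field name="content_raw" type="string" indexed="false" stored="true"/>

<!-- Stripped text: searchable, but not stored a second time. -->
<field name="content_stripped" type="text_html_stripped" indexed="true" stored="false"/>

<!-- Feed the same uploaded value into the searchable field. -->
<copyField source="content_raw" dest="content_stripped"/>
```

With something like this, queries would run against content_stripped while the returned document carries content_raw, so the HTML tags come back untouched.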
If you look
Great, thank you for the input. My understanding of HTMLStripCharFilter is
that it strips HTML tags, which is not what I want. Is this correct? I
want to keep the HTML tags intact.
On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky wrote:
If by "extracting HTML content via cURL" you mean using SolrCell to parse
html files, this seems to make sense. The sequence is that regardless of the
file type, each file extraction "parser" will strip off all formatting and
produce a raw text stream. Office, PDF, and HTML files are all treated