Re: Solr: extracting/indexing HTML via cURL

2012-05-02 Thread Lance Norskog
You can have two fields: one that is stripped, and another that stores the original data. You can use <copyField> directives and make the "stripped" field indexed but not stored, and the original field stored but not indexed. You only have to upload the file once, and only store the text once. If you look
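
A rough sketch of what such a two-field setup might look like in schema.xml, assuming the directive meant here is copyField. The field names content_raw and content_text and the type name text_html are hypothetical, chosen only for illustration; they do not come from this thread. The char filter only affects the tokens that get indexed, so the stored value in the original field keeps its HTML tags.

```xml
<!-- Hypothetical field names; adjust to your own schema. -->

<!-- Original HTML: stored so it can be returned, but not searchable. -->
<field name="content_raw" type="string" indexed="false" stored="true"/>

<!-- Stripped text: searchable, but not stored a second time. -->
<field name="content_text" type="text_html" indexed="true" stored="false"/>

<!-- Copy the incoming value of content_raw into content_text at index time. -->
<copyField source="content_raw" dest="content_text"/>

<!-- Assumed analyzer chain: strip HTML markup before tokenizing. -->
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With something like this, a query would search content_text and ask for content_raw in the field list (fl) to get the untouched HTML back.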

Re: Solr: extracting/indexing HTML via cURL

2012-04-30 Thread okayndc
Great, thank you for the input. My understanding of HTMLStripCharFilter is that it strips HTML tags, which is not what I want. Is this correct? I want to keep the HTML tags intact.

On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky wrote:
> If by "extracting HTML content via cURL" you mean using

Re: Solr: extracting/indexing HTML via cURL

2012-04-30 Thread Jack Krupansky
If by "extracting HTML content via cURL" you mean using SolrCell to parse HTML files, this seems to make sense. The sequence is that, regardless of the file type, each file extraction "parser" will strip off all formatting and produce a raw text stream. Office, PDF, and HTML files are all treated