Re: Solr: extracting/indexing HTML via cURL

Jack Krupansky Mon, 30 Apr 2012 08:55:44 -0700

If by "extracting HTML content via cURL" you mean using SolrCell to parse HTML files, this seems to make sense. The sequence is that, regardless of the file type, each file extraction "parser" will strip off all formatting and produce a raw text stream. Office, PDF, and HTML files are all treated the same in that way. Then, the unformatted text stream is sent through the field type analyzers to be tokenized into terms that Lucene can index. The input string to the field type analyzer is what gets stored for the field, but this occurs after the extraction file parser has already removed formatting.

There is no way for the formatting to be preserved in that case, other than to go back to the original input document before extraction parsing.
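To make the sequence concrete, here is a rough sketch in Python. It is not SolrCell or Tika; the standard library's html.parser stands in for the extraction parser and a whitespace split stands in for the field type analyzer. The names TextOnlyParser, extract_text, and tokenize are made up for illustration. The point is only that the markup is already gone by the time the text is stored and analyzed.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Stand-in for an extraction parser: keeps character data, drops tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Only text between tags arrives here; the tags themselves are discarded.
        self.chunks.append(data)


def extract_text(html):
    """Return the unformatted text stream for an HTML string."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def tokenize(text):
    """Crude stand-in for a field type analyzer: lowercase, split on whitespace."""
    return text.lower().split()


raw_html = "<p>Hello <b>Solr</b> world</p>"
stored_value = extract_text(raw_html)   # what the field would store
indexed_terms = tokenize(stored_value)  # what would be indexed as terms
```

Nothing downstream of extract_text ever sees the original tags, which is why they cannot be recovered from the stored field.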

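If you really do want to preserve full HTML formatted text, you would need to define a field whose field type uses the HTMLStripCharFilter and then directly add documents that direct the raw HTML to that field. That way the stored value keeps the HTML markup, while the char filter strips the tags before tokenization so that only the text is indexed.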
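As a sketch of the "directly add documents" step, the Python below builds a JSON add request that carries the raw HTML in a field, bypassing extraction entirely. The field name body_html, the document id, and the update URL are assumptions for illustration; the URL in particular depends on your Solr version and core layout, and the field would need to be declared in your schema with a field type that uses the HTML strip char filter. The request is only built here, not sent.

```python
import json
import urllib.request


def build_update_request(solr_update_url, doc_id, raw_html, field="body_html"):
    """Build (but do not send) a JSON add request carrying raw HTML.

    `field` is a hypothetical field name; it must match a stored field
    in the schema whose analyzer strips HTML before tokenizing.
    """
    payload = {"add": {"doc": {"id": doc_id, field: raw_html}}}
    return urllib.request.Request(
        solr_update_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


raw_html = "<p>Hello <b>Solr</b> world</p>"
request = build_update_request(
    "http://localhost:8983/solr/update/json?commit=true",  # assumed URL
    "doc1",
    raw_html,
)
# To actually send it: urllib.request.urlopen(request)
```

Because the HTML is sent as the field value itself, the stored content should come back from a query with its tags intact.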

There may be some other way to hook into the update processing chain, but that may be too much effort compared to the HTML strip filter.


-- Jack Krupansky

-----Original Message-----
From: okayndc
Sent: Monday, April 30, 2012 10:07 AM
To: solr-user@lucene.apache.org
Subject: Solr: extracting/indexing HTML via cURL

Hello,

Over the weekend I experimented with extracting HTML content via cURL and am
just wondering why the extraction/indexing process does not include the HTML
tags. It seems as though the HTML tags are either being ignored or stripped
somewhere in the pipeline. If this is the case, is it possible to include the
HTML tags, as I would like to keep the formatted HTML intact?

Any help is greatly appreciated.
