Sorry for the confusion. It is doable. If you feed the raw HTML into a field
that has the HTMLStripCharFilter, the stored value will retain the HTML
tags, while the indexed text will be stripped of the tags during analysis
and be searchable just like a normal text field. Then, search will not see
"<p>".
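A minimal schema.xml sketch of that setup (the "text_html" type and
"html_content" field names are just examples, not anything predefined):

  <fieldType name="text_html" class="solr.TextField">
    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- stored="true" returns the raw HTML; indexed="true" searches the stripped text -->
  <field name="html_content" type="text_html" indexed="true" stored="true"/>

With that one field, queries and highlighting run against the stripped
tokens while the returned value is the original HTML.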
-- Jack Krupansky
-----Original Message-----
From: okayndc
Sent: Tuesday, May 01, 2012 10:08 AM
To: solr-user@lucene.apache.org
Subject: Re: extracting/indexing HTML via cURL
Thank you Jack.
So, it's not doable/possible to search and highlight keywords within a
field that contains the raw formatted HTML, and to strip out the HTML tags
during analysis... so that a user would get back nothing if they did a
search for, e.g., <p>?
On Mon, Apr 30, 2012 at 5:17 PM, Jack Krupansky <j...@basetechnology.com> wrote:
I was thinking that you wanted to index the actual text from the HTML
page, but have the stored field value still have the raw HTML with tags.
If you just want to store only the raw HTML, a simple string field is
sufficient, but then you can't easily do a text search on it.
Or, you can have two fields: one string field for the raw HTML (stored,
but not indexed), and then do a CopyField to a text field that has the
HTMLStripCharFilter to strip the HTML tags and index only the text
(indexed, but not stored).
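A rough sketch of that two-field setup in schema.xml (field and type names
are placeholders; "text_html" stands for a TextField type whose analyzer
starts with HTMLStripCharFilterFactory):

  <field name="html_raw" type="string" indexed="false" stored="true"/>
  <field name="html_text" type="text_html" indexed="true" stored="false"/>
  <!-- at index time, copy the raw HTML into the searchable text field -->
  <copyField source="html_raw" dest="html_text"/>

You would then query against html_text but display html_raw.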
-- Jack Krupansky
-----Original Message-----
From: okayndc
Sent: Monday, April 30, 2012 5:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr: extracting/indexing HTML via cURL
Great, thank you for the input. My understanding of HTMLStripCharFilter is
that it strips HTML tags, which is not what I want ~ is this correct? I
want to keep the HTML tags intact.
On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky <j...@basetechnology.com> wrote:
If by "extracting HTML content via cURL" you mean using SolrCell to parse
HTML files, this seems to make sense. The sequence is that regardless of
the file type, each file extraction "parser" will strip off all formatting
and produce a raw text stream. Office, PDF, and HTML files are all treated
the same in that way. Then, the unformatted text stream is sent through
the field type analyzers to be tokenized into terms that Lucene can index.
The input string to the field type analyzer is what gets stored for the
field, but this occurs after the extraction file parser has already
removed the formatting.
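For reference, I assume the kind of request you're making looks something
like this (the URL, core, and literal.id value are just illustrative):

  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
       -F "myfile=@page.html"

By the time Tika hands the extracted text to the schema, the tags are
already gone, so the analyzers never see them.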
No way for the formatting to be preserved in that case, other than to go
back to the original input document before extraction parsing.
If you really do want to preserve full HTML formatted text, you would need
to define a field whose field type uses the HTMLStripCharFilter and then
directly add documents that direct the raw HTML to that field.
There may be some other way to hook into the update processing chain, but
that may be too much effort compared to the HTML strip filter.
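"Directly add documents" could mean, for example, posting an XML update
message yourself rather than going through /update/extract. A hypothetical
sketch (assumes an "html_content" field using the HTML-strip type; note
the raw HTML has to be XML-escaped inside the update message):

  curl "http://localhost:8983/solr/update?commit=true" \
       -H "Content-Type: text/xml" \
       --data-binary '<add><doc>
         <field name="id">doc1</field>
         <field name="html_content">&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</field>
       </doc></add>'

The char filter strips the markup for indexing, while the stored value
keeps the tags for display.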
-- Jack Krupansky
-----Original Message-----
From: okayndc
Sent: Monday, April 30, 2012 10:07 AM
To: solr-user@lucene.apache.org
Subject: Solr: extracting/indexing HTML via cURL
Hello,
Over the weekend I experimented with extracting HTML content via cURL and
was just wondering why the extraction/indexing process does not include
the HTML tags. It seems as though the HTML tags are either being ignored
or stripped somewhere in the pipeline.
If this is the case, is it possible to include the HTML tags, as I would
like to keep the formatted HTML intact?
Any help is greatly appreciated.