Actually, TagSoup's reason for existence is to clean up all of the messy HTML that's out in the wild.

Tika's HTML parser wraps this, and uses it to generate the stream of SAX events that it then consumes and turns into a normalized, XHTML 1.0-compliant data stream.
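
For anyone following along, here's a minimal sketch of using TagSoup on its own; the sample input is made up, and it assumes the TagSoup jar is on the classpath:

    import java.io.StringReader;
    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;

    public class TagSoupDemo {
        public static void main(String[] args) throws Exception {
            // TagSoup's Parser implements XMLReader, so it drops into any SAX pipeline
            XMLReader reader = new Parser();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    System.out.println("start: " + local);
                }
            });
            // Messy input: unclosed tags, no html/body wrapper; TagSoup repairs all of it
            reader.parse(new InputSource(new StringReader("<p>messy <b>html")));
        }
    }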

-- Ken

On Aug 25, 2010, at 7:22pm, Lance Norskog wrote:

This assumes that the HTML is of good quality. I don't know exactly what
your use case is. If you're crawling the web, you will find some very
screwed-up HTML.

On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler
<kkrugler_li...@transpac.com> wrote:

On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:

Wouldn't using NekoHTML (as an XML parser) and XPath be safer?
I guess it all depends on the "quality" of the source document.
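
As a minimal sketch of that NekoHTML-plus-XPath idea (the sample document is made up; note that NekoHTML reports HTML element names in upper case by default):

    import java.io.StringReader;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class NekoXPathDemo {
        public static void main(String[] args) throws Exception {
            String html = "<html><body><h1>Title</h1><p>body</p></body></html>";
            DOMParser parser = new DOMParser();  // NekoHTML's error-tolerant DOM parser
            parser.parse(new InputSource(new StringReader(html)));
            Document doc = parser.getDocument();
            // Element names are upper-cased by default, so the XPath matches H1
            XPath xpath = XPathFactory.newInstance().newXPath();
            System.out.println(xpath.evaluate("//H1", doc));  // prints "Title"
        }
    }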

If you're processing HTML, then you definitely want to use something like
NekoHTML or TagSoup.

Note that Tika uses TagSoup and makes it easy to do special processing of specific elements: you give it a content handler that gets fed a stream of cleaned-up HTML elements.
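
Here's a minimal sketch of that pattern, pointed at the original <H1>-<H3> question; the class name and sample input are made up:

    import java.io.ByteArrayInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.HtmlParser;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class HeadingHandler extends DefaultHandler {
        private final StringBuilder headings = new StringBuilder();
        private boolean inHeading;

        // Tika normalizes element names to lower-case XHTML
        private static boolean isHeading(String name) {
            return name.equals("h1") || name.equals("h2") || name.equals("h3");
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (isHeading(local)) inHeading = true;
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (isHeading(local)) { inHeading = false; headings.append('\n'); }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inHeading) headings.append(ch, start, length);
        }

        public String getHeadings() { return headings.toString(); }

        public static void main(String[] args) throws Exception {
            String html = "<h1>Title</h1><p>body</p><h2>Section</h2>";
            HeadingHandler handler = new HeadingHandler();
            new HtmlParser().parse(new ByteArrayInputStream(html.getBytes("UTF-8")),
                    handler, new Metadata(), new ParseContext());
            System.out.println(handler.getHeadings());  // "Title" and "Section"
        }
    }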

-- Ken

On Aug 25, 2010, at 2:09am, Lance Norskog wrote:

I would do this with regular expressions. There are a PatternAnalyzer
and a PatternTokenizer that do regular-expression-based text chopping
(I'm not sure how to make them do what you want). A more precise tool
is the RegexTransformer in the DataImportHandler.

Lance
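
For what it's worth, a minimal sketch of the pure-regex approach, with the caveat (from the rest of this thread) that it breaks down on malformed markup:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class HeadingRegexDemo {
        public static void main(String[] args) {
            String html = "<h1>Title</h1><p>body</p><H2 class=\"x\">Section</H2>";
            // Backreference \1 forces the closing tag level to match the opening one;
            // this only works on reasonably well-formed HTML
            Pattern p = Pattern.compile("<h([1-3])[^>]*>(.*?)</h\\1>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            Matcher m = p.matcher(html);
            while (m.find()) {
                System.out.println(m.group(2));  // inner text, nested tags included
            }
        }
    }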

On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
<aco...@wordsearchbible.com> wrote:

I'm quite new to Solr and wondering if the following is possible: in addition to normal full-text search, my users want to have the option to search only HTML heading innertext, i.e. content inside of <H1>, <H2>, or <H3> tags.
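
Tying the replies above together, one plausible wiring (field names and URL are made up; the client shown is the Solr 1.4-era SolrJ CommonsHttpSolrServer) is to extract heading text at index time and put it in its own field:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexHeadings {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "page-1");
            doc.addField("text", "full page text here");  // normal full-text field
            doc.addField("headings", "Title Section");    // heading-only text, e.g. from
                                                          // the Tika handler sketched above
            server.add(doc);
            server.commit();
            // Heading-only searches then become queries like q=headings:foo
        }
    }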


--
Lance Norskog
goks...@gmail.com

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
