On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:

Wouldn't using NekoHTML (as an XML parser) together with XPath be safer?
I guess it all depends on the "quality" of the source document.

If you're processing HTML then you definitely want to use something like NekoHTML or TagSoup.

Note that Tika uses TagSoup and makes it easy to do special processing of specific elements - you give it a content handler that gets fed a stream of cleaned-up HTML elements.
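To illustrate the content-handler approach, here is a minimal sketch of a SAX handler that collects only the text inside h1, h2 and h3 elements. It is an assumption-laden example, not Tika's or TagSoup's own code: the class name HeadingHandler and the helper parse() are made up for illustration, and the JDK's built-in SAX parser stands in for the cleaned-up event stream, so the sample input has to be well-formed XHTML. With real-world HTML you would hand the same handler to TagSoup or Tika instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers the text found inside h1/h2/h3 elements.
class HeadingHandler extends DefaultHandler {
    private final List<String> headings = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private int depth = 0; // greater than 0 while inside a heading element

    private static boolean isHeading(String name) {
        return name.equalsIgnoreCase("h1")
            || name.equalsIgnoreCase("h2")
            || name.equalsIgnoreCase("h3");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // qName is used because the default JDK parser is not namespace-aware.
        if (isHeading(qName) && depth++ == 0) {
            current.setLength(0); // start collecting a new heading
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(ch, start, length); // text of nested inline tags is kept
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isHeading(qName) && --depth == 0) {
            String text = current.toString().replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                headings.add(text);
            }
        }
    }

    List<String> getHeadings() {
        return headings;
    }

    // Convenience helper (an assumption for this sketch): parse well-formed XHTML.
    static List<String> parse(String xhtml) {
        try {
            HeadingHandler handler = new HeadingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getHeadings();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Title</h1><p>skip me</p>"
                     + "<h3>Sub <i>part</i></h3></body></html>";
        System.out.println(parse(xhtml));
    }
}
```

The collected strings could then go into a separate Solr field (for example a hypothetical "headings" field) so that users can restrict a search to heading text only.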

-- Ken

On 25 Aug 2010, at 02:09, Lance Norskog wrote:

I would do this with regular expressions. There is a Pattern Analyzer
and a Tokenizer which do regular expression-based text chopping. (I'm
not sure how to make them do what you want). A more precise tool is
the RegexTransformer in the DataImportHandler.
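As a rough sketch of the regular-expression idea, the following plain Java class pulls the inner text out of h1, h2 and h3 tags using java.util.regex. It is not the RegexTransformer or any Solr analyzer; the class name HeadingRegex and the method extractHeadings are invented for illustration, and the pattern assumes reasonably tidy markup (matching close tags, no nested headings). On messy HTML a regex like this can easily miss or mangle headings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex-based extraction of heading text.
class HeadingRegex {
    // <h1>..<h3> with optional attributes and a matching close tag;
    // case-insensitive, and '.' also matches line breaks.
    private static final Pattern HEADING = Pattern.compile(
        "<h([1-3])\\b[^>]*>(.*?)</h\\1\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Any remaining tag inside a heading, e.g. <b> or <a href=...>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static List<String> extractHeadings(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HEADING.matcher(html);
        while (m.find()) {
            // Drop nested inline tags, then collapse runs of whitespace.
            String text = TAG.matcher(m.group(2)).replaceAll("")
                             .replaceAll("\\s+", " ")
                             .trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body><H1>Genesis</H1><p>In the beginning</p>"
                    + "<h2 class=\"x\">Chapter <b>1</b></h2></body></html>";
        System.out.println(extractHeadings(html));
    }
}
```

The extracted strings would then have to be indexed into a field of their own for a headings-only search to work.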

Lance

On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
<aco...@wordsearchbible.com> wrote:
I'm quite new to Solr and wondering if the following is possible: in
addition to normal full-text search, my users want to have the option to
search only the inner text of HTML headings, i.e. content inside <H1>,
<H2>, or <H3> tags.


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



