This assumes that the HTML is of good quality. I don't know exactly what your use case is, but if you're crawling the web you will find some very screwed-up HTML.
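For what it's worth, here is roughly what the Tika route Ken describes below looks like - a minimal sketch only, assuming tika-parsers is on the classpath; the class name, sample HTML, and variable names are made up for illustration:

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.HtmlParser;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Collects the text of h1-h3 elements from the cleaned-up SAX event
    // stream that Tika's HtmlParser (TagSoup underneath) produces, even
    // when the input markup is broken.
    public class HeadingExtractor extends DefaultHandler {

        private final List<String> headings = new ArrayList<String>();
        private StringBuilder current;  // non-null only while inside a heading

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            if (localName.matches("h[1-3]")) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current != null && localName.matches("h[1-3]")) {
                headings.add(current.toString().trim());
                current = null;
            }
        }

        public List<String> getHeadings() {
            return headings;
        }

        public static void main(String[] args) throws Exception {
            String html = "<html><body><h1>Title</h1><p>body text</p>"
                        + "<h2 class=\"x\">Section</h2></body></html>";
            InputStream in = new ByteArrayInputStream(html.getBytes("UTF-8"));
            HeadingExtractor handler = new HeadingExtractor();
            new HtmlParser().parse(in, handler, new Metadata(), new ParseContext());
            System.out.println(handler.getHeadings());  // prints [Title, Section]
        }
    }

Whatever comes back from getHeadings() could then be indexed into a separate Solr field (a hypothetical "headings" field) alongside the full text, which is what the original question below asks for.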
On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
>
> On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:
>
>> Wouldn't the use of NekoHTML (as an XML parser) and XPath be safer?
>> I guess it all depends on the "quality" of the source document.
>
> If you're processing HTML then you definitely want to use something like
> NekoHTML or TagSoup.
>
> Note that Tika uses TagSoup and makes it easy to do special processing of
> specific elements - you give it a content handler that gets fed a stream of
> cleaned-up HTML elements.
>
> -- Ken
>
>> On 25-Aug-10 at 02:09, Lance Norskog wrote:
>>
>>> I would do this with regular expressions. There are a PatternAnalyzer
>>> and a PatternTokenizer which do regular-expression-based text chopping.
>>> (I'm not sure how to make them do what you want.) A more precise tool is
>>> the RegexTransformer in the DataImportHandler.
>>>
>>> Lance
>>>
>>> On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
>>> <aco...@wordsearchbible.com> wrote:
>>>>
>>>> I'm quite new to Solr and wondering if the following is possible: in
>>>> addition to normal full-text search, my users want the option to
>>>> search only HTML heading inner text, i.e. content inside of <H1>, <H2>,
>>>> or <H3> tags.
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g

--
Lance Norskog
goks...@gmail.com
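And for the regex route mentioned in the quoted thread, a rough sketch of the kind of pattern involved - brittle against exactly the screwed-up HTML described at the top, so treat it as an illustration only; the class name and sample input are made up:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class HeadingRegex {

        // (?i) makes it case-insensitive, (?s) lets a heading span lines;
        // the \1 backreference requires the matching close tag level.
        private static final Pattern HEADING =
                Pattern.compile("(?is)<h([1-3])[^>]*>(.*?)</h\\1\\s*>");

        public static void main(String[] args) {
            String html = "<H1>Title</H1><p>body</p>\n"
                        + "<h2 class=\"x\">A <em>Section</em></h2>";
            Matcher m = HEADING.matcher(html);
            while (m.find()) {
                // Strip any tags nested inside the heading before indexing it.
                String text = m.group(2).replaceAll("<[^>]+>", " ")
                                        .replaceAll("\\s+", " ").trim();
                System.out.println("h" + m.group(1) + ": " + text);
            }
        }
    }

The same sort of pattern is what would go into a RegexTransformer's regex attribute in a DataImportHandler config.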