Cool! I did not know that Tika had a thorough and careful HTML parser.

On Wed, Aug 25, 2010 at 7:49 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> Actually TagSoup's reason for existence is to clean up all of the messy HTML
> that's out in the wild.
>
> Tika's HTML parser wraps this, and uses it to generate the stream of SAX
> events that it then consumes and turns into a normalized XHTML 1.0-compliant
> data stream.
>
> -- Ken
>
> On Aug 25, 2010, at 7:22pm, Lance Norskog wrote:
>
>> This assumes that the HTML is good quality. I don't know exactly what
>> your use case is. If you're crawling the web you will find some very
>> screwed-up HTML.
>>
>> On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler
>> <kkrugler_li...@transpac.com> wrote:
>>>
>>> On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:
>>>
>>>> Wouldn't the usage of NekoHTML (as an XML parser) and XPath be
>>>> safer?
>>>> I guess it all depends on the "quality" of the source document.
>>>
>>> If you're processing HTML then you definitely want to use something like
>>> NekoHTML or TagSoup.
>>>
>>> Note that Tika uses TagSoup and makes it easy to do special processing of
>>> specific elements - you give it a content handler that gets fed a stream
>>> of cleaned-up HTML elements.
>>>
>>> -- Ken
>>>
>>>> On 25 Aug 2010, at 02:09, Lance Norskog wrote:
>>>>
>>>>> I would do this with regular expressions. There is a Pattern Analyzer
>>>>> and a Tokenizer which do regular expression-based text chopping. (I'm
>>>>> not sure how to make them do what you want). A more precise tool is
>>>>> the RegexTransformer in the DataImportHandler.
>>>>>
>>>>> Lance
>>>>>
>>>>> On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
>>>>> <aco...@wordsearchbible.com> wrote:
>>>>>>
>>>>>> I'm quite new to SOLR and wondering if the following is possible: in
>>>>>> addition to normal full text search, my users want to have the option
>>>>>> to search only HTML heading innertext, i.e. content inside of <H1>,
>>>>>> <H2>, or <H3> tags.
>>>>
>>>
>>> --------------------------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> e l a s t i c   w e b   m i n i n g
>>
>> --
>> Lance Norskog
>> goks...@gmail.com

--
Lance Norskog
goks...@gmail.com
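
For anyone reading this thread later, here is a rough sketch of the "content handler that gets fed a stream of elements" idea, applied to the original question of pulling out only the text inside <H1>, <H2>, or <H3> tags. It is not Tika or TagSoup code: it uses Python's standard-library html.parser as a stand-in, which tolerates sloppy markup but does not repair it the way TagSoup does. The names HeadingCollector and extract_headings are made up for illustration.

```python
from html.parser import HTMLParser

# Heading levels the original question asks about.
HEADING_TAGS = {"h1", "h2", "h3"}


class HeadingCollector(HTMLParser):
    """Hypothetical handler that collects the inner text of <h1>-<h3>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []      # finished heading strings, in document order
        self._open_tag = None   # heading tag currently open, if any
        self._buffer = []       # text chunks seen inside the open heading

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names, so <H1> arrives as "h1".
        if tag in HEADING_TAGS:
            self._flush()       # a new heading closes one left unclosed
            self._open_tag = tag

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._flush()

    def handle_data(self, data):
        # Text inside nested inline tags (<b>, <a>, ...) is kept as well.
        if self._open_tag is not None:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()           # keep a heading still open at end of input

    def _flush(self):
        # Join the chunks and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if self._open_tag is not None and text:
            self.headings.append(text)
        self._open_tag = None
        self._buffer = []


def extract_headings(html):
    """Return the text of every <h1>, <h2> and <h3> in the given markup."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


if __name__ == "__main__":
    sample = "<H1>Genesis <b>1</b></H1><p>body text</p><h2>Notes</h2>"
    print(extract_headings(sample))
```

The collected strings could then be indexed into their own field next to the full text, so that a search can be restricted to headings. With Tika, the same logic would presumably live in the content handler Ken mentions, reacting to the h1/h2/h3 start and end events.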