This assumes that the HTML is good quality. I don't know exactly what
your use case is. If you're crawling the web you will find some very
screwed-up HTML.
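
To make the caveat concrete: a regex pass like the sketch below (the class and method names are made up for illustration) pulls heading text out of clean HTML fine, but it will silently misbehave on the broken markup you find crawling the web — unclosed tags, headings split across attributes, etc.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name, not from any library.
public class HeadingRegex {
    // Naive pattern: matches <h1>..</h1> through <h3>..</h3>.
    // Assumes well-formed, properly closed heading tags.
    private static final Pattern HEADING =
        Pattern.compile("<h([1-3])[^>]*>(.*?)</h\\1>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HEADING.matcher(html);
        while (m.find()) {
            // Strip any inline tags left inside the heading text.
            out.add(m.group(2).replaceAll("<[^>]+>", "").trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>text</p><h2>Sub <em>part</em></h2>";
        System.out.println(extract(html)); // prints [Title, Sub part]
    }
}
```

An unclosed `<h2>` would simply be skipped (or swallow everything up to the next closing heading tag), which is exactly the failure mode the quality caveat is about.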

On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler
<kkrugler_li...@transpac.com> wrote:
>
> On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:
>
>> Wouldn't using NekoHTML (as an XML parser) and XPath be safer?
>> I guess it all depends on the "quality" of the source document.
>
> If you're processing HTML then you definitely want to use something like
> NekoHTML or TagSoup.
>
> Note that Tika uses TagSoup and makes it easy to do special processing of
> specific elements - you give it a content handler that gets fed a stream of
> cleaned-up HTML elements.
>
> -- Ken
>
>> Le 25-août-10 à 02:09, Lance Norskog a écrit :
>>
>>> I would do this with regular expressions. There are a PatternAnalyzer
>>> and a PatternTokenizer that do regular-expression-based text chopping.
>>> (I'm not sure how to make them do what you want.) A more precise tool
>>> is the RegexTransformer in the DataImportHandler.
>>>
>>> Lance
>>>
>>> On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
>>> <aco...@wordsearchbible.com> wrote:
>>>>
>>>> I'm quite new to Solr and wondering if the following is possible: in
>>>> addition to normal full-text search, my users want the option to
>>>> search only HTML heading innertext, i.e. content inside <H1>, <H2>,
>>>> or <H3> tags.
>>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
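
A minimal sketch of the content-handler pattern Ken describes. Tika itself isn't shown here — the handler below runs on the JDK's SAX parser against already well-formed XHTML, and the class name is made up for illustration. With Tika you would hand the same kind of handler to HtmlParser.parse(...), and TagSoup would repair broken markup before your handler ever sees it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative name; with Tika you'd pass an instance of this to
// HtmlParser.parse(stream, handler, metadata, context) instead.
public class HeadingHandler extends DefaultHandler {
    private final List<String> headings = new ArrayList<>();
    private StringBuilder current;  // non-null while inside an h1/h2/h3

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (qName.matches("(?i)h[1-3]")) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        if (current != null) current.append(ch, start, len);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current != null && qName.matches("(?i)h[1-3]")) {
            headings.add(current.toString().trim());
            current = null;
        }
    }

    public List<String> getHeadings() { return headings; }

    public static List<String> parse(String xhtml) throws Exception {
        HeadingHandler h = new HeadingHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), h);
        return h.getHeadings();
    }
}
```

The collected heading strings could then be indexed into a separate Solr field, which is what the original question was after.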



-- 
Lance Norskog
goks...@gmail.com
