Cool! I did not know that Tika had a thorough and careful HTML parser.
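
For the original question (searching only heading text), the content-handler approach Ken describes might look roughly like the sketch below. Treat it as a sketch under assumptions: the class name HeadingCollector and the helper extractHeadings are made up for illustration, and the handler is fed by the JDK's built-in SAX parser on well-formed XHTML so that it runs without extra jars. With Tika, the same kind of handler would instead be handed to its HTML parser, which (per Ken) generates the SAX events from TagSoup-cleaned markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the inner text of h1/h2/h3 elements
// from a stream of SAX events.
class HeadingCollector extends DefaultHandler {
    private final List<String> headings = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private int depth = 0; // > 0 while inside a heading element

    // Namespace-aware sources fill localName; the default JDK parser
    // only fills qName, so fall back to it.
    private static boolean isHeading(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return name.equalsIgnoreCase("h1")
            || name.equalsIgnoreCase("h2")
            || name.equalsIgnoreCase("h3");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isHeading(localName, qName)) {
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text inside nested inline tags (<b>, <i>, ...) is kept as well.
        if (depth > 0) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isHeading(localName, qName) && --depth == 0) {
            headings.add(current.toString().trim());
            current.setLength(0);
        }
    }

    List<String> getHeadings() {
        return headings;
    }

    // Stand-in for the Tika/TagSoup step: parses well-formed XHTML only.
    static List<String> extractHeadings(String xhtml) {
        HeadingCollector handler = new HeadingCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return handler.getHeadings();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Genesis</h1><p>In the beginning</p>"
                     + "<h2>Chapter <b>1</b></h2></body></html>";
        System.out.println(extractHeadings(xhtml)); // prints [Genesis, Chapter 1]
    }
}
```

The collected strings could then be indexed in their own field, separate from the full text, so that a heading-only search just queries that field.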

On Wed, Aug 25, 2010 at 7:49 PM, Ken Krugler
<kkrugler_li...@transpac.com> wrote:
> Actually TagSoup's reason for existence is to clean up all of the messy HTML
> that's out in the wild.
>
> Tika's HTML parser wraps this, and uses it to generate the stream of SAX
> events that it then consumes and turns into a normalized XHTML 1.0-compliant
> data stream.
>
> -- Ken
>
> On Aug 25, 2010, at 7:22pm, Lance Norskog wrote:
>
>> This assumes that the HTML is good quality. I don't know exactly what
>> your use case is. If you're crawling the web you will find some very
>> screwed-up HTML.
>>
>> On Wed, Aug 25, 2010 at 6:45 AM, Ken Krugler
>> <kkrugler_li...@transpac.com> wrote:
>>>
>>> On Aug 24, 2010, at 10:55pm, Paul Libbrecht wrote:
>>>
>>>> Wouldn't using NekoHTML (as an XML parser) and XPath be
>>>> safer?
>>>> I guess it all depends on the "quality" of the source document.
>>>
>>> If you're processing HTML then you definitely want to use something like
>>> NekoHTML or TagSoup.
>>>
>>> Note that Tika uses TagSoup and makes it easy to do special processing of
>>> specific elements - you give it a content handler that gets fed a stream
>>> of cleaned-up HTML elements.
>>>
>>> -- Ken
>>>
>>>> On 25 Aug 2010, at 02:09, Lance Norskog wrote:
>>>>
>>>>> I would do this with regular expressions. There is a Pattern Analyzer
>>>>> and a Tokenizer which do regular expression-based text chopping. (I'm
>>>>> not sure how to make them do what you want). A more precise tool is
>>>>> the RegexTransformer in the DataImportHandler.
>>>>>
>>>>> Lance
>>>>>
>>>>> On Tue, Aug 24, 2010 at 7:08 AM, Andrew Cogan
>>>>> <aco...@wordsearchbible.com> wrote:
>>>>>>
>>>>>> I'm quite new to SOLR and wondering if the following is possible: in
>>>>>> addition to normal full text search, my users want to have the option
>>>>>> to search only HTML heading innertext, i.e. content inside of <H1>, <H2>,
>>>>>> or <H3> tags.
>>>>
>>>
>>> --------------------------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> e l a s t i c   w e b   m i n i n g
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>



-- 
Lance Norskog
goks...@gmail.com
