Re: Indexing HTML

Ken Krugler Wed, 09 Jun 2010 22:26:56 -0700


On Jun 9, 2010, at 8:38pm, Blargy wrote:

What is the preferred way to index HTML using DIH (my HTML is stored in a
blob field in our database)?
I know there is the built-in HTMLStripTransformer, but that doesn't seem to
work well with malformed/incomplete HTML. I've created a custom transformer
to first tidy up the HTML using JTidy, then I pass it to the
HTMLStripTransformer like so:

<field column="description" name="description" tidy="true"
ignoreErrors="true" propertiesFile="config/tidy.properties"/>
<field column="description" name="description" stripHTML="true"/>

However, this method isn't foolproof, as you can see from my ignoreErrors
option.
I took a quick peek at Tika and noticed that it has its own HtmlParser.
Is this something I should look into? Are there any alternatives that deal
with malformed/incomplete HTML? Thanks

Actually the Tika HtmlParser just wraps TagSoup - that's a good option for cleaning up busted HTML.
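
To illustrate the general idea of running broken markup through a lenient parser and keeping only the text, here is a minimal sketch. It is an assumption-laden stand-in, not TagSoup or Tika: it uses Python's standard-library html.parser (which also tolerates unclosed tags) so that it runs without extra dependencies, and the names TextExtractor and strip_html are hypothetical. It could serve as a preprocessing step before the text reaches the index.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and ignores markup (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when it is not script or style content.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Joining with a space keeps words in adjacent elements apart;
        # the trade-off is that a word split by inline tags is also split.
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup):
    """Return the visible text of possibly malformed HTML."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    broken = "<div><p>Unclosed <b>bold text<p>Next &amp; last</div>"
    print(strip_html(broken))  # prints: Unclosed bold text Next & last
```

Unlike TagSoup, this sketch does not repair the document into a well-formed tree; it only walks whatever tags and text the parser can recognize, which is usually enough when the goal is plain text for indexing.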


-- Ken

--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225




--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
