On Jun 9, 2010, at 8:38pm, Blargy wrote:


What is the preferred way to index HTML using DIH (my HTML is stored in a
blob field in our database)?

I know there is the built-in HTMLStripTransformer, but that doesn't seem to
work well with malformed/incomplete HTML. I've created a custom transformer
to first tidy up the HTML using JTidy, then I pass it to the
HTMLStripTransformer like so:

<field column="description" name="description" tidy="true"
ignoreErrors="true" propertiesFile="config/tidy.properties"/>
<field column="description" name="description" stripHTML="true"/>

However, this method isn't foolproof, as you can see from my ignoreErrors
option.

I quickly took a peek at Tika and I noticed that it has its own HtmlParser.
Is this something I should look into? Are there any alternatives that deal
with malformed/incomplete HTML? Thanks.

Actually, the Tika HtmlParser just wraps TagSoup, which is a good option for cleaning up busted HTML.
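
As a rough sketch of how that could look (not tested against DIH): TagSoup's parser class, org.ccil.cowan.tagsoup.Parser, is a SAX XMLReader, so the text extraction can be written against the plain SAX API and the parser swapped in. The class and method names below (HtmlTextExtractor, TextCollector, extractText) are made up for illustration. So that the sketch runs without extra jars, main uses the JDK's strict XML parser, which only accepts well-formed input; for malformed HTML you would pass a TagSoup Parser instead, assuming TagSoup is on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class HtmlTextExtractor {

    /** Collects character data from SAX events, adding a space at each element end. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Keep words from adjacent elements from running together.
            text.append(' ');
        }

        String text() {
            // Collapse runs of whitespace and trim the ends.
            return text.toString().replaceAll("\\s+", " ").trim();
        }
    }

    /**
     * Parses the markup with the given reader and returns only its text.
     * Assumption: for broken HTML, pass new org.ccil.cowan.tagsoup.Parser()
     * as the reader (requires the TagSoup jar).
     */
    static String extractText(XMLReader reader, String html) throws Exception {
        TextCollector collector = new TextCollector();
        reader.setContentHandler(collector);
        reader.parse(new InputSource(new StringReader(html)));
        return collector.text();
    }

    public static void main(String[] args) throws Exception {
        // JDK parser used here only as a stand-in; it rejects malformed markup.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        System.out.println(extractText(reader, "<p>Hello <b>world</b></p>"));
    }
}
```

Something along these lines could sit inside a custom DIH transformer in place of the JTidy + HTMLStripTransformer pair, though I haven't wired it up that way myself.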

-- Ken

--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225




--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



