On Jun 9, 2010, at 8:38pm, Blargy wrote:
What is the preferred way to index HTML using DIH (my HTML is stored in a
blob field in our database)?
I know there is the built-in HTMLStripTransformer, but that doesn't seem to
work well with malformed/incomplete HTML. I've created a custom transformer
to first tidy up the HTML using JTidy, then I pass it to the
HTMLStripTransformer like so:
<field column="description" name="description" tidy="true"
ignoreErrors="true" propertiesFile="config/tidy.properties"/>
<field column="description" name="description" stripHTML="true"/>
However, this method isn't foolproof, as you can see from my ignoreErrors
option.
I quickly took a peek at Tika and noticed that it has its own HtmlParser.
Is this something I should look into? Are there any alternatives that deal
with malformed/incomplete HTML? Thanks

Actually the Tika HtmlParser just wraps TagSoup - that's a good option
for cleaning up busted HTML.
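
A minimal sketch of how TagSoup could be used for this, assuming the TagSoup
jar is on the classpath. TagSoup's org.ccil.cowan.tagsoup.Parser is a SAX
XMLReader, so the helper below accepts any XMLReader and collects the text it
reports. The names HtmlTextExtractor, TextCollector and extractText are made
up for illustration; they are not part of Solr, Tika or TagSoup.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: pulls plain text out of markup via SAX events.
final class HtmlTextExtractor {

    /** Collects character data, skipping the bodies of script and style elements. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int skipDepth = 0;

        private static boolean isSkipped(String localName, String qName) {
            // Non-namespace-aware parsers leave localName empty, so fall back to qName.
            String name = (localName == null || localName.isEmpty()) ? qName : localName;
            return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (isSkipped(localName, qName)) {
                skipDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (isSkipped(localName, qName)) {
                if (skipDepth > 0) {
                    skipDepth--;
                }
            } else if (skipDepth == 0) {
                // Separate text of adjacent elements; this also splits words
                // that contain inline tags, which a real transformer may not want.
                text.append(' ');
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (skipDepth == 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            // Collapse runs of whitespace into single spaces.
            return text.toString().replaceAll("\\s+", " ").trim();
        }
    }

    /** Parses the markup with the given SAX reader and returns the collected text. */
    static String extractText(XMLReader reader, String html) throws IOException, SAXException {
        TextCollector collector = new TextCollector();
        reader.setContentHandler(collector);
        reader.parse(new InputSource(new StringReader(html)));
        return collector.text();
    }
}
```

Calling HtmlTextExtractor.extractText(new org.ccil.cowan.tagsoup.Parser(), html)
would run this on TagSoup, which turns broken markup into a well-formed stream
of SAX events instead of failing. Wiring the result into a custom DIH
transformer is left out here.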
-- Ken
--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g