What is the preferred way to index HTML using DIH? My HTML is stored in a
blob field in our database.

I know there is the built-in HTMLStripTransformer, but it doesn't seem to
work well with malformed/incomplete HTML. I've created a custom transformer
that first tidies up the HTML using JTidy and then passes the result to the
HTMLStripTransformer, like so:

<field column="description" name="description" tidy="true"
ignoreErrors="true" propertiesFile="config/tidy.properties"/>
<field column="description" name="description" stripHTML="true"/>

However, this method isn't foolproof, as you can see from my ignoreErrors
option.
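
For comparison, here is a minimal sketch of a tidy-and-strip step done in a
single lenient pass. It is not the JTidy-based transformer described above:
it swaps in the JDK's own HTML parser (ParserDelegator), which tolerates
unclosed and misnested tags, and keeps only the text nodes. The class name
LenientHtmlStrip, the hard-coded "description" column, and the UTF-8
decoding of the blob are assumptions made for illustration. The
transformRow(Map) method follows the reflection-based shape that DIH
accepts for transformers that do not extend Transformer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical transformer; declare it public (in its own file) before
// referencing it from a DIH entity's transformer attribute.
class LenientHtmlStrip {

    // Assumed column name, taken from the field config in this message.
    private static final String COLUMN = "description";

    // DIH can invoke a method with this signature by reflection.
    public Object transformRow(Map<String, Object> row) {
        Object value = row.get(COLUMN);
        if (value == null) {
            return row; // nothing to strip
        }
        // A blob column may arrive as byte[]; UTF-8 is an assumption here.
        String html = (value instanceof byte[])
                ? new String((byte[]) value, StandardCharsets.UTF_8)
                : value.toString();
        row.put(COLUMN, extractText(html));
        return row;
    }

    // Collects the text nodes reported by the JDK's lenient HTML parser.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // separate adjacent text nodes
                }
                text.append(data);
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException | RuntimeException e) {
            // Same spirit as ignoreErrors="true": keep the raw value.
            return html;
        }
        return text.toString();
    }
}
```

Calling extractText("<p>Hello <b>world</p>") should yield text containing
"Hello" and "world" with no markup left. Whether this copes with real-world
input better than JTidy plus HTMLStripTransformer would need testing
against the actual data.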

I took a quick look at Tika and noticed that it has its own HtmlParser.
Is this something I should look into? Are there any alternatives that deal
with malformed/incomplete HTML? Thanks.






-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html
Sent from the Solr - User mailing list archive at Nabble.com.
