What is the preferred way to index HTML using the DataImportHandler (DIH)? My HTML is stored in a BLOB field in our database.
I know there is the built-in HTMLStripTransformer, but it doesn't seem to work well with malformed or incomplete HTML. I've created a custom transformer that first tidies up the HTML using JTidy, and I then pass the result to the HTMLStripTransformer, like so:

    <field column="description" name="description" tidy="true" ignoreErrors="true" propertiesFile="config/tidy.properties"/>
    <field column="description" name="description" stripHTML="true"/>

However, this method isn't foolproof, as you can see from my ignoreErrors option.

I took a quick look at Tika and noticed that it has its own HtmlParser. Is this something I should look into? Are there any alternatives that deal with malformed or incomplete HTML?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html
Sent from the Solr - User mailing list archive at Nabble.com.
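
For reference, here is a minimal sketch of how a two-step chain like the one described in this message is typically wired into a DIH entity. The entity name, the query, and the class name com.example.TidyTransformer are placeholders (assumptions, not taken from the actual configuration). The tidy, ignoreErrors, and propertiesFile attributes belong to the custom transformer, not to Solr itself. DIH applies transformers in the order they are listed in the transformer attribute, so the JTidy step would run before HTMLStripTransformer:

```xml
<!-- Sketch only: entity name, query, and com.example.TidyTransformer are hypothetical. -->
<entity name="item"
        query="SELECT id, description FROM item"
        transformer="com.example.TidyTransformer,HTMLStripTransformer">
  <!-- Read by the custom JTidy transformer (custom attributes, not built into Solr) -->
  <field column="description" name="description" tidy="true" ignoreErrors="true" propertiesFile="config/tidy.properties"/>
  <!-- Read by the built-in HTMLStripTransformer, which runs second in the chain -->
  <field column="description" name="description" stripHTML="true"/>
</entity>
```

If the tidy step fails and ignoreErrors is set, whatever the custom transformer leaves in the row is what HTMLStripTransformer receives, which is presumably where the malformed input slips through.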