The HTMLStripCharFilter variants (e.g. HTMLStripCharFilterFactory) are newer
and might work better.

On Wed, Jun 9, 2010 at 8:38 PM, Blargy <zman...@hotmail.com> wrote:
>
> What is the preferred way to index html using DIH (my html is stored in a
> blob field in our database)?
>
> I know there is the built in HTMLStripTransformer but that doesn't seem to
> work well with malformed/incomplete HTML. I've created a custom transformer
> to first tidy up the html using JTidy then I pass it to the
> HTMLStripTransformer like so:
>
> <field column="description" name="description" tidy="true"
> ignoreErrors="true" propertiesFile="config/tidy.properties"/>
> <field column="description" name="description" stripHTML="true"/>
>
> However this method isn't fool-proof as you can see by my ignoreErrors
> option.
>
> I quickly took a peek at Tika and I noticed that it has its own HtmlParser.
> Is this something I should look into? Are there any alternatives that deal
> with malformed/incomplete html? Thanks
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
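
For illustration, here is a rough sketch of how the char filter mentioned at
the top of this reply could be wired into schema.xml. The field type name
"text_html" and the choice of tokenizer and lowercase filter are assumptions
made for the example, not something taken from the original setup; only the
"description" field name comes from the quoted config. It also assumes a Solr
version that supports <charFilter> in the analyzer chain.

```xml
<!-- schema.xml (sketch): "text_html" is a hypothetical field type name -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Strips HTML markup from the character stream before tokenizing -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Tokenizer and filter below are illustrative choices only -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Point the existing "description" field at the new type -->
<field name="description" type="text_html" indexed="true" stored="true"/>
```

Note that a char filter only affects what gets tokenized and indexed. The
stored value of the field would still contain the raw HTML, so stripping in
DIH may still be wanted if the stored text is shown to users.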
-- 
Lance Norskog
goks...@gmail.com