Re: Indexing HTML

Lance Norskog Wed, 09 Jun 2010 20:41:03 -0700

The HTMLStripChar variants are newer and might work better.
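
For reference, here is a minimal sketch of what that might look like in
schema.xml. It assumes that "HTMLStripChar variants" refers to
solr.HTMLStripCharFilterFactory. The field type name "text_html" and the
choice of tokenizer and filter are illustrative assumptions, not something
taken from your setup.

```xml
<!-- schema.xml sketch; "text_html" is a hypothetical field type name -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Strips markup from the character stream before tokenizing -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumed mapping: the "description" field from the DIH config uses the type above -->
<field name="description" type="text_html" indexed="true" stored="true"/>
```

Note that a char filter runs at analysis time, so only the indexed terms are
stripped; the stored value still contains the raw markup. The DIH stripHTML
option, by contrast, changes the value before it reaches the field.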

On Wed, Jun 9, 2010 at 8:38 PM, Blargy <zman...@hotmail.com> wrote:
>
> What is the preferred way to index HTML using DIH (my HTML is stored in a
> blob field in our database)?
>
> I know there is the built-in HTMLStripTransformer, but that doesn't seem to
> work well with malformed/incomplete HTML. I've created a custom transformer
> that first tidies up the HTML using JTidy, then I pass the result to the
> HTMLStripTransformer like so:
>
> <field column="description" name="description" tidy="true"
> ignoreErrors="true" propertiesFile="config/tidy.properties"/>
> <field column="description" name="description" stripHTML="true"/>
>
> However, this method isn't foolproof, as you can see from my ignoreErrors
> option.
>
> I took a quick look at Tika and noticed that it has its own HtmlParser.
> Is this something I should look into? Are there any alternatives that deal
> with malformed/incomplete HTML? Thanks
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>




-- 
Lance Norskog
goks...@gmail.com
