The HTMLStripReader tool worked very well for us, and it copes well with garbled HTML. The only hole we found is that it does not pick up the alt text (the alt attribute) of images.
Also, note that this code is written as a Java Reader class rather than a Solr class. This makes it useful for other projects. Given the amount of string processing it does, the fact that it is a Reader probably does not affect its performance.

Cheers,

Lance

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 22, 2008 10:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing HTML Content

John,

Solr already has some of this stuff:

$ ff \*HTML\*java
./src/test/org/apache/solr/analysis/HTMLStripReaderTest.java
./src/java/org/apache/solr/analysis/HTMLStripStandardTokenizerFactory.java
./src/java/org/apache/solr/analysis/HTMLStripReader.java
./src/java/org/apache/solr/analysis/HTMLStripWhitespaceTokenizerFactory.java

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "McBride, John" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 4:44:23 AM
> Subject: Indexing HTML Content
>
> Hello,
>
> In my application I wish to index articles which are stored in HTML
> format.
>
> Upon indexing these the html gets stored along with the content of the
> article, which is undesirable.
>
> Do you know of any common way of parsing the text content from HTML
> before adding to SOLR? I understand SOLR 1.3 has an HTML analyser,
> but I am using SOLR 1.2 and won't use 1.3 until it's stable, so
> looking for a solution to work on a batch of files before being added to SOLR.
>
> Thanks,
> John
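
For the batch case in the original question (stripping markup before the documents ever reach Solr 1.2), the Reader design means the stripper can be consumed like any other java.io.Reader. The sketch below shows that pattern only. NaiveTagStripReader is a hypothetical stand-in written for this example, not Solr's HTMLStripReader: it drops everything between '<' and '>' and does nothing about entities, comments or scripts. The class and method names are illustrative, and swapping in the real class assumes it offers a constructor that takes a Reader.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class StripBeforeIndexing {

    /**
     * Hypothetical stand-in for a Reader-based stripper. This is NOT
     * Solr's HTMLStripReader; it only discards text between '<' and '>'.
     */
    static class NaiveTagStripReader extends FilterReader {
        private boolean inTag = false;

        NaiveTagStripReader(Reader source) {
            super(source);
        }

        @Override
        public int read() throws IOException {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '<') {
                    inTag = true;          // start skipping a tag
                } else if (c == '>' && inTag) {
                    inTag = false;         // tag finished, resume output
                } else if (!inTag) {
                    return c;              // ordinary text character
                }
            }
            return -1;                     // end of the underlying stream
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            int n = 0;
            while (n < len) {
                int c = read();
                if (c == -1) {
                    break;
                }
                buf[off + n++] = (char) c;
            }
            return n == 0 ? -1 : n;
        }
    }

    /** Drains any Reader into a String; works with whichever stripper is plugged in. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf, 0, buf.length)) != -1) {
            out.append(buf, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        // Assumption: the real stripper could be substituted here if it
        // accepts a Reader in its constructor.
        try (Reader r = new NaiveTagStripReader(new StringReader(html))) {
            System.out.println(readAll(r)); // prints "Hello world"
        }
    }
}
```

The resulting string is what would go into the field of the document sent to Solr. Because readAll only depends on java.io.Reader, the same loop serves a batch of files by wrapping a FileReader for each one.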