John, Solr already has some of this stuff:

$ ff \*HTML\*java
./src/test/org/apache/solr/analysis/HTMLStripReaderTest.java
./src/java/org/apache/solr/analysis/HTMLStripStandardTokenizerFactory.java
./src/java/org/apache/solr/analysis/HTMLStripReader.java
./src/java/org/apache/solr/analysis/HTMLStripWhitespaceTokenizerFactory.java

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "McBride, John" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 4:44:23 AM
> Subject: Indexing HTML Content
>
> Hello,
>
> In my application I wish to index articles which are stored in HTML
> format.
>
> Upon indexing these the html gets stored along with the content of the
> article, which is undesirable.
>
> Do you know of any common way of parsing the text content from HTML
> before adding to SOLR? I understand SOLR 1.3 has an HTML analyser, but
> I am using SOLR 1.2 and won't use 1.3 until it's stable, so looking for
> a solution to work on a batch of files before being added to SOLR.
>
> Thanks,
> John
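
For the batch case John describes (stripping markup before the documents ever reach Solr 1.2), here is a minimal sketch of one possible approach. It does not use the Solr classes listed above; it swaps in the HTML parser that ships with the JDK (javax.swing.text.html) so that it has no external dependencies. The class name HtmlTextExtractor and the method extractText are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Solr: pulls the text content out of an
// HTML string using the JDK's built-in HTML parser.
public class HtmlTextExtractor {

    public static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of text between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // A space is appended after every run, so text split by an
                // inline tag in mid-word (e.g. wor<b>ld</b>) comes out as
                // two words. That is acceptable for a sketch.
                text.append(data).append(' ');
            }
        };

        try {
            // Third argument true: ignore any charset declared in the
            // document, since the input is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Collapse runs of whitespace left behind by the removed markup.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The returned string could then be placed in the field of the document sent to Solr in place of the raw HTML. The JDK parser is lenient but old, so real-world pages with heavy scripting or malformed markup may need a more robust parser.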