John,

Solr already has some HTML-stripping classes in its source tree:

$ ff \*HTML\*java
./src/test/org/apache/solr/analysis/HTMLStripReaderTest.java
./src/java/org/apache/solr/analysis/HTMLStripStandardTokenizerFactory.java
./src/java/org/apache/solr/analysis/HTMLStripReader.java
./src/java/org/apache/solr/analysis/HTMLStripWhitespaceTokenizerFactory.java
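
If you would rather strip the markup from the batch of files before they reach Solr, something along these lines might be enough. This is only a rough sketch using Python's standard html.parser module, not the Solr classes listed above; the names TextExtractor and strip_html are made up for illustration, and I am assuming it is fine to drop script and style contents entirely.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the plain text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with a space so adjacent blocks do not run together,
    # then collapse the whitespace left behind by the removed tags.
    return " ".join(" ".join(parser.parts).split())
```

You would call strip_html() on the contents of each file and put the result in the field you send to Solr. Note that joining text nodes with a space also splits a word that has inline markup in the middle of it, which is usually acceptable for indexing but worth knowing about.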


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "McBride, John" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 4:44:23 AM
> Subject: Indexing HTML Content
> 
> Hello,
> 
> In my application I wish to index articles which are stored in HTML
> format.
> 
> Upon indexing these, the HTML markup gets stored along with the content
> of the article, which is undesirable.
> 
> Do you know of any common way of parsing the text content from HTML
> before adding to SOLR?  I understand SOLR 1.3 has an HTML analyser, but
> I am using SOLR 1.2 and won't use 1.3 until it's stable, so looking for
> a solution to work on a batch of files before being added to SOLR.
> 
> Thanks,
> John
