HTMLStripReader worked very well for us, and it copes well with garbled
HTML. The only hole we found is that it does not pick up the alt-text
attributes of images.

Also, note that this code is written as a plain Java Reader class rather
than a Solr-specific class, which makes it useful for other projects. Given
the amount of string processing it does, the overhead of going through the
Reader interface is probably negligible.
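
Because it is just a Reader, using it from another project should amount to
wrapping a source Reader and reading the result to the end. A minimal sketch
of that pattern follows. The class name ReaderDrain and the method readAll
are made up for illustration, and the HTMLStripReader(Reader) constructor
mentioned in the comment is an assumption, so check it against the source
file before relying on it. To stay runnable without the Solr jar, the sketch
reads from a plain StringReader, which means the markup comes back unchanged.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderDrain {

    /** Reads every character from the given Reader and returns them as one String. */
    public static String readAll(Reader in) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        // read() returns -1 once the underlying stream is exhausted
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            out.append(buf, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<p>Hello <b>world</b></p>";
        // With the Solr jar on the classpath, the source below would be wrapped,
        // e.g. new HTMLStripReader(new StringReader(html)) (assumed constructor).
        // Here it is a plain StringReader, so the text is printed unmodified.
        try (Reader source = new StringReader(html)) {
            System.out.println(readAll(source));
        }
    }
}
```

Since readAll only depends on java.io.Reader, the same helper would work on
a FileReader per article when processing a batch of files.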

Cheers,

Lance

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 22, 2008 10:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing HTML Content

John,

Solr already has some of this stuff:

$ ff \*HTML\*java
./src/test/org/apache/solr/analysis/HTMLStripReaderTest.java
./src/java/org/apache/solr/analysis/HTMLStripStandardTokenizerFactory.java
./src/java/org/apache/solr/analysis/HTMLStripReader.java
./src/java/org/apache/solr/analysis/HTMLStripWhitespaceTokenizerFactory.java


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "McBride, John" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 4:44:23 AM
> Subject: Indexing HTML Content
> 
> Hello,
> 
> In my application I wish to index articles which are stored in HTML 
> format.
> 
> Upon indexing these the html gets stored along with the content of the 
> article, which is undesirable.
> 
> Do you know of any common way of parsing the text content from HTML 
> before adding to SOLR?  I understand SOLR 1.3 has an HTML analyser, 
> but I am using SOLR 1.2 and won't use 1.3 until it's stable, so 
> looking for a solution to work on a batch of files before being added to
> SOLR.
> 
> Thanks,
> John

