You are in luck, because Avi Rappoport has just written a tutorial about how to 
do this. It is available from Lucid Imagination:
http://www.lucidimagination.com/solutions/whitepapers/Indexing-Text-and-HTML-Files-with-Solr

I've just started reviewing it, but knowing Avi, I expect it to be very helpful.

wunder

On Mar 2, 2010, at 8:28 AM, Siddhant Goel wrote:

> There is an HTML filter documented here, which might be of some help -
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
> 
> Control characters can be eliminated using code like this -
> http://bitbucket.org/cogtree/python-solr/src/tip/pythonsolr/pysolr.py#cl-449
> 
> On Tue, Mar 2, 2010 at 9:37 PM, György Frivolt <gyorgy.friv...@gmail.com> wrote:
> 
>> Hi, how do I properly index HTML documents? All the documents are HTML, some
>> containing characters encoded like &#x17E;&#xED; ... Is there a character
>> filter for these codes? Is there a way to strip the HTML tags out?
>> Does Solr weight the terms in a document based on where they appear? Words
>> in headers (H1, H2, ...) would be expected to describe the document better
>> than words in paragraphs.
>> 
>> Thanks for help,
>> 
>>  Georg
>> 
> 
> 
> 
> -- 
> - Siddhant
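
For anyone who prefers to clean documents on the client before posting them to Solr, here is a minimal sketch of the two steps discussed above: stripping tags (decoding character references such as &#x17E;&#xED; along the way) and removing control characters. It uses only the Python standard library. It is not the code behind solr.HTMLStripCharFilterFactory or the linked pysolr.py; the names html_to_text and _TextExtractor are made up for illustration, and the choice to drop script and style content is an assumption.

```python
import re
from html.parser import HTMLParser

# C0 control characters that XML 1.0 does not allow (tab, LF and CR are kept).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, discards tags."""

    def __init__(self):
        # convert_charrefs=True decodes references like &#x17E; into text.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Assumption: script and style bodies are not worth indexing.
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join text nodes with a space and collapse runs of whitespace.
        # Note: this also inserts a space where an inline tag splits a word.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(markup):
    """Return plain text from an HTML string, without control characters."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return _CONTROL_CHARS.sub("", parser.text())
```

Calling html_to_text() on each document before building the add request gives Solr plain text to analyze. Doing the stripping inside Solr with the char filter linked above avoids the extra client step, so treat this only as an alternative when the indexing client already has to touch the content.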
