You are in luck, because Avi Rappoport has just written a tutorial about how to do this. It is available from Lucid Imagination:
http://www.lucidimagination.com/solutions/whitepapers/Indexing-Text-and-HTML-Files-with-Solr

I've just started reviewing it, but knowing Avi, I expect it to be very helpful.

wunder

On Mar 2, 2010, at 8:28 AM, Siddhant Goel wrote:

> There is an HTML filter documented here, which might be of some help -
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>
> Control characters can be eliminated using code like this -
> http://bitbucket.org/cogtree/python-solr/src/tip/pythonsolr/pysolr.py#cl-449
>
> On Tue, Mar 2, 2010 at 9:37 PM, György Frivolt
> <gyorgy.friv...@gmail.com> wrote:
>
>> Hi, how do I properly index HTML documents? All the documents are HTML, some
>> containing characters encoded like ží ... Is there a character
>> filter for filtering these codes? Is there a way to strip the HTML tags
>> out?
>> Does Solr weight the terms in the document based on where they appear?
>> Words in headers (H1, H2, ...) would be expected to describe the document
>> better than words in paragraphs.
>>
>> Thanks for the help,
>>
>> Georg
>>
>
> --
> - Siddhant
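
For the tag stripping, entity decoding, and control-character removal discussed in the thread above, here is a minimal client-side sketch in Python that cleans a document before it is posted to Solr. It is not the code from the pysolr link and not what HTMLStripCharFilterFactory does internally; the names html_to_text, TextExtractor, and CONTROL_CHARS are made up for illustration, and it uses only the Python standard library.

```python
import re
from html.parser import HTMLParser

# Assumption: we drop the C0 control characters that XML 1.0 forbids,
# keeping tab, line feed, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class TextExtractor(HTMLParser):
    """Collects text content and skips the bodies of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &#382; and &iacute;
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Return plain text: tags removed, entities decoded, control chars dropped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps block elements apart, at the cost of
    # splitting a word that is broken up by inline tags.
    text = " ".join(parser.parts)
    text = CONTROL_CHARS.sub(" ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>&#382;&iacute;t a\x01b</p>"))
```

Doing this on the client means the stored field value is already clean. If only the indexed tokens need to be tag-free, the char filter Siddhant linked can be configured in the field type's analyzer instead.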