There is an HTML filter documented here, which might be of some help - http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
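If you would rather clean the documents on the client side before posting them to Solr, something along these lines might work. This is only a rough sketch using the Python standard library, not what the Solr filter does internally. The names `clean_html` and `_TextExtractor` are made up for illustration. It drops tags, decodes character references, and removes control characters that are not allowed in XML. It does not skip the contents of script or style elements, and joining text chunks with a space can split a word that has inline markup in the middle of it.

```python
import re
from html.parser import HTMLParser

# Control characters that are not allowed in XML 1.0.
# Tab (\x09), line feed (\x0a) and carriage return (\x0d) are kept.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text nodes and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &#382; or &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(markup):
    """Return the plain text of an HTML string, ready to be indexed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    text = " ".join(parser.parts)
    text = _CONTROL_CHARS.sub("", text)
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join(text.split())
```

Calling `clean_html` on each document body before building the update request would give Solr plain text, so no HTML-specific analysis would be needed on the server for that field.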
Control characters can be eliminated using code like this - http://bitbucket.org/cogtree/python-solr/src/tip/pythonsolr/pysolr.py#cl-449

On Tue, Mar 2, 2010 at 9:37 PM, György Frivolt <gyorgy.friv...@gmail.com> wrote:

> Hi,
>
> How do I properly index HTML documents? All the documents are HTML, some
> containing characters encoded like ží ... Is there a character
> filter for filtering these codes? Is there a way to strip the HTML tags
> out?
> Does Solr weight the terms in the document based on where they appear?
> Words in headers (H1, H2, ...) would be supposed to describe the document
> more than words in paragraphs.
>
> Thanks for help,
>
> Georg

--
- Siddhant