Solr has a char filter that strips HTML markup, documented here, which might be of some help -
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
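
If you would rather clean the documents on the client before sending them to Solr, something along these lines might work. This is only a sketch using the Python standard library, not the Solr filter itself; the names TextExtractor and strip_html are made up for illustration. It drops the tags and decodes character references such as &#x17E;&#xED; into the actual characters.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content only; tags are discarded (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as
        # &#x17E; or &amp; before they reach handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Note that this keeps the contents of script and style elements as text, so real pages may need extra handling.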

Control characters can be eliminated using code like this -
http://bitbucket.org/cogtree/python-solr/src/tip/pythonsolr/pysolr.py#cl-449
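
In case the link goes stale, here is a rough sketch of the idea. It is not a copy of the pysolr code; the function name strip_control_chars is my own, and I am assuming the goal is to drop the C0 control characters that XML 1.0 does not allow, while keeping tab, newline and carriage return.

```python
import re

# C0 control characters other than tab (\x09), LF (\x0a) and CR (\x0d).
# These are not legal in XML 1.0 documents.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text):
    """Remove control characters from text before it is sent for indexing."""
    return _CONTROL_CHARS.sub("", text)
```

You would call this on each field value before building the update request.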

On Tue, Mar 2, 2010 at 9:37 PM, György Frivolt <gyorgy.friv...@gmail.com> wrote:

> Hi, how do I properly index HTML documents? All the documents are HTML, some
> containing characters encoded like &#x17E;&#xED; ... Is there a character
> filter for filtering these codes? Is there a way to strip the HTML tags out?
> Does Solr weight the terms in a document based on where they appear? Words
> in headers (H1, H2, ...) should describe the document better than words in
> paragraphs.
>
> Thanks for help,
>
>   Georg
>



-- 
- Siddhant
