Thank you! That's even more than I wanted to know. ;)

Georg

On Tue, Mar 2, 2010 at 10:05 PM, Walter Underwood <wun...@wunderwood.org> wrote:

> You are in luck, because Avi Rappoport has just written a tutorial about
> how to do this. It is available from Lucid Imagination:
>
> http://www.lucidimagination.com/solutions/whitepapers/Indexing-Text-and-HTML-Files-with-Solr
>
> I've just started reviewing it, but knowing Avi, I expect it to be very
> helpful.
>
> wunder
>
> On Mar 2, 2010, at 8:28 AM, Siddhant Goel wrote:
>
> > There is an HTML filter documented here, which might be of some help -
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
> >
> > Control characters can be eliminated using code like this -
> > http://bitbucket.org/cogtree/python-solr/src/tip/pythonsolr/pysolr.py#cl-449
> >
> > On Tue, Mar 2, 2010 at 9:37 PM, György Frivolt <gyorgy.friv...@gmail.com> wrote:
> >
> >> Hi, how do I properly index HTML documents? All the documents are HTML,
> >> some containing characters encoded like ží ... Is there a character
> >> filter for filtering these codes? Is there a way to strip the HTML tags
> >> out?
> >> Does Solr weight the terms in the document based on where they appear?
> >> Words in headers (H1, H2, ...) would be expected to describe the
> >> document more than words in paragraphs.
> >>
> >> Thanks for help,
> >>
> >> Georg
> >
> > --
> > - Siddhant
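
For readers who would rather clean the documents on the client before posting them to Solr, the two suggestions quoted above (strip the HTML, drop control characters) can be sketched in Python roughly as follows. This is only an illustration using the standard library; it is not the code from the pysolr link, nor what solr.HTMLStripCharFilterFactory does internally. The names TextExtractor and clean_html are made up for this example, and the sketch does not skip <script> or <style> contents.

```python
import re
from html.parser import HTMLParser

# Control characters that XML 1.0 does not allow (tab, LF and CR are fine).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps the text nodes and drops the tags."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &#382; or &iacute;
        # before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(markup):
    """Return the plain text of `markup` with control characters removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Joining with a space keeps "Title" and the following paragraph apart,
    # at the cost of splitting words that contain inline tags.
    text = " ".join(parser.parts)
    text = CONTROL_CHARS.sub(" ", text)
    # Collapse runs of whitespace left behind by tags and replacements.
    return " ".join(text.split())
```

Calling clean_html() on each document and sending the result as the field value is one option; letting Solr strip the markup at analysis time, as Siddhant suggests, is the other.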