Re: Indexing HTML document

2010-03-03 Thread György Frivolt
Thank you! That's even more than I wanted to know. ;)
Georg
On Tue, Mar 2, 2010 at 10:05 PM, Walter Underwood wrote:
> You are in luck, because Avi Rappoport has just written a tutorial about
> how to do this. It is available from Lucid Imagination:
>
> http://www.lucidimagination.com/solutions/wh

Re: Indexing HTML document

2010-03-02 Thread Walter Underwood
You are in luck, because Avi Rappoport has just written a tutorial about how to do this. It is available from Lucid Imagination: http://www.lucidimagination.com/solutions/whitepapers/Indexing-Text-and-HTML-Files-with-Solr I've just started reviewing it, but knowing Avi, I expect it to be very he

Re: Indexing HTML document

2010-03-02 Thread Siddhant Goel
There is an HTML filter documented here, which might be of some help - http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Control characters can be eliminated using code like this - http://bitbucket.org/cogtree/python-solr/src/tip/pythonsolr/pysolr.py#cl-44
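For readers who cannot follow the link, here is a minimal sketch of the control-character cleanup described above. It is not the linked pysolr code; the function name `strip_control_chars` and the exact character ranges are assumptions chosen for illustration. The ranges cover the C0 control characters that XML 1.0 does not allow, so tab, line feed, and carriage return are kept.

```python
import re

# Assumption: the text will be sent to Solr as XML, so drop the C0 control
# characters that XML 1.0 forbids. Tab (\x09), LF (\x0a), and CR (\x0d)
# are legal in XML and are left alone.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text: str) -> str:
    """Return `text` with XML-illegal control characters removed."""
    return _CONTROL_CHARS.sub("", text)


if __name__ == "__main__":
    raw = "Hello\x00 wor\x0bld\tfrom\nSolr"
    # repr() makes the surviving tab and newline visible in the output.
    print(repr(strip_control_chars(raw)))
```

Something like this would run on each field value before the document is posted for indexing; stripping the HTML markup itself would still be left to the filter mentioned above.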