Re: Indexing HTML document

György Frivolt Wed, 03 Mar 2010 00:21:25 -0800

Thank you! That's even more than I wanted to know. ;)

Georg



On Tue, Mar 2, 2010 at 10:05 PM, Walter Underwood <wun...@wunderwood.org> wrote:

> You are in luck, because Avi Rappoport has just written a tutorial about
> how to do this. It is available from Lucid Imagination:
>
>
> http://www.lucidimagination.com/solutions/whitepapers/Indexing-Text-and-HTML-Files-with-Solr
>
> I've just started reviewing it, but knowing Avi, I expect it to be very
> helpful.
>
> wunder
>
> On Mar 2, 2010, at 8:28 AM, Siddhant Goel wrote:
>
> > There is an HTML filter documented here, which might be of some help:
> >
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
> >
> > Control characters can be eliminated using code like this:
> >
> > http://bitbucket.org/cogtree/python-solr/src/tip/pythonsolr/pysolr.py#cl-449
> >
> > On Tue, Mar 2, 2010 at 9:37 PM, György Frivolt <gyorgy.friv...@gmail.com> wrote:
> >
> >> Hi, how do I properly index HTML documents? All the documents are HTML,
> >> some containing characters encoded like &#x17E;&#xED; ... Is there a
> >> character filter for filtering these codes? Is there a way to strip the
> >> HTML tags out?
> >> Does Solr weight the terms in the document based on where they appear?
> >> Words in headers (H1, H2, ...) would be expected to describe the document
> >> better than words in paragraphs.
> >>
> >> Thanks for help,
> >>
> >>  Georg
> >>
> >
> >
> >
> > --
> > - Siddhant
>
>
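
For the character-reference and control-character questions raised in the thread, here is a minimal client-side sketch in Python, assuming the text is cleaned before it is sent to Solr. It is not the code behind the pysolr link above; the function name clean_text and the exact set of characters removed are assumptions made for illustration.

```python
import html
import re

# Assumption: remove the C0 control characters that XML 1.0 does not allow,
# i.e. everything below U+0020 except tab, newline and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_text(raw: str) -> str:
    """Decode HTML character references, then drop control characters.

    Hypothetical helper for illustration; not part of Solr or pysolr.
    """
    # html.unescape turns references such as &#x17E; or &amp; into characters.
    decoded = html.unescape(raw)
    # Decoding first means control characters written as references go too.
    return _CONTROL_CHARS.sub("", decoded)


if __name__ == "__main__":
    print(clean_text("&#x17E;&#xED;"))
```

Decoding before stripping is a deliberate choice here: a control character written as a numeric reference would otherwise survive the cleanup.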

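On the question of weighting header words: once the tags are stripped, the information about where a word appeared is gone, so one possible approach is to split heading text from body text before indexing and store them in separate fields that can be boosted differently at query time (for example through a dismax qf parameter). The sketch below assumes that approach and uses only the Python standard library; the class name HeadingSplitter, the function split_html, and the field names "headings" and "body" are hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}  # content that should not be indexed


class HeadingSplitter(HTMLParser):
    """Collect heading text and remaining text separately (illustrative only)."""

    def __init__(self):
        # convert_charrefs=True decodes references like &#x17E; in text data.
        super().__init__(convert_charrefs=True)
        self.headings = []
        self.body = []
        self._heading_depth = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._heading_depth += 1
        elif tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._heading_depth:
            self._heading_depth -= 1
        elif tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        target = self.headings if self._heading_depth else self.body
        target.append(text)


def split_html(markup: str) -> dict:
    """Return tag-free text as two hypothetical fields: headings and body."""
    parser = HeadingSplitter()
    parser.feed(markup)
    parser.close()
    return {
        "headings": " ".join(parser.headings),
        "body": " ".join(parser.body),
    }
```

The resulting dictionary could then be sent as one document with two text fields. Whether heading text should also be copied into the body field is a design decision this sketch leaves open.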