Thanks Adrian. I'm very new to Solr myself, so I'm struggling a bit in
the initial stages...

One last question: when you send HTML to Solr, do you also replace special
chars and tags with named entities?  I did this, and HTMLStripper
doesn't seem to recognise the tags :-S  But if I try to input the
HTML as is, the indexer throws exceptions (as having tags within XML tags
is obviously not valid).  How should I handle this part?
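
For what it's worth, here is a minimal sketch of the escaping I have in mind, assuming the document goes to Solr as an XML <add> message. The field names "id" and "content" and the helper build_add_message are made up for illustration. The HTML is escaped exactly once, so the XML stays well-formed; as far as I understand, the XML parser on the Solr side should undo that one level of escaping, so the field ought to hold the original tags again. If the stripper still sees entities instead of tags, one thing to check is whether the text got escaped twice.

```python
from xml.sax.saxutils import escape


def build_add_message(doc_id, html):
    """Wrap raw HTML in a Solr <add> update message, escaping it once.

    The field names "id" and "content" are assumptions; use whatever
    your schema defines.
    """
    # escape() turns &, < and > into &amp;, &lt; and &gt;, so the markup
    # travels as character data and the message stays well-formed XML.
    return (
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="content">{escape(html)}</field>'
        "</doc></add>"
    )


message = build_add_message("page-1", "<p>Fish &amp; chips</p>")
print(message)
```

Parsing the message back with any XML parser should give the original HTML string, which is a quick way to check that nothing was escaped twice.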

Ravish

On 10/5/07, Adrian Sutton <[EMAIL PROTECTED]> wrote:
> On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:
> > (Query esp. Adrian):
> >
> > If you are indexing XHTML, do you replace tags with entities before
> > giving it to solr, if so, when you get back snippets do you get tags
> > or entities or do you convert again to tags for presentation?  What's
> > the best way out?  It would help me a lot if you briefly explain your
> > configuration.
>
> We happen to develop an HTML editor, so we know 100% for certain that
> the XHTML is valid XML. Given that, we just throw the raw XHTML at
> Solr, which uses the HTMLStripWhitespaceTokenizer. However, at this
> stage we haven't configured highlighting at all, so our index is used
> for search and retrieving a document ID. At some point I'd like to
> add highlighting and it sounds like the best way to do so would be to
> index the document text instead of the HTML.
>
> Beyond that, we also use Solr as an optimization for extracting
> information such as what content was most recently changed, which
> pages link to others etc. On the page linking, we actually identify
> what pages are linked to prior to indexing and store them as a
> separate field - Solr itself has no understanding of the linking.
>
> Oh and I should note, I'm very new to Solr so I'm probably not doing
> things the best way, but I'm getting great results anyway.
>
> Regards,
>
> Adrian Sutton
> http://www.symphonious.net
>
>
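
The two ideas in Adrian's reply, indexing the document text rather than the HTML and working out the linked pages before indexing so they can go in a separate field, could be sketched roughly like this. It is only an illustration using Python's standard html.parser, not Adrian's actual setup; the class TextAndLinks, the helper extract and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collect the visible text and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record link targets so they can be stored in their own field.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Character data between tags; entities are already decoded.
        self.text_parts.append(data)


def extract(xhtml):
    """Return (plain_text, links) for one XHTML document."""
    parser = TextAndLinks()
    parser.feed(xhtml)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    text = " ".join("".join(parser.text_parts).split())
    return text, parser.links


text, links = extract(
    '<p>See <a href="/pages/home">the home page</a> &amp; more.</p>'
)
print(text)
print(links)
```

The plain text would then go into the field used for search and highlighting, and the links into a separate multi-valued field. Note that this sketch also keeps the contents of any script or style elements, which a real setup would probably want to skip.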
