Thanks Adrian, I'm very new to Solr myself so struggling a bit in initial stages...
One last one: when you send HTML to Solr, do you also replace special chars and tags with named entities? I did this and HTMLStripper doesn't seem to recognise the tags :-S Whereas if I try to input the HTML as is, the indexer throws exceptions (as having tags within XML tags is obviously not valid). How should I do this part?

Ravish

On 10/5/07, Adrian Sutton <[EMAIL PROTECTED]> wrote:
> On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:
> > (Query esp. Adrian):
> >
> > If you are indexing XHTML, do you replace tags with entities before
> > giving it to solr, if so, when you get back snippets do you get tags
> > or entities or do you convert again to tags for presentation? What's
> > the best way out? It would help me a lot if you briefly explain your
> > configuration.
>
> We happen to develop a HTML editor so we know 100% for certain that
> the XHTML is valid XML. Given that we just throw the raw XHTML at
> Solr which uses the HTMLStripWhitespaceTokenizer. However, at this
> stage we haven't configured highlighting at all, so our index is used
> for search and retrieving a document ID. At some point I'd like to
> add highlighting and it sounds like the best way to do so would be to
> index the document text instead of the HTML.
>
> Beyond that, we also use Solr as an optimization for extracting
> information such as what content was most recently changed, which
> pages link to others etc. On the page linking, we actually identify
> what pages are linked to prior to indexing and store them as a
> separate field - Solr itself has no understanding of the linking.
>
> Oh and I should note, I'm very new to Solr so I'm probably not doing
> things the best way, but I'm getting great results anyway.
>
> Regards,
>
> Adrian Sutton
> http://www.symphonious.net
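
P.S. For reference, the entity replacement I asked about at the top of this mail looks roughly like the sketch below. It is only a minimal illustration in Python, under some assumptions: the field names `id` and `content` and the helper `build_add_message` are made up for the example, and it assumes the usual Solr XML `<add><doc><field>` update message layout.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_add_message(doc_id, xhtml):
    """Wrap one document in an XML <add> message, escaping the markup.

    The field names "id" and "content" are hypothetical; use whatever
    the schema actually defines.
    """
    # escape() replaces &, < and > with &amp;, &lt; and &gt;, so the
    # XHTML travels as plain character data inside the <field> element.
    return (
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="content">{escape(xhtml)}</field>'
        "</doc></add>"
    )


original = "<p>Fish &amp; chips</p>"
message = build_add_message("page-1", original)

# Parse the message back to check that it is well-formed XML and to see
# what an XML parser extracts as the text of each field.
fields = ET.fromstring(message).findall("./doc/field")
content_text = fields[1].text
```

Parsing the message back shows that the text of the `content` field round-trips to the original markup, so the entities only exist in the XML on the wire.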