On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:
(Query esp. Adrian):

If you are indexing XHTML, do you replace tags with entities before
giving it to Solr? If so, when you get back snippets, do you get tags
or entities, or do you convert them back to tags for presentation? What's
the best way out? It would help me a lot if you could briefly explain your
configuration.

We happen to develop an HTML editor, so we know for certain that the XHTML is valid XML. Given that, we just throw the raw XHTML at Solr, which uses the HTMLStripWhitespaceTokenizer. However, at this stage we haven't configured highlighting at all, so our index is used only for searching and retrieving a document ID. At some point I'd like to add highlighting, and it sounds like the best way to do so would be to index the document text instead of the HTML.
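
To illustrate what "index the document text instead of the HTML" could look like, here is a minimal sketch that strips markup before the content is handed to Solr. It is not the setup described above: the TextExtractor class and the xhtml_to_text function are hypothetical names, and it uses Python's standard html.parser rather than anything Solr provides.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects character data, skipping script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined with a space so adjacent block elements do not
        # run together; the trade-off is that inline markup in the middle
        # of a word would split that word. Whitespace runs are collapsed.
        return " ".join(" ".join(self._chunks).split())


def xhtml_to_text(xhtml):
    """Return the plain text of an XHTML string (assumed helper name)."""
    parser = TextExtractor()
    parser.feed(xhtml)
    parser.close()  # flush any buffered trailing text
    return parser.text()
```

The resulting string could then be sent to a plain text field, so that any snippets returned later would contain neither tags nor entities.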

Beyond that, we also use Solr as an optimization for extracting information such as what content was most recently changed, which pages link to others, and so on. For page linking, we identify which pages are linked to before indexing and store them in a separate field; Solr itself has no understanding of the linking.
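
As a rough sketch of identifying linked pages before indexing and storing them in a separate field: the message does not say how the links are actually found, so the approach below (collecting href values from anchor tags) is an assumption, and the field names "id", "content" and "links" as well as the LinkCollector and build_document names are hypothetical. The sketch only builds the document; it does not talk to Solr.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers unique href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Keep first-seen order and drop duplicates and empty hrefs.
            if name == "href" and value and value not in self.links:
                self.links.append(value)


def build_document(doc_id, xhtml):
    """Build a dict to index; the field names are assumptions."""
    collector = LinkCollector()
    collector.feed(xhtml)
    collector.close()
    return {
        "id": doc_id,              # document ID returned by searches
        "content": xhtml,          # raw XHTML, stripped by the tokenizer
        "links": collector.links,  # outgoing links as a separate field
    }
```

With a multi-valued field for the links, a query on that field for a given page would return the documents that link to it, without Solr needing to understand the markup.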

Oh and I should note, I'm very new to Solr so I'm probably not doing things the best way, but I'm getting great results anyway.

Regards,

Adrian Sutton
http://www.symphonious.net
