On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:
(Query esp. Adrian):
If you are indexing XHTML, do you replace tags with entities before
giving it to Solr? If so, when you get back snippets, do you get tags
or entities, or do you convert again to tags for presentation? What's
the best way out? It would help me a lot if you could briefly explain
your configuration.
We happen to develop an HTML editor, so we know 100% for certain that
the XHTML is valid XML. Given that, we just throw the raw XHTML at
Solr, which uses the HTMLStripWhitespaceTokenizer. However, at this
stage we haven't configured highlighting at all, so our index is only
used for searching and retrieving a document ID. At some point I'd
like to add highlighting, and it sounds like the best way to do so
would be to index the document text instead of the HTML.
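
For illustration only, here is a minimal sketch of what "index the
document text instead of the HTML" could look like on the client side.
It assumes the XHTML is well-formed XML (as stated above) and contains
no undeclared named entities such as &nbsp;, which the standard-library
parser would reject. The function name xhtml_to_text is hypothetical
and not part of Solr or of the setup described here.

```python
import xml.etree.ElementTree as ET


def xhtml_to_text(xhtml: str) -> str:
    """Return the text content of a well-formed XHTML document.

    Hypothetical helper: assumes the input parses as XML.
    """
    root = ET.fromstring(xhtml)
    # itertext() yields every text node in document order. Joining with a
    # space keeps text from adjacent block elements apart; split()/join()
    # then collapses any run of whitespace into a single space.
    return " ".join(" ".join(root.itertext()).split())


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<h1>Title</h1><p>Hello <b>bold</b> world.</p>"
    "</body></html>"
)
text = xhtml_to_text(sample)
print(text)
```

The resulting plain text could then be sent to Solr in a stored field,
so that any highlighted snippets come back free of markup. One caveat
of this sketch: because text nodes are joined with a space, inline
markup in the middle of a word would split that word in two.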
Beyond that, we also use Solr as an optimization for extracting
information such as what content was most recently changed, which
pages link to others, etc. For the page linking, we identify which
pages are linked to prior to indexing and store them as a separate
field; Solr itself has no understanding of the linking.
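
As a rough sketch of the link handling described above (not the actual
code in use), the link targets can be pulled out of the XHTML before
indexing and attached to the document as their own multi-valued field.
The field names "id", "content" and "links_to", and the helpers
extract_links and build_doc, are assumptions made up for this example;
the real schema may differ.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def extract_links(xhtml: str) -> list:
    """Collect the distinct href values of all <a> elements, in order."""
    root = ET.fromstring(xhtml)
    links = []
    for element in root.iter():
        # Accept <a> both with and without the XHTML namespace.
        if element.tag in ("a", XHTML_NS + "a"):
            href = element.get("href")
            if href and href not in links:
                links.append(href)
    return links


def build_doc(doc_id: str, xhtml: str) -> dict:
    """Build a dict of fields to index (field names are hypothetical)."""
    return {
        "id": doc_id,
        "content": xhtml,                   # raw XHTML, stripped by the tokenizer
        "links_to": extract_links(xhtml),   # precomputed before indexing
    }


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p>See <a href="/about">about</a> and <a href="/contact">contact</a>, '
    'or <a href="/about">about</a> again.</p>'
    "</body></html>"
)
doc = build_doc("page-1", sample)
print(doc["links_to"])
```

With a field like this in place, "which pages link to page X" becomes
an ordinary field query, which matches the point that Solr itself
needs no understanding of the linking.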
Oh, and I should note that I'm very new to Solr, so I'm probably not
doing things the best way, but I'm getting great results anyway.
Regards,
Adrian Sutton
http://www.symphonious.net