I have Solr highlighting working in my project (indexing content for a web app), but highlighting with <em> tags doesn't make sense to me. It seems to open the system up to XSS attacks if you echo search result data (with highlights) into a web page.
Example: Index the following string:

    example of malicious script: <script>alert(1)</script>

Now when I fetch this document from Solr, I will escape it before outputting it, giving me:

    example of malicious script: &lt;script&gt;alert(1)&lt;/script&gt;

But if I turn highlighting on and the highlight is the <em> tag, then when I search for the word "example", escaping the highlighted snippet gives me:

    &lt;em&gt;example&lt;/em&gt; of malicious script: &lt;script&gt;alert(1)&lt;/script&gt;

When a browser displays this, it will literally print <em> tags around the word "example" instead of actually visually emphasizing the word.

Now then, I could escape the text before indexing, but then Solr's index would include words like "lt", "gt", and "amp". I can't put these words on the stopword list because "amp" is a real word that a user might want to search for.

Any errors in my logic? The only thing I can think to do is to change the highlight "pre" and "post" to some non-HTML string and then parse the response to replace those with correct HTML tags. But that's definitely hacky.

Thanks,
Mark
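
For reference, here is a minimal sketch of that last workaround. It assumes the highlighter is configured (for example via Solr's hl.simple.pre and hl.simple.post parameters) to emit plain-text markers instead of <em> tags. The marker strings `[[HL]]` and `[[/HL]]` and the function name `render_snippet` are made up for illustration; any strings without HTML-special characters should work the same way. The snippet is escaped first, and only afterwards are the markers swapped for real tags.

```python
import html

# Hypothetical marker strings, assumed to be configured as the
# highlighter's "pre" and "post" values instead of <em> and </em>.
HL_PRE = "[[HL]]"
HL_POST = "[[/HL]]"


def render_snippet(snippet: str, pre: str = HL_PRE, post: str = HL_POST) -> str:
    """Escape a highlighted snippet, then turn the markers into <em> tags."""
    # Escape everything first, so any markup in the indexed text is neutralized.
    escaped = html.escape(snippet)
    # The markers are escaped as well before matching, in case they contain
    # characters that html.escape rewrites.
    escaped = escaped.replace(html.escape(pre), "<em>")
    return escaped.replace(html.escape(post), "</em>")


# A snippet as it might come back with the markers in place.
raw = "[[HL]]example[[/HL]] of malicious script: <script>alert(1)</script>"
print(render_snippet(raw))
```

One caveat with this sketch: if an indexed document happens to contain the marker strings literally, they will also be turned into <em> tags. That can produce unbalanced emphasis tags but not script injection, since everything else is already escaped.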