Hi, see https://issues.apache.org/jira/browse/SOLR-4686 - this is an ongoing point of contention!
Alan Woodward
www.flax.co.uk

> On 8 Sep 2016, at 09:38, Duck Geraint (ext) GBJH <geraint.d...@syngenta.com> wrote:
>
> As far as I can tell, that is how it's currently set up (it does the same on
> mine at least). The HTML Stripper seems to exclude the pre tag, but include
> the post tag when it generates the start and end offsets of each text token.
> I couldn't say why though... (This may just have avoided needing to
> backtrack.)
>
> Play around in the analysis section of the admin UI to verify this.
>
> Geraint
>
>
> -----Original Message-----
> From: Neumann, Dennis [mailto:neum...@sub.uni-goettingen.de]
> Sent: 07 September 2016 18:16
> To: solr-user@lucene.apache.org
> Subject: RE: Wrong highlighting in stripped HTML field
>
> Hello,
> can anyone confirm this behavior of the highlighter? Otherwise my Solr
> installation might be misconfigured or something.
> Or does anyone know if this is a known issue? In that case I probably should
> ask on the dev mailing list.
>
> Thanks and cheers,
> Dennis
>
>
> ________________________________________
> From: Neumann, Dennis [neum...@sub.uni-goettingen.de]
> Sent: Monday, 5 September 2016 18:00
> To: solr-user@lucene.apache.org
> Subject: Wrong highlighting in stripped HTML field
>
> Hi guys
>
> I am having a problem with the standard highlighter. I'm working with Solr
> 5.4.1. The problem appears in my project, but it is easy to replicate:
>
> I create a new core with the conf directory from configsets/basic_configs, so
> everything is set to defaults.
> I add the following in schema.xml:
>
>
> <field name="testfield" type="mytype" indexed="true" stored="true"
>        required="false" multiValued="false" />
>
> <fieldType name="mytype" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <charFilter class="solr.HTMLStripCharFilterFactory" />
>     <tokenizer class="solr.StandardTokenizerFactory" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory" />
>   </analyzer>
> </fieldType>
>
>
> Now I add this document (in the admin interface):
>
> {"id":"1","testfield":"<span>bla</span>"}
>
> I search for: testfield:bla
> with hl=on&hl.fl=testfield
>
> What I get is a response with an incorrectly formatted HTML snippet:
>
>
> "response": {
>   "numFound": 1,
>   "start": 0,
>   "docs": [
>     {
>       "id": "1",
>       "testfield": "<span>bla</span>",
>       "_version_": 1544645963570741200
>     }
>   ]
> },
> "highlighting": {
>   "1": {
>     "testfield": [
>       "<span><em>bla</span></em>"
>     ]
>   }
> }
>
> Is there a way to tell the highlighter to just enclose the "bla"? I.e. I
> want to get
>
> <span><em>bla</em></span>
>
>
> Best regards
> Dennis
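
The misnested snippet in the report is consistent with the offset behaviour Geraint describes: the token's start offset falls after the opening tag, while its end offset falls after the closing tag. A minimal Python sketch of that arithmetic follows. It is not Solr code; the `highlight` helper is hypothetical, and the offsets 6, 9 and 16 are assumptions inferred from the output shown in the thread rather than taken from the Solr source.

```python
def highlight(raw: str, start: int, end: int,
              pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap raw[start:end] in pre/post markers, as an offset-based highlighter would."""
    return raw[:start] + pre + raw[start:end] + post + raw[end:]


raw = "<span>bla</span>"

# Assumed offsets matching the reported behaviour: the start offset (6) sits
# just after "<span>", but the end offset (16) sits after "</span>".
reported = highlight(raw, 6, 16)

# Offsets that cover only the visible text "bla" (6 to 9) would give the
# nesting Dennis asked for.
desired = highlight(raw, 6, 9)

print(reported)
print(desired)
```

With these assumed offsets the first call reproduces the snippet from the report and the second produces the well-formed variant, which suggests the nesting problem comes from the token's end offset rather than from the highlighter's tag insertion itself.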