Re: Wrong highlighting in stripped HTML field

Alan Woodward Thu, 08 Sep 2016 03:49:05 -0700

Hi, see https://issues.apache.org/jira/browse/SOLR-4686 - this is an ongoing
point of contention!


Alan Woodward
www.flax.co.uk


> On 8 Sep 2016, at 09:38, Duck Geraint (ext) GBJH <geraint.d...@syngenta.com> 
> wrote:
> 
> As far as I can tell, that is how it currently behaves (mine does the same, 
> at least). When the HTML stripper generates the start and end offsets of 
> each text token, it seems to exclude the preceding tag but include the 
> following tag. I couldn't say why, though (perhaps this just avoids the 
> need to backtrack).
> 
> Play around in the analysis section of the admin UI to verify this.
> 
> Geraint
> 
> 
> -----Original Message-----
> From: Neumann, Dennis [mailto:neum...@sub.uni-goettingen.de]
> Sent: 07 September 2016 18:16
> To: solr-user@lucene.apache.org
> Subject: Re: Wrong highlighting in stripped HTML field
> 
> Hello,
> can anyone confirm this behavior of the highlighter? Otherwise my Solr 
> installation might be misconfigured or something.
> Or does anyone know if this is a known issue? In that case I probably should 
> ask on the dev mailing list.
> 
> Thanks and cheers,
> Dennis
> 
> 
> ________________________________________
> From: Neumann, Dennis [neum...@sub.uni-goettingen.de]
> Sent: Monday, 5 September 2016 18:00
> To: solr-user@lucene.apache.org
> Subject: Wrong highlighting in stripped HTML field
> 
> Hi guys
> 
> I am having a problem with the standard highlighter. I'm working with Solr 
> 5.4.1. The problem appears in my project, but it is easy to replicate:
> 
> I create a new core with the conf directory from configsets/basic_configs, so 
> everything is set to defaults. I add the following in schema.xml:
> 
> 
>    <field name="testfield" type="mytype" indexed="true" stored="true" required="false" multiValued="false" />
> 
>    <fieldType name="mytype" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <charFilter class="solr.HTMLStripCharFilterFactory" />
>        <tokenizer class="solr.StandardTokenizerFactory" />
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory" />
>      </analyzer>
>    </fieldType>
> 
> 
> Now I add this document (in the admin interface):
> 
> {"id":"1","testfield":"<span>bla</span>"}
> 
> I search for: testfield:bla
> with hl=on&hl.fl=testfield
> 
> What I get is a response with an incorrectly formatted HTML snippet:
> 
> 
>  "response": {
>    "numFound": 1,
>    "start": 0,
>    "docs": [
>      {
>        "id": "1",
>        "testfield": "<span>bla</span>",
>        "_version_": 1544645963570741200
>      }
>    ]
>  },
>  "highlighting": {
>    "1": {
>      "testfield": [
>        "<span><em>bla</span></em>"
>      ]
>    }
>  }
> 
> Is there a way to tell the highlighter to enclose just the "bla"? That is, 
> I want to get
> 
> <span><em>bla</em></span>
> 
> 
> Best regards
> Dennis
> 
> 
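
The offset behaviour Geraint describes can be illustrated with a small Python sketch. It is not Solr's highlighter: `highlight` is a hypothetical helper, and the offsets (start 6, end 16) are assumptions inferred from the response Dennis posted, not values read from Solr. It only shows that a start offset after the opening tag combined with an end offset after the closing tag would produce the mis-nested snippet.

```python
def highlight(text, start, end, pre="<em>", post="</em>"):
    """Wrap text[start:end] in pre/post markers (hypothetical helper, not Solr code)."""
    return text[:start] + pre + text[start:end] + post + text[end:]


stored = "<span>bla</span>"

# Assumed offsets for the token "bla", inferred from the reported output:
# the start offset skips the opening tag, the end offset includes the closing tag.
start = stored.index("bla")         # 6
reported_end = len(stored)          # 16, i.e. after "</span>"
expected_end = start + len("bla")   # 9, i.e. directly after "bla"

observed = highlight(stored, start, reported_end)
wanted = highlight(stored, start, expected_end)

print(observed)  # <span><em>bla</span></em>
print(wanted)    # <span><em>bla</em></span>
```

The actual offsets for a given field can be checked in the analysis section of the admin UI, as suggested above.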
