Hi,
I'm afraid I've found another slightly odd thing with Highlighting, in this case in a
multi-valued field I'm using for author names.
The author names are typically Surname, initials (e.g. May, A.D.), and these are the kind
of results I'm getting:
authors:Buxton
<?xml version="1.0" encoding="UTF-8"?>
<response>
<responseHeader><status>0</status><QTime>2</QTime></responseHeader>
<result numFound="2" start="0">
<doc>
<arr name="authors"><str>Duncan, W.I.</str><str>Buxton, N.W.K.</str></arr>
</doc>
<doc>
<arr name="authors"><str>Buxton, M.W.N.</str><str>Pedley, H.M.</str></arr>
</doc>
</result>
<lst name="highlighting">
<lst name="geol/jgs/1995/00000152/00000002/15220251">
<arr name="authors">
<str>.<em>Buxton</em>, N.W.K</str>
</arr>
</lst>
<lst name="geol/jgs/1989/00000146/00000005/14650746">
<arr name="authors">
<str><em>Buxton</em>, M.W.N</str>
</arr>
</lst>
</lst>
</response>
So in the first case, where the second author name was matched, the final period has
disappeared, and there's a stray period at the start. In the second case where the first
author name was matched, the final period is also missing, but there's no extra period at
the start.
This pattern is the same for other author searches, which suggests that it's picking up
the last character of the previous value, returning it at the start of the snippet, and
losing the snippet's own last character.
However, some searches on keywords (also multi-valued) seem to suggest that it's not that
simple:
keywords:rock (with maxSnippets=100)
<?xml version="1.0" encoding="UTF-8"?>
<response>
<responseHeader><status>0</status><QTime>2</QTime></responseHeader>
<result numFound="18" start="0">
<doc>
<arr name="keywords"><str>fracture (rock)</str><str>porosity (rock)</str>
<str>permeability (rock)</str><str>nuclear magnetic resonance</str></arr>
</doc>
<doc>
<arr name="keywords"><str>United Kingdom</str><str>Carboniferous</str>
<str>clastie rocks</str><str>coal seams</str><str>sedimentary rocks</str></arr>
</doc>
</result>
<lst name="highlighting">
<lst name="geol/pg/2002/00000008/00000003/art00001">
<arr name="keywords">
<str>fracture (<em>rock</em></str>
<str>)porosity (<em>rock</em></str>
<str>)permeability (<em>rock</em></str>
</arr>
</lst>
<lst name="geol/jgs/1995/00000152/00000005/15250819">
<arr name="keywords">
<str>clastie <em>rocks</em></str>
<str>sedimentary <em>rocks</em></str>
</arr>
</lst>
</lst>
</response>
The first document shows the same behaviour as the author searches, but in the second
one, where there's no punctuation, there are no missing or moved characters (as far as I
can tell, this is true whether the highlight is at the start, the end, or the middle of
the value).
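For what it's worth, here is a toy model that reproduces both sets of results. It assumes the values of the field are joined end to end, and that each snippet is cut right after the last letter or digit of its value, so any trailing punctuation falls into the next snippet. This is only a guess at the behaviour, not taken from the highlighter's code; the name toy_fragments and the regular expression are made up for illustration, and the <em> tags are left out.

```python
import re

def toy_fragments(values):
    """Hypothetical model of the observed snippets (not Solr's actual logic).

    The values are concatenated, and each snippet ends just after the last
    word character of its value. Trailing punctuation therefore moves to the
    start of the following snippet.
    """
    text = "".join(values)
    fragments = []
    start = 0  # where the next snippet begins in the joined text
    pos = 0    # where the current value begins in the joined text
    for value in values:
        # Length of the value once trailing non-word characters are dropped.
        kept = len(re.sub(r"\W+$", "", value))
        cut = pos + kept
        fragments.append(text[start:cut])
        start = cut
        pos += len(value)
    return fragments

print(toy_fragments(["Duncan, W.I.", "Buxton, N.W.K."]))
# ['Duncan, W.I', '.Buxton, N.W.K']
print(toy_fragments(["fracture (rock)", "porosity (rock)", "permeability (rock)"]))
# ['fracture (rock', ')porosity (rock', ')permeability (rock']
print(toy_fragments(["clastie rocks", "sedimentary rocks"]))
# ['clastie rocks', 'sedimentary rocks']
```

If this guess is right, values that end in a letter or digit would come back intact, while values that end in punctuation would lose it to the next snippet, which matches what both searches show.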
Any thoughts? Let me know if I should open a JIRA issue.
Thanks,
Andrew