I'm seeing strange hl.fragsize behavior in Solr 4.6.0, the version I
happen to be using.

I've been testing with this "mp500.xml" file...

http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_6_0/solr/example/exampledocs/mp500.xml?view=markup

... using the query "q=indication" and I get some highlights:

```
$ curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=indication" \
    | jq '.highlighting'
{
  "MA147LL/A": {
    "features": [
      ", Battery level <em>indication</em>"
    ]
  }
}
```

Great! I got a highlight snippet back! But what if I start playing
with "fragsize"? According to
https://wiki.apache.org/solr/HighlightingParameters#hl.fragsize ,
fragsize=0 means "the whole field value should be used with
no fragmenting." And it does:

```
$ curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=indication&hl.fragsize=0" \
    | jq '.highlighting'
{
  "MA147LL/A": {
    "features": [
      "Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level <em>indication</em>"
    ]
  }
}
```

As the docs indicate, fragsize=100 is the default and gives me the
same results as we saw above when we left out fragsize:

```
$ curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=indication&hl.fragsize=100" \
    | jq '.highlighting'
{
  "MA147LL/A": {
    "features": [
      ", Battery level <em>indication</em>"
    ]
  }
}
```

But wait a minute... fragsize is defined as "the size, in characters,
of the snippets (aka fragments) created by the highlighter". Is that
really 100 characters? More like 27 if I strip out the HTML tags:

```
$ echo -n ", Battery level <em>indication</em>" | awk '{gsub("<[^>]*>", "")}1'
, Battery level indication
$ echo -n ", Battery level <em>indication</em>" | awk '{gsub("<[^>]*>", "")}1' | wc -c
      27
```
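
One caveat on these counts: awk's print appends a newline and wc -c counts it, so the figures I quote in this message are probably one higher than the visible snippet length. Here is a minimal sketch that measures the snippet without the trailing newline. The `snippet_len` helper is a name I made up for illustration; it is not part of Solr or any other tool:

```shell
# Hypothetical helper: strip HTML-style tags from a snippet and print its
# length in characters, without counting any trailing newline.
snippet_len() {
  # Command substitution drops the trailing newline that sed may emit.
  snippet_len_stripped=$(printf '%s' "$1" | sed 's/<[^>]*>//g')
  printf '%s\n' "${#snippet_len_stripped}"
}

snippet_len ", Battery level <em>indication</em>"   # prints 26
```

The off-by-one doesn't change the question: the snippet is still far shorter than the 100 characters requested.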

So that's weird. I ask for 100 characters but only get 27?

Let's try asking for 110 characters:

```
$ curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=indication&hl.fragsize=110" \
    | jq '.highlighting'
{
  "MA147LL/A": {
    "features": [
      ", Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level <em>indication</em>"
    ]
  }
}
```

That's better. With fragsize=110 we got back a snippet of 121
characters. But why did we only get back 27 characters from
fragsize=100?

Here's something else that's strange. With fragsize=120 I get back
*fewer* characters than fragsize=110. Only 108 characters back rather
than 121:

```
$ curl -s "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=indication&hl.fragsize=120" \
    | jq '.highlighting'
{
  "MA147LL/A": {
    "features": [
      " firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level <em>indication</em>"
    ]
  }
}
```

As I increase the fragsize shouldn't I get *more* characters back? And
again, why do I only get 27 characters back from fragsize=100?
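
For anyone who wants to reproduce this, here is a rough sketch of a loop that requests several fragsize values and prints the tag-stripped length of each snippet. It assumes the same local example instance and query used above, plus jq 1.5 or later (for `gsub`). The names `solr_url` and `sweep_fragsize` are ones I made up for illustration:

```shell
# Hypothetical helper: build the select URL used in this message for a
# given hl.fragsize value (host, core, and query are the ones shown above).
solr_url() {
  printf '%s\n' "http://localhost:8983/solr/collection1/select?wt=json&indent=true&hl=true&hl.fl=*&q=indication&hl.fragsize=$1"
}

# Hypothetical helper: for each fragsize argument, fetch the highlights and
# print the length of every snippet after removing tags such as <em>.
sweep_fragsize() {
  for size in "$@"; do
    # .highlighting[][][] walks doc id -> field name -> snippet strings.
    lengths=$(curl -s "$(solr_url "$size")" \
      | jq -r '.highlighting[][][] | gsub("<[^>]*>"; "") | length')
    printf 'fragsize=%s -> %s\n' "$size" "$lengths"
  done
}

# Example call (needs a running Solr instance, so it is left commented out):
# sweep_fragsize 0 100 110 120
```

Running the sweep over a wider range of values should show whether the snippet length ever grows steadily with fragsize.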

I'm concerned about this because my fix for
https://github.com/IQSS/dataverse/issues/2191 is to make fragsize
configurable, but I'm getting such unexpected results from
different fragsize values that I'm losing faith in it. We use highlighting
heavily to indicate where in the document a query matched. To be
clear, I haven't lost faith in Solr itself. It's a great project. I'm
just trying to understand what's going on above.

Any advice is welcome!

Phil

p.s. In case it's more readable, I also posted this (long) email as a
gist: https://gist.github.com/pdurbin/1a7b55e5714b7424fa94

-- 
Philip Durbin
Software Developer for http://dataverse.org
http://www.iq.harvard.edu/people/philip-durbin
