maxAnalyzedChars did it! I wasn't setting that param, and I'm working with some very long documents. I also made the hl.fl param formatting change that you suggested, Aloke.
Thanks again!

- Eric

On Sep 11, 2013, at 3:10 AM, Eric O'Hanlon <elo2...@columbia.edu> wrote:

> Thank you, Aloke and Bryan! I'll give this a try and I'll report back on
> what happens!
>
> - Eric
>
> On Sep 9, 2013, at 2:32 AM, Aloke Ghoshal <alghos...@gmail.com> wrote:
>
>> Hi Eric,
>>
>> As Bryan suggests, you should look at appropriately setting up the
>> fragSize & maxAnalyzedChars for long documents.
>>
>> One issue I find with your search request is that in trying to
>> highlight across three separate fields, you have added each of them as
>> a separate request param:
>> hl.fl=contents&hl.fl=title&hl.fl=original_url
>>
>> The way to do it would be
>> (http://wiki.apache.org/solr/HighlightingParameters#hl.fl) to pass
>> them as values to one comma (or space) separated field:
>> hl.fl=contents,title,original_url
>>
>> Regards,
>> Aloke
>>
>> On 9/9/13, Bryan Loofbourrow <bloofbour...@knowledgemosaic.com> wrote:
>>> Eric,
>>>
>>> Your example document is quite long. Are you setting hl.maxAnalyzedChars?
>>> If you don't, the highlighter you appear to be using will not look past
>>> the first 51,200 characters of the document for snippet candidates.
>>>
>>> http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars
>>>
>>> -- Bryan
>>>
>>>
>>>> -----Original Message-----
>>>> From: Eric O'Hanlon [mailto:elo2...@columbia.edu]
>>>> Sent: Sunday, September 08, 2013 2:01 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Some highlighted snippets aren't being returned
>>>>
>>>> Hi again Everyone,
>>>>
>>>> I didn't get any replies to this, so I thought I'd re-send in case anyone
>>>> missed it and has any thoughts.
>>>>
>>>> Thanks,
>>>> Eric
>>>>
>>>> On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon <elo2...@columbia.edu> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I'm facing an issue in which my solr query is returning highlighted
>>>>> snippets for some, but not all results.
>>>>> For reference, I'm searching through an index that contains web crawls
>>>>> of human-rights-related websites. I'm running solr as a webapp under
>>>>> Tomcat and I've included the query's solr params from the Tomcat log:
>>>>>
>>>>> ...
>>>>> webapp=/solr-4.2
>>>>> path=/select
>>>>> params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=<code>&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_yyyy.facet.limit=6&group.field=original_url&hl.simple.post=</code>&facet.field=domain&facet.field=date_of_capture_yyyy&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true} hits=8 status=0 QTime=108
>>>>> ...
>>>>>
>>>>> For the query above (which can be simplified to say: find all documents
>>>>> that contain the word "unangan" and return facets, highlights, etc.), I
>>>>> get five search results. Only three of these are returning highlighted
>>>>> snippets.
>>>>> Here's the "highlighting" portion of the solr response (note:
>>>>> printed in ruby notation because I'm receiving this response in a Rails
>>>>> app):
>>>>>
>>>>> --------
>>>>> "highlighting"=>
>>>>> {"20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
>>>>>   {},
>>>>>  "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
>>>>>   {},
>>>>>  "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
>>>>>   {},
>>>>>  "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
>>>>>   {"contents"=>
>>>>>     ["...actual snippet is returned here..."]},
>>>>>  "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
>>>>>   {"contents"=>
>>>>>     ["...actual snippet is returned here..."]},
>>>>>  "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999"=>
>>>>>   {"contents"=>
>>>>>     ["...actual snippet is returned here..."]},
>>>>>  "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw"=>
>>>>>   {"contents"=>
>>>>>     ["...actual snippet is returned here..."]},
>>>>>  "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf"=>
>>>>>   {}}
>>>>> --------
>>>>>
>>>>> I have eight (as opposed to five) results above because I'm also doing a
>>>>> grouped query, grouping by a field called "original_url", and this leads
>>>>> to five grouped results.
>>>>>
>>>>> I've confirmed that my highlight-lacking results DO contain the word
>>>>> "unangan", as expected, and this term is appearing in a text field that's
>>>>> indexed and stored, and being searched for all text searches.
For >>>> example, one of the search results is for a crawl of this document: >>>> >>> http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.p >>>> df >>>>> >>>>> And if you view that document on the web, you'll see that it does >>>> contain "unangan". >>>>> >>>>> Has anyone seen this before? And does anyone have any good >>> suggestions >>>> for troubleshooting/fixing the problem? >>>>> >>>>> Thanks! >>>>> >>>>> - Eric >>> >> > >