Re: Slow Highlighter Performance Even Using FastVectorHighlighter

Floyd Wu Mon, 17 Jun 2013 06:40:29 -0700

Hi Michael, how do I configure the PostingsHighlighter on my Solr 4.2 box?
Please kindly point me in the right direction. Many thanks.
On 2013/6/15 at 10:48 PM, "Michael McCandless" <luc...@mikemccandless.com> wrote:


> You could also try the new[ish] PostingsHighlighter:
>
> http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
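For Solr 4.x, a minimal sketch of how this might be wired in, assuming the PostingsSolrHighlighter class that, as far as I know, ships with Solr from 4.1 on. PostingsHighlighter reads offsets from the postings lists rather than from term vectors, so each highlighted field would need storeOffsetsWithPositions="true" and a reindex. The field shown is the "content" field from the schema quoted further down; the available options differ between 4.x releases, so treat this as a starting point and check the javadoc for your version.

```xml
<!-- schema.xml: offsets must be stored in the postings for each highlighted field -->
<field name="content" type="text_general" indexed="true" stored="true"
       multiValued="true" storeOffsetsWithPositions="true"/>

<!-- solrconfig.xml: swap the default highlighter implementation -->
<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
</searchComponent>
```

Requests would still pass hl=true and hl.fl as before; the FVH-specific parameters in the quoted handler would presumably no longer apply.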
>
> On Sat, Jun 15, 2013 at 8:50 AM, Michael Sokolov
> <msoko...@safaribooksonline.com> wrote:
> > If you have very large documents (many MB) that can lead to slow
> > highlighting, even with FVH.
> >
> > See https://issues.apache.org/jira/browse/LUCENE-3234
> >
> > and try setting phraseLimit=1 (or some bigger number, but not infinite,
> > which is the default)
> >
> > -Mike
> >
> >
> >
> > On 6/14/13 4:52 PM, Andy Brown wrote:
> >>
> >> Bryan,
> >>
> >> For specifics, I'll refer you back to my original email where I
> >> specified all the fields/field types/handlers I use. Here's a general
> >> overview.
> >>   I really only have 3 fields that I index and search against: "name",
> >> "description", and "content". All of which are just general text
> >> (string) fields. I have a catch-all field called "text" that is only
> >> used for querying. It's indexed but not stored. The "name",
> >> "description", and "content" fields are copied into the "text" field.
> >>   For partial word matching, I have 4 more fields: "name_par",
> >> "description_par", "content_par", and "text_par". The "text_par" field
> >> has the same relationship to the "*_par" fields as "text" does to the
> >> others (only used for querying). Those partial word matching fields are
> >> of type "text_general_partial" which I created. That field type is
> >> analyzed differently from the regular text field in that it goes through
> >> an EdgeNGramFilterFactory with the minGramSize="2" and maxGramSize="7"
> >> at index time.
> >>   I query against both "text" and "text_par" fields using the edismax
> >> defType with my qf set to "text^2 text_par^1" to give full word matches a higher
> >> score. This part returns back very fast as previously stated. It's when
> >> I turn on highlighting that I take the huge performance hit.
> >>   Again, I'm using the FastVectorHighlighter. The hl.fl is set to "name
> >> name_par description description_par content content_par" so that it
> >> returns highlights for full and partial word matches. All of those
> >> fields have indexed, stored, termPositions, termVectors, and termOffsets
> >> set to "true".
> >>   It all seems redundant just to allow for partial word
> >> matching/highlighting but I didn't know of a better way. Does anything
> >> stand out to you that could be the culprit? Let me know if you need any
> >> more clarification.
> >>   Thanks!
> >>   - Andy
> >>
> >> -----Original Message-----
> >> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> >> Sent: Wednesday, May 29, 2013 5:44 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: RE: Slow Highlighter Performance Even Using
> >> FastVectorHighlighter
> >>
> >> Andy,
> >>
> >>> I don't understand why it's taking 7 secs to return highlights. The size
> >>> of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
> >>> 1024 for this verification purpose and that should be more than enough.
> >>> The processor is plenty powerful enough as well.
> >>>
> >>> Running VisualVM shows all my CPU time being taken by mainly these 3
> >>> methods:
> >>>
> >>> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> >>> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> >>> org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
> >>
> >> That is a strange and interesting set of things to be spending most of
> >> your CPU time on. The implication, I think, is that the number of term
> >> matches in the document for terms in your query (or, at least, terms
> >> matching exact words or the beginning of phrases in your query) is
> >> extremely high. Perhaps that's coming from this "partial word match" you
> >> mention -- how does that work?
> >>
> >> -- Bryan
> >>
> >>> My guess is that this has something to do with how I'm handling partial
> >>> word matches/highlighting. I have set up another request handler that
> >>> only searches the whole word fields and it returns in 850 ms with
> >>> highlighting.
> >>>
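For comparison, a handler like that would presumably look something like the sketch below. The actual handler was not posted, so the name /select_whole is made up for illustration; the rest is taken from the /select defaults in the original message with the *_par fields dropped.

```xml
<!-- Hypothetical handler name; searches and highlights whole-word fields only -->
<requestHandler name="/select_whole" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">text</str>
    <bool name="hl.useFastVectorHighlighter">true</bool>
    <str name="hl.fl">name description content</str>
  </lst>
</requestHandler>
```

Timing the same query against both handlers would isolate how much of the highlighting cost comes from the edge n-gram fields.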
> >>> Any ideas?
> >>>
> >>> - Andy
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> >>> Sent: Monday, May 20, 2013 1:39 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: RE: Slow Highlighter Performance Even Using
> >>> FastVectorHighlighter
> >>>
> >>> My guess is that the problem is those 200 MB documents.
> >>> FastVectorHighlighter is fast at deciding whether a match, especially a
> >>> phrase, appears in a document, but it still starts out by walking the
> >>> entire list of term vectors, and ends by breaking the document into
> >>> candidate-snippet fragments, both processes that are proportional to the
> >>> length of the document.
> >>>
> >>> It's hard to do much about the first, but for the second you could choose
> >>> to expose FastVectorHighlighter's FieldPhraseList representation, and
> >>> return offsets to the caller rather than fragments, building up your own
> >>> snippets from a separate store of indexed files. This would also permit
> >>> you to set stored="false", improving your memory/core size ratio, which
> >>> I'm guessing could use some improving. It would require some work, and it
> >>> would require you to store a representation of what was indexed outside
> >>> the Solr core, in some constant-bytes-to-character representation that you
> >>> can use offsets with (e.g. UTF-16, or ASCII+entity references).
> >>>
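On the schema side, that option would amount to something like the sketch below for the "content" field. This is an assumption based on the schema posted in the original message: term vectors with positions and offsets stay on so the offsets remain available, but with stored="false" the built-in highlighters can no longer return snippets, so the snippet-building code outside Solr described above would have to exist first.

```xml
<!-- Sketch only: keeps term vector offsets, drops the stored copy of the text -->
<field name="content" type="text_general" indexed="true" stored="false"
       multiValued="true" termVectors="true" termPositions="true"
       termOffsets="true"/>
```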
> >>> However, you may not need to do this -- it may be that you just need more
> >>> memory for your search machine. Not JVM memory, but memory that the O/S
> >>> can use as a file cache. What do you have now? That is, how much memory do
> >>> you have that is not used by the JVM or other apps, and how big is your
> >>> Solr core?
> >>>
> >>> One way to start getting a handle on where time is being spent is to set
> >>> up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight
> >>> queries, and look at where the time is being spent. If it's mostly in
> >>> methods that are just reading from disk, buy more memory. If you're on
> >>> Linux, look at what top is telling you. If the CPU usage is low and the
> >>> "wa" number is above 1% more often than not, buy more memory (I don't know
> >>> why that wa number makes sense, I just know that it has been a good rule
> >>> of thumb for us).
> >>>
> >>> -- Bryan
> >>>
> >>>> -----Original Message-----
> >>>> From: Andy Brown [mailto:andy_br...@rhoworld.com]
> >>>> Sent: Monday, May 20, 2013 9:53 AM
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Slow Highlighter Performance Even Using FastVectorHighlighter
> >>>>
> >>>> I'm providing a search feature in a web app that searches for documents
> >>>> that range in size from 1KB to 200MB of varying MIME types (PDF, DOC,
> >>>> etc). Currently there are about 3000 documents and this will continue to
> >>>> grow. I'm providing full word search and partial word search. For each
> >>>> document, there are three source fields that I'm interested in searching
> >>>> and highlighting on: name, description, and content. Since I'm providing
> >>>> both full and partial word search, I've created additional fields that
> >>>> get tokenized differently: name_par, description_par, and content_par.
> >>>> Those are indexed and stored as well for querying and highlighting. As
> >>>> suggested in the Solr wiki, I've got two catch all fields text and
> >>>> text_par for faster querying.
> >>>>
> >>>> An average search results page displays 25 results and I provide paging.
> >>>> I'm just returning the doc ID in my Solr search results and response
> >>>> times have been quite good (1 to 10 ms). The problem in performance
> >>>> occurs when I turn on highlighting. I'm already using the
> >>>> FastVectorHighlighter and depending on the query, it has taken as long
> >>>> as 15 seconds to get the highlight snippets. However, this isn't always
> >>>> the case. Certain query terms result in 1 sec or less response time. In
> >>>> any case, 15 seconds is way too long.
> >>>>
> >>>> I'm fairly new to Solr but I've spent days coming up with what I've got
> >>>> so far. Feel free to correct any misconceptions I have. Can anyone
> >>>> advise me on what I'm doing wrong or offer a better way to set up my core
> >>>> to improve highlighting performance?
> >>>>
> >>>> A typical query would look like:
> >>>> /select?q=foo&start=0&rows=25&fl=id&hl=true
> >>>>
> >>>> I'm using Solr 4.1. Below are the relevant core schema and config details:
> >>>>
> >>>> <!-- Misc fields -->
> >>>> <field name="_version_" type="long" indexed="true" stored="true"/>
> >>>> <field name="id" type="string" indexed="true" stored="true"
> >>>> required="true" multiValued="false"/>
> >>>>
> >>>>
> >>>> <!-- Fields for whole word matches -->
> >>>> <field name="name" type="text_general" indexed="true" stored="true"
> >>>> multiValued="true" termPositions="true" termVectors="true"
> >>>> termOffsets="true"/>
> >>>> <field name="description" type="text_general" indexed="true"
> >>>> stored="true" multiValued="true" termPositions="true" termVectors="true"
> >>>> termOffsets="true"/>
> >>>> <field name="content" type="text_general" indexed="true" stored="true"
> >>>> multiValued="true" termPositions="true" termVectors="true"
> >>>> termOffsets="true"/>
> >>>> <field name="text" type="text_general" indexed="true" stored="false"
> >>>> multiValued="true"/>
> >>>>
> >>>> <!-- Fields for partial word matches -->
> >>>> <field name="name_par" type="text_general_partial" indexed="true"
> >>>> stored="true" multiValued="true" termPositions="true" termVectors="true"
> >>>> termOffsets="true"/>
> >>>> <field name="description_par" type="text_general_partial" indexed="true"
> >>>> stored="true" multiValued="true" termPositions="true" termVectors="true"
> >>>> termOffsets="true"/>
> >>>> <field name="content_par" type="text_general_partial" indexed="true"
> >>>> stored="true" multiValued="true" termPositions="true" termVectors="true"
> >>>> termOffsets="true"/>
> >>>> <field name="text_par" type="text_general_partial" indexed="true"
> >>>> stored="false" multiValued="true"/>
> >>>>
> >>>>
> >>>> <!-- Copy source name, description, and content fields to name_par,
> >>>> description_par, and content_par for partial word searches -->
> >>>> <copyField source="name" dest="name_par"/>
> >>>> <copyField source="description" dest="description_par"/>
> >>>> <copyField source="content" dest="content_par"/>
> >>>>
> >>>> <!-- Copy source name, description, and content fields to catch-all text
> >>>> field for faster querying. -->
> >>>> <copyField source="name" dest="text"/>
> >>>> <copyField source="description" dest="text"/>
> >>>> <copyField source="content" dest="text"/>
> >>>>
> >>>> <!-- Copy source name, description, and content fields to catch-all
> >>>> text_par field for faster querying of partial word searches. -->
> >>>> <copyField source="name" dest="text_par"/>
> >>>> <copyField source="description" dest="text_par"/>
> >>>> <copyField source="content" dest="text_par"/>
> >>>>
> >>>> <!-- A text field for whole word matches -->
> >>>> <fieldType name="text_general" class="solr.TextField"
> >>>> positionIncrementGap="100">
> >>>>    <analyzer type="index">
> >>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>> words="stopwords.txt" enablePositionIncrements="true" />
> >>>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>>    </analyzer>
> >>>>    <analyzer type="query">
> >>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>> words="stopwords.txt" enablePositionIncrements="true" />
> >>>>      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>> ignoreCase="true" expand="true"/>
> >>>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>>     </analyzer>
> >>>>   </fieldType>
> >>>>
> >>>> <!-- A text field for partial matches -->
> >>>> <fieldType name="text_general_partial" class="solr.TextField"
> >>>> positionIncrementGap="100">
> >>>>    <analyzer type="index">
> >>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>> words="stopwords.txt" enablePositionIncrements="true" />
> >>>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>>         <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
> >>>> maxGramSize="7"/>
> >>>>    </analyzer>
> >>>>    <analyzer type="query">
> >>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>> words="stopwords.txt" enablePositionIncrements="true" />
> >>>>      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>> ignoreCase="true" expand="true"/>
> >>>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>>    </analyzer>
> >>>> </fieldType>
> >>>>
> >>>>
> >>>>
> >>>> <requestHandler name="/select" class="solr.SearchHandler">
> >>>>      <!-- default values for query parameters can be specified, these
> >>>> will be overridden by parameters in the request. -->
> >>>>       <lst name="defaults">
> >>>>         <str name="echoParams">explicit</str>
> >>>>         <int name="rows">10</int>
> >>>>         <str name="df">text</str>
> >>>>            <str name="defType">edismax</str>
> >>>>            <str name="qf">text^2 text_par^1</str>   <!-- Boost whole
> >>>> word matches more than partial matches in the scoring. -->
> >>>>            <bool name="termVectors">true</bool>
> >>>>         <bool name="termPositions">true</bool>
> >>>>         <bool name="termOffsets">true</bool>
> >>>>         <bool name="hl.useFastVectorHighlighter">true</bool>
> >>>>         <str name="hl.boundaryScanner">breakIterator</str>
> >>>>         <str name="hl.snippets">2</str>
> >>>>            <str name="hl.fl">name name_par description description_par
> >>>> content content_par</str>
> >>>>         <int name="hl.fragsize">162</int>
> >>>>            <str name="hl.fragListBuilder">simple</str>
> >>>>         <str name="hl.fragmentsBuilder">default</str>
> >>>>         <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
> >>>>         <str name="hl.simple.post"><![CDATA[</strong>]]></str>
> >>>>            <str name="hl.tag.pre"><![CDATA[<strong>]]></str>
> >>>>         <str name="hl.tag.post"><![CDATA[</strong>]]></str>
> >>>>       </lst>
> >>>>   </requestHandler>
> >>>>
> >>>>
> >>>> Cheers!
> >>>>
> >>>> - Andy
> >
> >
>
