Hi Michael,

How do I configure PostingsHighlighter on my Solr 4.2 box? Could you please point me in the right direction?

Many thanks.

On 2013/6/15 at 10:48 PM, "Michael McCandless" <luc...@mikemccandless.com> wrote:
> You could also try the new[ish] PostingsHighlighter:
>
> http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sat, Jun 15, 2013 at 8:50 AM, Michael Sokolov
> <msoko...@safaribooksonline.com> wrote:
> > If you have very large documents (many MB) that can lead to slow
> > highlighting, even with FVH.
> >
> > See https://issues.apache.org/jira/browse/LUCENE-3234
> >
> > and try setting phraseLimit=1 (or some bigger number, but not infinite,
> > which is the default)
> >
> > -Mike
> >
> >
> >
> > On 6/14/13 4:52 PM, Andy Brown wrote:
> >>
> >> Bryan,
> >>
> >> For specifics, I'll refer you back to my original email where I
> >> specified all the fields/field types/handlers I use. Here's a general
> >> overview.
> >> I really only have 3 fields that I index and search against: "name",
> >> "description", and "content". All of which are just general text
> >> (string) fields. I have a catch-all field called "text" that is only
> >> used for querying. It's indexed but not stored. The "name",
> >> "description", and "content" fields are copied into the "text" field.
> >> For partial word matching, I have 4 more fields: "name_par",
> >> "description_par", "content_par", and "text_par". The "text_par" field
> >> has the same relationship to the "*_par" fields as "text" does to the
> >> others (only used for querying). Those partial word matching fields are
> >> of type "text_general_partial" which I created. That field type is
> >> analyzed differently than the regular text field in that it goes through
> >> an EdgeNGramFilterFactory with the minGramSize="2" and maxGramSize="7"
> >> at index time.
> >> I query against both "text" and "text_par" fields using edismax deftype
> >> with my qf set to "text^2 text_par^1" to give full word matches a higher
> >> score. This part returns back very fast as previously stated.
> >> It's when I turn on highlighting that I take the huge performance hit.
> >> Again, I'm using the FastVectorHighlighter. The hl.fl is set to "name
> >> name_par description description_par content content_par" so that it
> >> returns highlights for full and partial word matches. All of those
> >> fields have indexed, stored, termPositions, termVectors, and termOffsets
> >> set to "true".
> >> It all seems redundant just to allow for partial word
> >> matching/highlighting but I didn't know of a better way. Does anything
> >> stand out to you that could be the culprit? Let me know if you need any
> >> more clarification.
> >> Thanks!
> >> - Andy
> >>
> >> -----Original Message-----
> >> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> >> Sent: Wednesday, May 29, 2013 5:44 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: RE: Slow Highlighter Performance Even Using
> >> FastVectorHighlighter
> >>
> >> Andy,
> >>
> >>> I don't understand why it's taking 7 secs to return highlights. The
> >>> size of the index is only 20.93 MB. The JVM heap Xms and Xmx are both
> >>> set to 1024 for this verification purpose and that should be more than
> >>> enough. The processor is plenty powerful enough as well.
> >>>
> >>> Running VisualVM shows all my CPU time being taken by mainly these 3
> >>> methods:
> >>>
> >>> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> >>> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> >>> org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
> >>
> >> That is a strange and interesting set of things to be spending most of
> >> your CPU time on.
> >> The implication, I think, is that the number of term
> >> matches in the document for terms in your query (or, at least, terms
> >> matching exact words or the beginning of phrases in your query) is
> >> extremely high. Perhaps that's coming from this "partial word match"
> >> you mention -- how does that work?
> >>
> >> -- Bryan
> >>
> >>> My guess is that this has something to do with how I'm handling
> >>> partial word matches/highlighting. I have set up another request
> >>> handler that only searches the whole word fields and it returns in
> >>> 850 ms with highlighting.
> >>>
> >>> Any ideas?
> >>>
> >>> - Andy
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> >>> Sent: Monday, May 20, 2013 1:39 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: RE: Slow Highlighter Performance Even Using
> >>> FastVectorHighlighter
> >>>
> >>> My guess is that the problem is those 200 MB documents.
> >>> FastVectorHighlighter is fast at deciding whether a match, especially
> >>> a phrase, appears in a document, but it still starts out by walking
> >>> the entire list of term vectors, and ends by breaking the document
> >>> into candidate-snippet fragments, both processes that are
> >>> proportional to the length of the document.
> >>>
> >>> It's hard to do much about the first, but for the second you could
> >>> choose to expose FastVectorHighlighter's FieldPhraseList
> >>> representation, and return offsets to the caller rather than
> >>> fragments, building up your own snippets from a separate store of
> >>> indexed files. This would also permit you to set stored="false",
> >>> improving your memory/core size ratio, which I'm guessing could use
> >>> some improving.
> >>> It would require some work, and it would require you to store a
> >>> representation of what was indexed outside the Solr core, in some
> >>> constant-bytes-to-character representation that you can use offsets
> >>> with (e.g. UTF-16, or ASCII+entity references).
> >>>
> >>> However, you may not need to do this -- it may be that you just need
> >>> more memory for your search machine. Not JVM memory, but memory that
> >>> the O/S can use as a file cache. What do you have now? That is, how
> >>> much memory do you have that is not used by the JVM or other apps,
> >>> and how big is your Solr core?
> >>>
> >>> One way to start getting a handle on where time is being spent is to
> >>> set up VisualVM. Turn on CPU sampling, send in a bunch of the slow
> >>> highlight queries, and look at where the time is being spent. If it's
> >>> mostly in methods that are just reading from disk, buy more memory.
> >>> If you're on Linux, look at what top is telling you. If the CPU usage
> >>> is low and the "wa" number is above 1% more often than not, buy more
> >>> memory (I don't know why that wa number makes sense, I just know that
> >>> it has been a good rule of thumb for us).
> >>>
> >>> -- Bryan
> >>>
> >>>> -----Original Message-----
> >>>> From: Andy Brown [mailto:andy_br...@rhoworld.com]
> >>>> Sent: Monday, May 20, 2013 9:53 AM
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Slow Highlighter Performance Even Using
> >>>> FastVectorHighlighter
> >>>>
> >>>> I'm providing a search feature in a web app that searches for
> >>>> documents that range in size from 1KB to 200MB of varying MIME types
> >>>> (PDF, DOC, etc). Currently there are about 3000 documents and this
> >>>> will continue to grow.
> >>>> I'm providing full word search and partial word search. For each
> >>>> document, there are three source fields that I'm interested in
> >>>> searching and highlighting on: name, description, and content. Since
> >>>> I'm providing both full and partial word search, I've created
> >>>> additional fields that get tokenized differently: name_par,
> >>>> description_par, and content_par. Those are indexed and stored as
> >>>> well for querying and highlighting. As suggested in the Solr wiki,
> >>>> I've got two catch all fields text and text_par for faster querying.
> >>>>
> >>>> An average search results page displays 25 results and I provide
> >>>> paging. I'm just returning the doc ID in my Solr search results and
> >>>> response times have been quite good (1 to 10 ms). The problem in
> >>>> performance occurs when I turn on highlighting. I'm already using the
> >>>> FastVectorHighlighter and depending on the query, it has taken as
> >>>> long as 15 seconds to get the highlight snippets. However, this isn't
> >>>> always the case. Certain query terms result in 1 sec or less response
> >>>> time. In any case, 15 seconds is way too long.
> >>>>
> >>>> I'm fairly new to Solr but I've spent days coming up with what I've
> >>>> got so far. Feel free to correct any misconceptions I have. Can
> >>>> anyone advise me on what I'm doing wrong or offer a better way to
> >>>> setup my core to improve highlighting performance?
> >>>>
> >>>> A typical query would look like:
> >>>> /select?q=foo&start=0&rows=25&fl=id&hl=true
> >>>>
> >>>> I'm using Solr 4.1.
> >>>> Below are the relevant core schema and config details:
> >>>>
> >>>> <!-- Misc fields -->
> >>>> <field name="_version_" type="long" indexed="true" stored="true"/>
> >>>> <field name="id" type="string" indexed="true" stored="true"
> >>>>        required="true" multiValued="false"/>
> >>>>
> >>>>
> >>>> <!-- Fields for whole word matches -->
> >>>> <field name="name" type="text_general" indexed="true" stored="true"
> >>>>        multiValued="true" termPositions="true" termVectors="true"
> >>>>        termOffsets="true"/>
> >>>> <field name="description" type="text_general" indexed="true"
> >>>>        stored="true" multiValued="true" termPositions="true"
> >>>>        termVectors="true" termOffsets="true"/>
> >>>> <field name="content" type="text_general" indexed="true"
> >>>>        stored="true" multiValued="true" termPositions="true"
> >>>>        termVectors="true" termOffsets="true"/>
> >>>> <field name="text" type="text_general" indexed="true" stored="false"
> >>>>        multiValued="true"/>
> >>>>
> >>>> <!-- Fields for partial word matches -->
> >>>> <field name="name_par" type="text_general_partial" indexed="true"
> >>>>        stored="true" multiValued="true" termPositions="true"
> >>>>        termVectors="true" termOffsets="true"/>
> >>>> <field name="description_par" type="text_general_partial"
> >>>>        indexed="true" stored="true" multiValued="true"
> >>>>        termPositions="true" termVectors="true" termOffsets="true"/>
> >>>> <field name="content_par" type="text_general_partial" indexed="true"
> >>>>        stored="true" multiValued="true" termPositions="true"
> >>>>        termVectors="true" termOffsets="true"/>
> >>>> <field name="text_par" type="text_general_partial" indexed="true"
> >>>>        stored="false" multiValued="true"/>
> >>>>
> >>>>
> >>>> <!-- Copy source name, description, and content fields to name_par,
> >>>>      description_par, and content_par for partial word searches -->
> >>>> <copyField source="name"
> >>>>            dest="name_par"/>
> >>>> <copyField source="description" dest="description_par"/>
> >>>> <copyField source="content" dest="content_par"/>
> >>>>
> >>>> <!-- Copy source name, description, and content fields to catch-all
> >>>>      text field for faster querying. -->
> >>>> <copyField source="name" dest="text"/>
> >>>> <copyField source="description" dest="text"/>
> >>>> <copyField source="content" dest="text"/>
> >>>>
> >>>> <!-- Copy source name, description, and content fields to catch-all
> >>>>      text_par field for faster querying of partial word searches. -->
> >>>> <copyField source="name" dest="text_par"/>
> >>>> <copyField source="description" dest="text_par"/>
> >>>> <copyField source="content" dest="text_par"/>
> >>>>
> >>>> <!-- A text field for whole word matches -->
> >>>> <fieldType name="text_general" class="solr.TextField"
> >>>>            positionIncrementGap="100">
> >>>>   <analyzer type="index">
> >>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>>             words="stopwords.txt" enablePositionIncrements="true" />
> >>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>   </analyzer>
> >>>>   <analyzer type="query">
> >>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>>             words="stopwords.txt" enablePositionIncrements="true" />
> >>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>>             ignoreCase="true" expand="true"/>
> >>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>   </analyzer>
> >>>> </fieldType>
> >>>>
> >>>> <!-- A text field for partial matches -->
> >>>> <fieldType name="text_general_partial" class="solr.TextField"
> >>>>            positionIncrementGap="100">
> >>>>   <analyzer type="index">
> >>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>>             words="stopwords.txt"
> >>>>             enablePositionIncrements="true" />
> >>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
> >>>>             maxGramSize="7"/>
> >>>>   </analyzer>
> >>>>   <analyzer type="query">
> >>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>>             words="stopwords.txt" enablePositionIncrements="true" />
> >>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>>             ignoreCase="true" expand="true"/>
> >>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>   </analyzer>
> >>>> </fieldType>
> >>>>
> >>>>
> >>>>
> >>>> <requestHandler name="/select" class="solr.SearchHandler">
> >>>>   <!-- default values for query parameters can be specified, these
> >>>>        will be overridden by parameters in the request. -->
> >>>>   <lst name="defaults">
> >>>>     <str name="echoParams">explicit</str>
> >>>>     <int name="rows">10</int>
> >>>>     <str name="df">text</str>
> >>>>     <str name="defType">edismax</str>
> >>>>     <str name="qf">text^2 text_par^1</str> <!-- Boost whole word
> >>>>     matches more than partial matches in the scoring.
> >>>>     -->
> >>>>     <bool name="termVectors">true</bool>
> >>>>     <bool name="termPositions">true</bool>
> >>>>     <bool name="termOffsets">true</bool>
> >>>>     <bool name="hl.useFastVectorHighlighter">true</bool>
> >>>>     <str name="hl.boundaryScanner">breakIterator</str>
> >>>>     <str name="hl.snippets">2</str>
> >>>>     <str name="hl.fl">name name_par description description_par
> >>>>     content content_par</str>
> >>>>     <int name="hl.fragsize">162</int>
> >>>>     <str name="hl.fragListBuilder">simple</str>
> >>>>     <str name="hl.fragmentsBuilder">default</str>
> >>>>     <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
> >>>>     <str name="hl.simple.post"><![CDATA[</strong>]]></str>
> >>>>     <str name="hl.tag.pre"><![CDATA[<strong>]]></str>
> >>>>     <str name="hl.tag.post"><![CDATA[</strong>]]></str>
> >>>>   </lst>
> >>>> </requestHandler>
> >>>>
> >>>>
> >>>> Cheers!
> >>>>
> >>>> - Andy
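
For anyone following the two suggestions quoted in this thread (capping phraseLimit, and trying PostingsHighlighter), here is a minimal sketch of what the Solr side might look like. None of it comes from the thread itself. I am assuming that `hl.phraseLimit` is the request parameter that maps to the FastVectorHighlighter phrase limit, and the `PostingsSolrHighlighter` class name and the `storeOffsetsWithPositions` attribute are from my recollection of the Solr 4.x documentation, so please check all three against your version. The field and handler names are the ones from Andy's config.

```xml
<!-- Option 1 (assumption: your Solr version honors hl.phraseLimit).
     Caps how many phrases FastVectorHighlighter analyzes per field.
     Goes inside the existing <lst name="defaults"> of the /select handler. -->
<int name="hl.phraseLimit">1</int>

<!-- Option 2 (assumption: PostingsSolrHighlighter ships with your Solr 4.x build).
     Swaps the highlighter implementation in solrconfig.xml. -->
<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
</searchComponent>

<!-- PostingsHighlighter reads offsets from the postings lists rather than
     from term vectors, so each highlighted field in schema.xml would need
     storeOffsetsWithPositions="true", followed by a full reindex. -->
<field name="content" type="text_general" indexed="true" stored="true"
       multiValued="true" storeOffsetsWithPositions="true"/>
```

With option 2, `hl.useFastVectorHighlighter` would presumably need to be removed or set to false, and the termVectors/termPositions/termOffsets attributes should no longer be needed for highlighting. Whether PostingsHighlighter behaves well with the EdgeNGram "*_par" fields is something I have not verified.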