Re: Solr searching performance issues, using large documents

Peter Spam Fri, 30 Jul 2010 10:01:59 -0700

I do store term vectors:

<field name="body" type="text_pl" indexed="true" stored="true" 
multiValued="false" termVectors="true" termPositions="true" termOffsets="true" 
/>


-Pete

On Jul 30, 2010, at 7:30 AM, Li Li wrote:

> Highlighting time is mainly spent on retrieving the field you want to
> highlight and tokenizing it (if you don't store term vectors).
> You can check which of these is the problem.
> 
> 2010/7/30 Peter Spam <[email protected]>:
>> If I don't do highlighting, it's really fast.  Optimize has no effect.
>> 
>> -Peter
>> 
>> On Jul 29, 2010, at 11:54 AM, dc tech wrote:
>> 
>>> Are you storing the entire log file text in Solr? That's almost 3 GB of
>>> text that you are storing in Solr. Some things to check:
>>> 1) Is this first-time performance, or on repeat queries with the same fields?
>>> 2) Optimize the index and test performance again.
>>> 3) Index without storing the text and see what the performance looks like.
>>> 
>>> 
>>> On 7/29/10, Peter Spam <[email protected]> wrote:
>>>> Any ideas?  I've got 5000 documents with an average size of 850k each, and
>>>> it sometimes takes 2 minutes for a query to come back when highlighting is
>>>> turned on!  Help!
>>>> 
>>>> 
>>>> -Pete
>>>> 
>>>> On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:
>>>> 
>>>>> From the mailing list archive, Koji wrote:
>>>>> 
>>>>>> 1. Provide another field for highlighting and use copyField to copy
>>>>>> plainText to the highlighting field.
>>>>> 
>>>>> and Lance wrote:
>>>>> http://www.mail-archive.com/[email protected]/msg35548.html
>>>>> 
>>>>>> If you want to highlight field X, doing the
>>>>>> termOffsets/termPositions/termVectors will make highlighting that field
>>>>>> faster. You should make a separate field and apply these options to that
>>>>>> field.
>>>>>> 
>>>>>> Now: doing a copyfield adds a "value" to a multiValued field. For a text
>>>>>> field, you get a multi-valued text field. You should only copy one value
>>>>>> to the highlighted field, so just copyField the document to your special
>>>>>> field. To enforce this, I would add multiValued="false" to that field,
>>>>>> just to avoid mistakes.
>>>>>> 
>>>>>> So, all_text should be indexed without the term* attributes, and should
>>>>>> not be stored. Then your document stored in a separate field that you use
>>>>>> for highlighting and has the term* attributes.
>>>>> 
>>>>> I've been experimenting with this, and here's what I've tried:
>>>>> 
>>>>>  <field name="body" type="text_pl" indexed="true" stored="false"
>>>>> multiValued="true" termVectors="true" termPositions="true"
>>>>> termOffsets="true" />
>>>>>  <field name="body_all" type="text_pl" indexed="false" stored="true"
>>>>> multiValued="true" />
>>>>>  <copyField source="body" dest="body_all"/>
>>>>> 
>>>>> ... but it's still very slow (10+ seconds).  Why is it better to have two
>>>>> fields (one indexed but not stored, and the other not indexed but stored)
>>>>> rather than just one field that's both indexed and stored?
>>>>> 
>>>>> 
>>>>> From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors
>>>>> 
>>>>>> If you aren't always using all the stored fields, then enabling lazy
>>>>>> field loading can be a huge boon, especially if compressed fields are
>>>>>> used.
>>>>> 
>>>>> What does this mean?  How do you load a field lazily?
>>>>> 
>>>>> Thanks for your time, guys - this has started to become frustrating, since
>>>>> it works so well, but is very slow!
>>>>> 
>>>>> 
>>>>> -Pete
>>>>> 
>>>>> On Jul 20, 2010, at 5:36 PM, Peter Spam wrote:
>>>>> 
>>>>>> Data set: About 4,000 log files (will eventually grow to millions).
>>>>>> Average log file is 850k.  Largest log file (so far) is about 70MB.
>>>>>> 
>>>>>> Problem: When I search for common terms, the query time goes from under
>>>>>> 2-3 seconds to about 60 seconds.  TermVectors etc are enabled.  When I
>>>>>> disable highlighting, performance improves a lot, but is still slow for
>>>>>> some queries (7 seconds).  Thanks in advance for any ideas!
>>>>>> 
>>>>>> 
>>>>>> -Peter
>>>>>> 
>>>>>> 
>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>> 
>>>>>> 4GB RAM server
>>>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>>> 
>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>> 
>>>>>> schema.xml changes:
>>>>>> 
>>>>>>  <fieldType name="text_pl" class="solr.TextField">
>>>>>>    <analyzer>
>>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
>>>>>> generateNumberParts="0" catenateWords="0" catenateNumbers="0"
>>>>>> catenateAll="0" splitOnCaseChange="0"/>
>>>>>>    </analyzer>
>>>>>>  </fieldType>
>>>>>> 
>>>>>> ...
>>>>>> 
>>>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>>>> multiValued="false" termVectors="true" termPositions="true"
>>>>>> termOffsets="true" />
>>>>>>  <field name="timestamp" type="date" indexed="true" stored="true"
>>>>>> default="NOW" multiValued="false"/>
>>>>>> <field name="version" type="string" indexed="true" stored="true"
>>>>>> multiValued="false"/>
>>>>>> <field name="device" type="string" indexed="true" stored="true"
>>>>>> multiValued="false"/>
>>>>>> <field name="filename" type="string" indexed="true" stored="true"
>>>>>> multiValued="false"/>
>>>>>> <field name="filesize" type="long" indexed="true" stored="true"
>>>>>> multiValued="false"/>
>>>>>> <field name="pversion" type="int" indexed="true" stored="true"
>>>>>> multiValued="false"/>
>>>>>> <field name="first2md5" type="string" indexed="false" stored="true"
>>>>>> multiValued="false"/>
>>>>>> <field name="ckey" type="string" indexed="true" stored="true"
>>>>>> multiValued="false"/>
>>>>>> 
>>>>>> ...
>>>>>> 
>>>>>> <dynamicField name="*" type="ignored" multiValued="true" />
>>>>>> <defaultSearchField>body</defaultSearchField>
>>>>>> <solrQueryParser defaultOperator="AND"/>
>>>>>> 
>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>> 
>>>>>> solrconfig.xml changes:
>>>>>> 
>>>>>>  <maxFieldLength>2147483647</maxFieldLength>
>>>>>>  <ramBufferSizeMB>128</ramBufferSizeMB>
>>>>>> 
>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>> 
>>>>>> The query:
>>>>>> 
>>>>>> rowStr = "&rows=10"
>>>>>> facet =
>>>>>> "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
>>>>>> "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>>>>> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/,
>>>>>> '').gsub(/([:~!<>="])/,'\\\\\1') + fuzzy + minLogSizeStr)
>>>>>> 
>>>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? ''
>>>>>> : ('&fq='+p['fq'].to_s) ) + justq + rowStr + facet + fields + termvectors
>>>>>> + hl + hl_regex
>>>>>> 
>>>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' +
>>>>>> p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> --
>>> Sent from my mobile device
>> 
>>
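
On the lazy field loading question quoted above: as far as I know, this is a solrconfig.xml setting rather than something requested per query. A minimal sketch, assuming the stock solrconfig.xml layout with an existing <query> section (the surrounding element is shown only for context):

```xml
<!-- solrconfig.xml: goes inside the existing <query> section -->
<query>
  <!-- Stored fields not listed in fl are loaded only on demand, so a large
       stored field such as "body" is not read for every returned hit. -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```

With this enabled, a request whose fl omits body (as the fl list in the quoted query does) should not pay for reading the large stored body value for each hit. Highlighting on body still has to read that field for the documents being highlighted, so this alone may not fix the slow highlighting case.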
