Ah! You're not just highlighting, you're snippetizing. This makes it easier.

Highlighting does not stream: it pulls the entire stored contents into
one string and then extracts the snippet from it.  If you want this to
be fast, you have to split the text into small pieces and only
snippetize from the most relevant ones. So, index each piece as a
separate document, with a common group id tying it back to the document
it came from. You might have to do 2 queries to achieve what you want,
but the second query (a repeat of the first) will be blindingly fast.
Often <1ms.
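A minimal Ruby sketch of what I mean, in the style of your query-building code. Field names like "groupid" and "chunkno", and the chunk size, are my assumptions, not anything Solr gives you; adapt them to your schema:

```ruby
require 'cgi'

# Split a large log into small chunk documents that share a groupid,
# so highlighting only has to re-analyze one small stored field.
# (Sketch only: "groupid" and "chunkno" are hypothetical field names.)
def chunk_docs(filename, text, chunk_chars = 4096)
  text.scan(/.{1,#{chunk_chars}}/m).each_with_index.map do |piece, i|
    { 'id'      => "#{filename}-#{i}",
      'groupid' => filename,   # ties the chunk back to its source file
      'chunkno' => i,
      'body'    => piece }
  end
end

# First query: find which groups (files) match, no highlighting at all.
def match_query(q)
  '/solr/select?wt=ruby&rows=10&fl=groupid,score&q=' +
    CGI.escape("body:#{q}")
end

# Second query: highlight only within one group's small chunks.
# This is the repeat query that comes back very fast.
def snippet_query(q, groupid)
  '/solr/select?wt=ruby&rows=1&hl=true&hl.fl=body&hl.snippets=1' +
    '&fq=' + CGI.escape("groupid:#{groupid}") +
    '&q=' + CGI.escape("body:#{q}")
end
```

The point of the filter query (fq) on groupid in the second request is that Solr caches it, so repeated snippet lookups within the same file stay cheap.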

Good luck!

Lance

On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
> However, I do need to search the entire document, or else the highlighting 
> will sometimes be blank :-(
> Thanks!
>
> - Peter
>
> ps. sorry for the many responses - I'm rushing around trying to get this 
> working.
>
> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>
>> Correction: it went from 17 seconds to 10 seconds; the first time, I was 
>> changing hl.regex.maxAnalyzedChars instead.
>> Thanks!
>>
>> -Peter
>>
>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>
>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>
>>>> Did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>
>>> Yes, I tried dropping it down to 21, but it didn't have much of an impact 
>>> (one search I just tried went from 17 seconds to 15.8 seconds, and this is 
>>> an 8-core Mac Pro with 6GB RAM - 4GB for java).
>>>
>>>> Also, regular-expression highlighting is more expensive, I think.
>>>> What does the 'fuzzy' variable mean? If you use it to query via
>>>> "someTerm~" instead of "someTerm",
>>>> then you should try the trunk of Solr, which is a lot faster for fuzzy
>>>> and other wildcard searches.
>>>
>>> "fuzzy" could be set to "*" but isn't right now.
>>>
>>> Thanks for the tips, Peter - this has been very frustrating!
>>>
>>>
>>> - Peter
>>>
>>>> Regards,
>>>> Peter.
>>>>
>>>>> Data set: About 4,000 log files (will eventually grow to millions).  
>>>>> Average log file is 850k.  Largest log file (so far) is about 70MB.
>>>>>
>>>>> Problem: When I search for common terms, the query time goes from under 
>>>>> 2-3 seconds to about 60 seconds.  TermVectors etc are enabled.  When I 
>>>>> disable highlighting, performance improves a lot, but is still slow for 
>>>>> some queries (7 seconds).  Thanks in advance for any ideas!
>>>>>
>>>>>
>>>>> -Peter
>>>>>
>>>>>
>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>> 4GB RAM server
>>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>>
>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>> schema.xml changes:
>>>>>
>>>>>  <fieldType name="text_pl" class="solr.TextField">
>>>>>    <analyzer>
>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" 
>>>>> generateNumberParts="0" catenateWords="0" catenateNumbers="0" 
>>>>> catenateAll="0" splitOnCaseChange="0"/>
>>>>>    </analyzer>
>>>>>  </fieldType>
>>>>>
>>>>> ...
>>>>>
>>>>> <field name="body" type="text_pl" indexed="true" stored="true" 
>>>>> multiValued="false" termVectors="true" termPositions="true" 
>>>>> termOffsets="true" />
>>>>>  <field name="timestamp" type="date" indexed="true" stored="true" 
>>>>> default="NOW" multiValued="false"/>
>>>>> <field name="version" type="string" indexed="true" stored="true" 
>>>>> multiValued="false"/>
>>>>> <field name="device" type="string" indexed="true" stored="true" 
>>>>> multiValued="false"/>
>>>>> <field name="filename" type="string" indexed="true" stored="true" 
>>>>> multiValued="false"/>
>>>>> <field name="filesize" type="long" indexed="true" stored="true" 
>>>>> multiValued="false"/>
>>>>> <field name="pversion" type="int" indexed="true" stored="true" 
>>>>> multiValued="false"/>
>>>>> <field name="first2md5" type="string" indexed="false" stored="true" 
>>>>> multiValued="false"/>
>>>>> <field name="ckey" type="string" indexed="true" stored="true" 
>>>>> multiValued="false"/>
>>>>>
>>>>> ...
>>>>>
>>>>> <dynamicField name="*" type="ignored" multiValued="true" />
>>>>> <defaultSearchField>body</defaultSearchField>
>>>>> <solrQueryParser defaultOperator="AND"/>
>>>>>
>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>> solrconfig.xml changes:
>>>>>
>>>>>  <maxFieldLength>2147483647</maxFieldLength>
>>>>>  <ramBufferSizeMB>128</ramBufferSizeMB>
>>>>>
>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>> The query:
>>>>>
>>>>> rowStr = "&rows=10"
>>>>> facet = 
>>>>> "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + 
>>>>> "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>>>> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, 
>>>>> '').gsub(/([:~!<>="])/,'\\\\\1') + fuzzy + minLogSizeStr)
>>>>>
>>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' 
>>>>> : ('&fq='+p['fq'].to_s) ) + justq + rowStr + facet + fields + termvectors 
>>>>> + hl + hl_regex
>>>>>
>>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + 
>>>>> p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> http://karussell.wordpress.com/
>>>>
>>>
>>
>
>



-- 
Lance Norskog
goks...@gmail.com
