Re: Solr searching performance issues, using large documents

Peter Spam Sat, 31 Jul 2010 12:32:38 -0700

On Jul 30, 2010, at 7:04 PM, Lance Norskog wrote:

> Wait- how much text are you highlighting? You say these logfiles are X
> big- how big are the actual documents you are storing?


I want it to work like Google: I put the entire document (sometimes 60MB) in a 
single field, and then highlight just 2-4 lines of it.
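
As a rough sketch of the highlighting part of such a request (not the exact code I run): the snippet below builds the hl parameters in Ruby with a finite hl.maxAnalyzedChars instead of 2147483647. The helper name highlight_params and the 1,000,000 default are made up for illustration, and it uses URI.encode_www_form rather than the CGI::escape calls in the quoted query.

```ruby
require 'uri'

# Hypothetical helper (not from the original script): builds the
# highlighting portion of a Solr query string.
# max_chars is an illustrative cap on hl.maxAnalyzedChars; the default
# here is an assumption, not a recommended or measured value.
def highlight_params(field, max_chars: 1_000_000, snippets: 1, fragsize: 400)
  params = {
    'hl'                  => 'true',
    'hl.fl'               => field,      # field to highlight
    'hl.snippets'         => snippets,   # snippets returned per field
    'hl.fragsize'         => fragsize,   # approximate snippet size in characters
    'hl.maxAnalyzedChars' => max_chars   # how far into the field the highlighter looks
  }
  # URI.encode_www_form escapes keys and values and joins them with '&'.
  URI.encode_www_form(params)
end

puts highlight_params('body')
```

The trade-off with a smaller cap is that text past the cap is not examined for snippets, so a match deep inside a 60MB document may come back without a highlight.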


Thanks,
Peter


> On Fri, Jul 30, 2010 at 1:16 PM, Peter Karich <peat...@yahoo.de> wrote:
>> Hi Peter :-),
>> 
>> did you already try other values for
>> 
>> hl.maxAnalyzedChars=2147483647
>> 
>> ? Also, regular expression highlighting is more expensive, I think.
>> What does the 'fuzzy' variable mean? If you use it to query via
>> "~someTerm" instead of "someTerm",
>> then you should try the Solr trunk, which is a lot faster for fuzzy and
>> other wildcard searches.
>> 
>> Regards,
>> Peter.
>> 
>>> Data set: About 4,000 log files (will eventually grow to millions).  
>>> Average log file is 850k.  Largest log file (so far) is about 70MB.
>>> 
>>> Problem: When I search for common terms, the query time goes from under 
>>> 3 seconds to about 60 seconds.  TermVectors etc. are enabled.  When I 
>>> disable highlighting, performance improves a lot, but some queries are 
>>> still slow (7 seconds).  Thanks in advance for any ideas!
>>> 
>>> 
>>> -Peter
>>> 
>>> 
>>> -------------------------------------------------------------------------------------------------------------------------------------
>>> 
>>> 4GB RAM server
>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>> 
>>> -------------------------------------------------------------------------------------------------------------------------------------
>>> 
>>> schema.xml changes:
>>> 
>>>     <fieldType name="text_pl" class="solr.TextField">
>>>       <analyzer>
>>>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
>>>                 generateNumberParts="0" catenateWords="0" catenateNumbers="0"
>>>                 catenateAll="0" splitOnCaseChange="0"/>
>>>       </analyzer>
>>>     </fieldType>
>>> 
>>> ...
>>> 
>>>    <field name="body" type="text_pl" indexed="true" stored="true" 
>>> multiValued="false" termVectors="true" termPositions="true" 
>>> termOffsets="true" />
>>>     <field name="timestamp" type="date" indexed="true" stored="true" 
>>> default="NOW" multiValued="false"/>
>>>    <field name="version" type="string" indexed="true" stored="true" 
>>> multiValued="false"/>
>>>    <field name="device" type="string" indexed="true" stored="true" 
>>> multiValued="false"/>
>>>    <field name="filename" type="string" indexed="true" stored="true" 
>>> multiValued="false"/>
>>>    <field name="filesize" type="long" indexed="true" stored="true" 
>>> multiValued="false"/>
>>>    <field name="pversion" type="int" indexed="true" stored="true" 
>>> multiValued="false"/>
>>>    <field name="first2md5" type="string" indexed="false" stored="true" 
>>> multiValued="false"/>
>>>    <field name="ckey" type="string" indexed="true" stored="true" 
>>> multiValued="false"/>
>>> 
>>> ...
>>> 
>>>  <dynamicField name="*" type="ignored" multiValued="true" />
>>>  <defaultSearchField>body</defaultSearchField>
>>>  <solrQueryParser defaultOperator="AND"/>
>>> 
>>> -------------------------------------------------------------------------------------------------------------------------------------
>>> 
>>> solrconfig.xml changes:
>>> 
>>>     <maxFieldLength>2147483647</maxFieldLength>
>>>     <ramBufferSizeMB>128</ramBufferSizeMB>
>>> 
>>> -------------------------------------------------------------------------------------------------------------------------------------
>>> 
>>> The query:
>>> 
>>> rowStr = "&rows=10"
>>> facet = 
>>> "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>> regexv = "(?m)^.*\n.*\n.*$"
>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + 
>>> "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, 
>>> '').gsub(/([:~!<>="])/,'\\\\\1') + fuzzy + minLogSizeStr)
>>> 
>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : 
>>> ('&fq='+p['fq'].to_s) ) + justq + rowStr + facet + fields + termvectors + 
>>> hl + hl_regex
>>> 
>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + 
>>> p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>> 
>>> 
>>> 
>> 
>> 
>> --
>> http://karussell.wordpress.com/
>> 
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
