Not that I know of. The DataImportHandler (DIH) can create multiple documents from one input stream, so it is possible to write a DIH config that reads large log files and splits each one into N documents, with the file name as a common field. The DIH wiki page explains in general how to make a DIH file: http://wiki.apache.org/solr/DataImportHandler
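
For example, a data-config.xml along these lines should get you close. This is an untested sketch - the directory, the field names, and the one-document-per-line granularity are assumptions on my part, not something you've specified:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Outer entity: one row per log file; rootEntity="false" so the
         files themselves do not become documents. -->
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="/var/log/myapp" fileName=".*\.log"
            recursive="false" rootEntity="false" dataSource="null">
      <!-- Inner entity: one document per line of the current file,
           with the file name stamped on each as the common field. -->
      <entity name="line" processor="LineEntityProcessor"
              url="${f.fileAbsolutePath}"
              transformer="TemplateTransformer">
        <field column="rawLine" name="body"/>
        <field column="filename" template="${f.file}"/>
      </entity>
    </entity>
  </document>
</dataConfig>

You would still need to generate a unique id for each mini-document (e.g. with another transformer), and if you want mini-documents larger than one line, a ScriptTransformer on the inner entity is where the "bit of JavaScript" mentioned below would go.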
From this, you should be able to make a DIH file that puts log files in
as separate documents. As for splitting each file up into mini-documents,
you might have to write a bit of JavaScript to achieve this. There is no
built-in data structure or feature that implements structured documents.

On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
> Thanks for the pointer, Lance! Is there an example of this somewhere?
>
> -Peter
>
> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>
>> Ah! You're not just highlighting, you're snippetizing. That makes it
>> easier.
>>
>> Highlighting does not stream - it pulls the entire stored contents into
>> one string and then pulls the snippet out of that. If you want this to
>> be fast, you have to split the text up into small pieces and only
>> snippetize from the most relevant ones. So: separate documents with a
>> common group id identifying the file each one came from. You might have
>> to do 2 queries to achieve what you want, but the second query, for the
>> same terms, will be blindingly fast. Often <1ms.
>>
>> Good luck!
>>
>> Lance
>>
>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
>>> However, I do need to search the entire document, or else the
>>> highlighting will sometimes be blank :-(
>>> Thanks!
>>>
>>> - Peter
>>>
>>> ps. Sorry for the many responses - I'm rushing around trying to get
>>> this working.
>>>
>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>
>>>> Correction - it went from 17 seconds to 10 seconds - I was changing
>>>> hl.regex.maxAnalyzedChars the first time.
>>>> Thanks!
>>>>
>>>> -Peter
>>>>
>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>
>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>
>>>>>> Did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>>>
>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an
>>>>> impact (one search I just tried went from 17 seconds to 15.8 seconds,
>>>>> and this is an 8-core Mac Pro with 6GB RAM - 4GB for Java).
>>>>>
>>>>>> Also, regular expression highlighting is more expensive, I think.
>>>>>> What does the 'fuzzy' variable mean? If you use it to query via
>>>>>> "~someTerm" instead of "someTerm", then you should try the trunk of
>>>>>> Solr, which is a lot faster for fuzzy and other wildcard searches.
>>>>>
>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>
>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>
>>>>> - Peter
>>>>>
>>>>>> Regards,
>>>>>> Peter.
>>>>>>
>>>>>>> Data set: About 4,000 log files (will eventually grow to millions).
>>>>>>> The average log file is 850KB; the largest (so far) is about 70MB.
>>>>>>>
>>>>>>> Problem: When I search for common terms, the query time goes from
>>>>>>> under 2-3 seconds to about 60 seconds. TermVectors etc. are enabled.
>>>>>>> When I disable highlighting, performance improves a lot, but is
>>>>>>> still slow for some queries (7 seconds). Thanks in advance for any
>>>>>>> ideas!
>>>>>>>
>>>>>>> -Peter
>>>>>>>
>>>>>>> -------------------------------------------------------------------
>>>>>>>
>>>>>>> 4GB RAM server
>>>>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>>>>
>>>>>>> -------------------------------------------------------------------
>>>>>>>
>>>>>>> schema.xml changes:
>>>>>>>
>>>>>>> <fieldType name="text_pl" class="solr.TextField">
>>>>>>>   <analyzer>
>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>             generateWordParts="0" generateNumberParts="0"
>>>>>>>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>>>>>>>             splitOnCaseChange="0"/>
>>>>>>>   </analyzer>
>>>>>>> </fieldType>
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>>>>>        multiValued="false" termVectors="true" termPositions="true"
>>>>>>>        termOffsets="true"/>
>>>>>>> <field name="timestamp" type="date" indexed="true" stored="true"
>>>>>>>        default="NOW" multiValued="false"/>
>>>>>>> <field name="version" type="string" indexed="true" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>> <field name="device" type="string" indexed="true" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>> <field name="filename" type="string" indexed="true" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>> <field name="filesize" type="long" indexed="true" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>> <field name="pversion" type="int" indexed="true" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>> <field name="first2md5" type="string" indexed="false" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>> <field name="ckey" type="string" indexed="true" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> <dynamicField name="*" type="ignored" multiValued="true"/>
>>>>>>> <defaultSearchField>body</defaultSearchField>
>>>>>>> <solrQueryParser defaultOperator="AND"/>
>>>>>>>
>>>>>>> -------------------------------------------------------------------
>>>>>>>
>>>>>>> solrconfig.xml changes:
>>>>>>>
>>>>>>> <maxFieldLength>2147483647</maxFieldLength>
>>>>>>> <ramBufferSizeMB>128</ramBufferSizeMB>
>>>>>>>
>>>>>>> -------------------------------------------------------------------
>>>>>>>
>>>>>>> The query:
>>>>>>>
>>>>>>> rowStr = "&rows=10"
>>>>>>> facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>>>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>>>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
>>>>>>>   "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>>>>>> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/,
>>>>>>>   '').gsub(/([:~!<>="])/, '\\\\\1') + fuzzy + minLogSizeStr)
>>>>>>>
>>>>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ?
>>>>>>>   '' : ('&fq=' + p['fq'].to_s)) + justq + rowStr + facet + fields +
>>>>>>>   termvectors + hl + hl_regex
>>>>>>>
>>>>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) +
>>>>>>>   '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> http://karussell.wordpress.com/
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>
>

--
Lance Norskog
goks...@gmail.com
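
For reference, the two-query pattern Lance describes above might look roughly like this once each log file is indexed as many small documents sharing a common "filename" field. The URLs and parameter choices here are illustrative assumptions, not something from the thread:

# 1) Find which files match; no highlighting, so this stays cheap:
/solr/select?q=body:someTerm&fl=filename,score&rows=10&facet=true&facet.field=filename

# 2) Re-run the same terms against one file's mini-documents and
#    snippetize only those few kilobytes:
/solr/select?q=body:someTerm&fq=filename:foo.log&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400&rows=10

Because the highlighter now pulls a small stored chunk into memory instead of a stored field of up to 70MB, the second query is the blindingly fast one.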