Re: Solr searching performance issues, using large documents

Lance Norskog Thu, 05 Aug 2010 20:11:32 -0700

You may have to write your own JavaScript to read in the giant field
and split it up.
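
As a rough illustration of the splitting itself (not a DIH script), here is a minimal Ruby sketch that cuts one log into overlapping, line-based mini-documents before they are sent to Solr. The helper names, the "-N" id suffix, and the chunk and overlap sizes are assumptions made up for this example; only the "id", "filename" and "body" field names come from the schema and query quoted below. A phrase that crosses a chunk boundary is only found if it fits inside the overlap, so the sizes need experimenting.

```ruby
# Hypothetical helper: split a large text into overlapping chunks of lines.
# chunk_lines and overlap_lines are placeholder values to tune by experiment.
def split_into_chunks(text, chunk_lines: 200, overlap_lines: 5)
  unless overlap_lines < chunk_lines
    raise ArgumentError, "overlap_lines must be smaller than chunk_lines"
  end

  lines = text.lines                 # keeps the trailing "\n" on each line
  step = chunk_lines - overlap_lines # how far each new chunk advances
  chunks = []
  start = 0
  while start < lines.length
    chunks << lines[start, chunk_lines].join
    break if start + chunk_lines >= lines.length # last chunk reached the end
    start += step
  end
  chunks
end

# Hypothetical helper: one Hash per mini-document, all sharing the same
# "filename" value so the chunks can be tied back to their source log.
def chunk_documents(filename, text, chunk_lines: 200, overlap_lines: 5)
  chunks = split_into_chunks(text, chunk_lines: chunk_lines,
                                   overlap_lines: overlap_lines)
  chunks.each_with_index.map do |body, i|
    { "id" => "#{filename}-#{i}", "filename" => filename, "body" => body }
  end
end
```

Each Hash would then be indexed as its own Solr document by whatever indexing path you already use.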


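Once the chunks are indexed as separate documents, the two-query idea from earlier in the thread could look roughly like the sketch below. This is only one reading of that suggestion: the function names are invented for the example, and it assumes the user query has already been escaped for the query parser the way the original search.rb does. The request parameters themselves (q, fq, fl, rows, hl, hl.fl, hl.snippets, hl.fragsize, wt) are the ones already used in the quoted query.

```ruby
require 'cgi'

# Query 1 (hypothetical helper): search the small chunk documents and
# highlight only within each matching chunk, not within a whole log file.
def chunk_search_path(user_query, rows: 10)
  '/solr/select?wt=ruby' +
    '&q=' + CGI.escape('body:' + user_query) +
    "&rows=#{rows}" +
    '&fl=id,score,filename' +
    '&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400'
end

# Query 2 (hypothetical helper): the same query, restricted with a filter
# query to the chunks of one source file found by query 1.
def chunks_for_file_path(user_query, filename, rows: 10)
  # Escape quotes and backslashes so the filename is a valid quoted phrase.
  quoted = '"' + filename.gsub(/(["\\])/) { "\\#{$1}" } + '"'
  chunk_search_path(user_query, rows: rows) +
    '&fq=' + CGI.escape('filename:' + quoted)
end
```

The second request repeats the same q, so it should mostly be served from Solr's caches.
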
On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam <ps...@mac.com> wrote:
> I've read through the DataImportHandler page a few times, and still can't 
> figure out how to separate a large document into smaller documents.  Any 
> hints? :-)  Thanks!
>
> -Peter
>
> On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:
>
>> Spanning won't work; you would have to make overlapping mini-documents
>> if you want to support this.
>>
>> I don't know how big the chunks should be; you'll have to experiment.
>>
>> Lance
>>
>> On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam <ps...@mac.com> wrote:
>>> What would happen if the search query phrase spanned separate document 
>>> chunks?
>>>
>>> Also, what would the optimal size of chunks be?
>>>
>>> Thanks!
>>>
>>>
>>> -Peter
>>>
>>> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
>>>
>>>> Not that I know of.
>>>>
>>>> The DataImportHandler has the ability to create multiple documents
>>>> from one input stream. It is possible to create a DIH file that reads
>>>> large log files and splits each one into N documents, with the file
>>>> name as a common field. The DIH wiki page tells you in general how to
>>>> make a DIH file.
>>>>
>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>
>>>> From this, you should be able to make a DIH file that puts log files
>>>> in as separate documents. As to splitting files up into
>>>> mini-documents, you might have to write a bit of JavaScript to achieve
>>>> this. There is no data structure or software that implements
>>>> structured documents.
>>>>
>>>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
>>>>> Thanks for the pointer, Lance!  Is there an example of this somewhere?
>>>>>
>>>>>
>>>>> -Peter
>>>>>
>>>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>>>>
>>>>>> Ah! You're not just highlighting, you're snippetizing. This makes it 
>>>>>> easier.
>>>>>>
>>>>>> Highlighting does not stream; it pulls the entire stored contents into
>>>>>> one string and then pulls out the snippet.  If you want this to be
>>>>>> fast, you have to split up the text into small pieces and only
>>>>>> snippetize from the most relevant text. So, separate documents with a
>>>>>> common group id for the document it came from. You might have to do 2
>>>>>> queries to achieve what you want, but the second query for the same
>>>>>> query will be blindingly fast. Often <1ms.
>>>>>>
>>>>>> Good luck!
>>>>>>
>>>>>> Lance
>>>>>>
>>>>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>>> However, I do need to search the entire document, or else the 
>>>>>>> highlighting will sometimes be blank :-(
>>>>>>> Thanks!
>>>>>>>
>>>>>>> - Peter
>>>>>>>
>>>>>>> ps. sorry for the many responses - I'm rushing around trying to get 
>>>>>>> this working.
>>>>>>>
>>>>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>>>>>
>>>>>>>> Correction - it went from 17 seconds to 10 seconds - I was changing 
>>>>>>>> the hl.regex.maxAnalyzedChars the first time.
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> -Peter
>>>>>>>>
>>>>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>>>>>
>>>>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>>>>>
>>>>>>>>>> did you already try other values for hl.maxAnalyzedChars=2147483647
>>>>>>>>>
>>>>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an 
>>>>>>>>> impact (one search I just tried went from 17 seconds to 15.8 seconds, 
>>>>>>>>> and this is an 8-core Mac Pro with 6GB RAM - 4GB for java).
>>>>>>>>>
>>>>>>>>>> ? Also regular expression highlighting is more expensive, I think.
>>>>>>>>>> What does the 'fuzzy' variable mean? If you use this to query via
>>>>>>>>>> "~someTerm" instead of "someTerm",
>>>>>>>>>> then you should try the trunk of Solr, which is a lot faster for
>>>>>>>>>> fuzzy or other wildcard searches.
>>>>>>>>>
>>>>>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>>>>>
>>>>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Peter
>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Peter.
>>>>>>>>>>
>>>>>>>>>>> Data set: About 4,000 log files (will eventually grow to millions).
>>>>>>>>>>> Average log file is 850 KB.  Largest log file (so far) is about 70 MB.
>>>>>>>>>>>
>>>>>>>>>>> Problem: When I search for common terms, the query time goes from
>>>>>>>>>>> 2-3 seconds to about 60 seconds.  TermVectors etc. are
>>>>>>>>>>> enabled.  When I disable highlighting, performance improves a lot,
>>>>>>>>>>> but is still slow for some queries (7 seconds).  Thanks in advance
>>>>>>>>>>> for any ideas!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Peter
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> 4GB RAM server
>>>>>>>>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>>>>>>>>
>>>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> schema.xml changes:
>>>>>>>>>>>
>>>>>>>>>>>  <fieldType name="text_pl" class="solr.TextField">
>>>>>>>>>>>    <analyzer>
>>>>>>>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>      <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>>>>              generateWordParts="0" generateNumberParts="0"
>>>>>>>>>>>              catenateWords="0" catenateNumbers="0" catenateAll="0"
>>>>>>>>>>>              splitOnCaseChange="0"/>
>>>>>>>>>>>    </analyzer>
>>>>>>>>>>>  </fieldType>
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>>>>>>>>>        multiValued="false" termVectors="true" termPositions="true"
>>>>>>>>>>>        termOffsets="true"/>
>>>>>>>>>>> <field name="timestamp" type="date" indexed="true" stored="true"
>>>>>>>>>>>        default="NOW" multiValued="false"/>
>>>>>>>>>>> <field name="version" type="string" indexed="true" stored="true"
>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>> <field name="device" type="string" indexed="true" stored="true"
>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>> <field name="filename" type="string" indexed="true" stored="true"
>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>> <field name="filesize" type="long" indexed="true" stored="true"
>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>> <field name="pversion" type="int" indexed="true" stored="true"
>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>> <field name="first2md5" type="string" indexed="false" stored="true"
>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>> <field name="ckey" type="string" indexed="true" stored="true"
>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> <dynamicField name="*" type="ignored" multiValued="true" />
>>>>>>>>>>> <defaultSearchField>body</defaultSearchField>
>>>>>>>>>>> <solrQueryParser defaultOperator="AND"/>
>>>>>>>>>>>
>>>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> solrconfig.xml changes:
>>>>>>>>>>>
>>>>>>>>>>>  <maxFieldLength>2147483647</maxFieldLength>
>>>>>>>>>>>  <ramBufferSizeMB>128</ramBufferSizeMB>
>>>>>>>>>>>
>>>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> The query:
>>>>>>>>>>>
>>>>>>>>>>> rowStr = "&rows=10"
>>>>>>>>>>> facet = "&facet=true&facet.limit=10" +
>>>>>>>>>>>   "&facet.field=device&facet.field=ckey&facet.field=version"
>>>>>>>>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>>>>>>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>>>>>>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>>>>>>>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>>>>>>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
>>>>>>>>>>>   "&hl.regex.slop=1&hl.fragmenter=regex" +
>>>>>>>>>>>   "&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>>>>>>>>>> justq = '&q=' + CGI::escape('body:' + fuzzy +
>>>>>>>>>>>   p['q'].to_s.gsub(/\\/, '').gsub(/([:~!<>="])/, '\\\\\1') +
>>>>>>>>>>>   fuzzy + minLogSizeStr)
>>>>>>>>>>>
>>>>>>>>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + 
>>>>>>>>>>> (p['fq'].empty? ? '' : ('&fq='+p['fq'].to_s) ) + justq + rowStr + 
>>>>>>>>>>> facet + fields + termvectors + hl + hl_regex
>>>>>>>>>>>
>>>>>>>>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + 
>>>>>>>>>>> '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> http://karussell.wordpress.com/
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Lance Norskog
>>>>>> goks...@gmail.com
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> goks...@gmail.com
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>
>



-- 
Lance Norskog
goks...@gmail.com
