So, I went through all the effort of breaking my documents into max 1 MB chunks, and searching for "hello" still takes over 40 seconds (searching across 7433 documents):
8 results (41980 ms)

What is going on??? (Scroll down for my config.)

-Peter

On Aug 16, 2010, at 3:59 PM, Markus Jelsma wrote:

> I've no idea if it's possible, but I'd at least try to return an ArrayList of
> rows instead of just a single row. And if it doesn't work, which is probably
> the case, how about filing an issue in Jira?
>
> Reading the docs on the matter, I think it should be made possible to
> return multiple rows in an ArrayList.
>
> -----Original message-----
> From: Peter Spam <ps...@mac.com>
> Sent: Tue 17-08-2010 00:47
> To: solr-user@lucene.apache.org
> Subject: Re: Solr searching performance issues, using large documents
>
> Still stuck on this - any hints on how to write the JavaScript to split a
> document? Thanks!
>
> -Pete
>
> On Aug 5, 2010, at 8:10 PM, Lance Norskog wrote:
>
>> You may have to write your own JavaScript to read in the giant field
>> and split it up.
>>
>> On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam <ps...@mac.com> wrote:
>>> I've read through the DataImportHandler page a few times, and still can't
>>> figure out how to separate a large document into smaller documents. Any
>>> hints? :-) Thanks!
>>>
>>> -Peter
>>>
>>> On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:
>>>
>>>> Spanning won't work - you would have to make overlapping mini-documents
>>>> if you want to support this.
>>>>
>>>> I don't know how big the chunks should be - you'll have to experiment.
>>>>
>>>> Lance
>>>>
>>>> On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam <ps...@mac.com> wrote:
>>>>> What would happen if the search query phrase spanned separate document
>>>>> chunks?
>>>>>
>>>>> Also, what would the optimal size of chunks be?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -Peter
>>>>>
>>>>> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
>>>>>
>>>>>> Not that I know of.
>>>>>>
>>>>>> The DataImportHandler has the ability to create multiple documents
>>>>>> from one input stream.
>>>>>> It is possible to create a DIH file that reads large log files and
>>>>>> splits each one into N documents, with the file name as a common
>>>>>> field. The DIH wiki page tells you in general how to make a DIH file:
>>>>>>
>>>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>>>
>>>>>> From this, you should be able to make a DIH file that puts log files
>>>>>> in as separate documents. As to splitting files up into
>>>>>> mini-documents, you might have to write a bit of JavaScript to achieve
>>>>>> this. There is no data structure or software that implements
>>>>>> structured documents.
>>>>>>
>>>>>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>>> Thanks for the pointer, Lance! Is there an example of this somewhere?
>>>>>>>
>>>>>>> -Peter
>>>>>>>
>>>>>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>>>>>>
>>>>>>>> Ah! You're not just highlighting, you're snippetizing. This makes it
>>>>>>>> easier.
>>>>>>>>
>>>>>>>> Highlighting does not stream - it pulls the entire stored contents
>>>>>>>> into one string and then pulls out the snippet. If you want this to
>>>>>>>> be fast, you have to split up the text into small pieces and only
>>>>>>>> snippetize from the most relevant text. So, use separate documents
>>>>>>>> with a common group id pointing back to the document each piece came
>>>>>>>> from. You might have to do 2 queries to achieve what you want, but
>>>>>>>> the second run of the same query will be blindingly fast. Often <1ms.
>>>>>>>>
>>>>>>>> Good luck!
>>>>>>>>
>>>>>>>> Lance
>>>>>>>>
>>>>>>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>>>>> However, I do need to search the entire document, or else the
>>>>>>>>> highlighting will sometimes be blank :-(
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> - Peter
>>>>>>>>>
>>>>>>>>> ps. Sorry for the many responses - I'm rushing around trying to get
>>>>>>>>> this working.
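[Editor's note: Lance's chunking advice above - index small chunk documents that share a common group field, so highlighting only scans a small stored body - can be sketched in Ruby. The chunk size, field names, and helper name below are illustrative assumptions, not the DIH/JavaScript approach the thread discusses; posting the resulting documents to Solr is left out.]

```ruby
CHUNK_SIZE = 1_000_000  # ~1 MB of text per chunk document (experiment, per Lance)

# Split one large log's text into chunk documents. Each chunk carries a
# common "filename" field (the group id) and a unique per-chunk "id", so
# results can be grouped back to the original file with a second query.
def chunk_documents(filename, text, chunk_size = CHUNK_SIZE)
  docs = []
  part = 0
  offset = 0
  while offset < text.length
    docs << {
      'id'       => "#{filename}-#{part}",  # unique per chunk
      'filename' => filename,               # common group field
      'body'     => text[offset, chunk_size]
    }
    offset += chunk_size
    part   += 1
  end
  docs
end

docs = chunk_documents('device.log', 'x' * 2_500_000)
puts docs.length  # => 3
```

Note that chunking on a fixed byte count can split a phrase across two chunks, which is exactly the span problem raised earlier in the thread; overlapping chunks would mitigate it at the cost of index size.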
>>>>>>>>>
>>>>>>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>>>>>>>
>>>>>>>>>> Correction - it went from 17 seconds to 10 seconds - I was changing
>>>>>>>>>> hl.regex.maxAnalyzedChars the first time.
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> -Peter
>>>>>>>>>>
>>>>>>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>>>>>>>
>>>>>>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>>>>>>>>>
>>>>>>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an
>>>>>>>>>>> impact (one search I just tried went from 17 seconds to 15.8
>>>>>>>>>>> seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for Java).
>>>>>>>>>>>
>>>>>>>>>>>> Also, regular-expression highlighting is more expensive, I think.
>>>>>>>>>>>> What does the 'fuzzy' variable mean? If you use it to query via
>>>>>>>>>>>> "~someTerm" instead of "someTerm", then you should try the trunk
>>>>>>>>>>>> of Solr, which is a lot faster for fuzzy and other wildcard
>>>>>>>>>>>> searches.
>>>>>>>>>>>
>>>>>>>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>>>>>>>
>>>>>>>>>>> - Peter
>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Peter.
>>>>>>>>>>>>
>>>>>>>>>>>>> Data set: About 4,000 log files (will eventually grow to
>>>>>>>>>>>>> millions). Average log file is 850k. Largest log file (so far)
>>>>>>>>>>>>> is about 70MB.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Problem: When I search for common terms, the query time goes from
>>>>>>>>>>>>> under 2-3 seconds to about 60 seconds. TermVectors etc. are
>>>>>>>>>>>>> enabled. When I disable highlighting, performance improves a
>>>>>>>>>>>>> lot, but is still slow for some queries (7 seconds). Thanks in
>>>>>>>>>>>>> advance for any ideas!
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> 4GB RAM server
>>>>>>>>>>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> schema.xml changes:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <fieldType name="text_pl" class="solr.TextField">
>>>>>>>>>>>>>   <analyzer>
>>>>>>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>>>>>>             generateWordParts="0" generateNumberParts="0"
>>>>>>>>>>>>>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>>>>>>>>>>>>>             splitOnCaseChange="0"/>
>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false" termVectors="true" termPositions="true"
>>>>>>>>>>>>>        termOffsets="true" />
>>>>>>>>>>>>> <field name="timestamp" type="date" indexed="true" stored="true"
>>>>>>>>>>>>>        default="NOW" multiValued="false"/>
>>>>>>>>>>>>> <field name="version" type="string" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>> <field name="device" type="string" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>> <field name="filename" type="string" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>> <field name="filesize" type="long" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>> <field name="pversion" type="int" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>> <field name="first2md5" type="string" indexed="false" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>> <field name="ckey" type="string" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> <dynamicField name="*" type="ignored" multiValued="true" />
>>>>>>>>>>>>> <defaultSearchField>body</defaultSearchField>
>>>>>>>>>>>>> <solrQueryParser defaultOperator="AND"/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> solrconfig.xml changes:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <maxFieldLength>2147483647</maxFieldLength>
>>>>>>>>>>>>> <ramBufferSizeMB>128</ramBufferSizeMB>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> The query:
>>>>>>>>>>>>>
>>>>>>>>>>>>> rowStr = "&rows=10"
>>>>>>>>>>>>> facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>>>>>>>>>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>>>>>>>>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>>>>>>>>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>>>>>>>>>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>>>>>>>>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
>>>>>>>>>>>>>   "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>>>>>>>>>>>> justq = '&q=' + CGI::escape('body:' + fuzzy +
>>>>>>>>>>>>>   p['q'].to_s.gsub(/\\/, '').gsub(/([:~!<>="])/,'\\\\\1') + fuzzy +
>>>>>>>>>>>>>   minLogSizeStr)
>>>>>>>>>>>>>
>>>>>>>>>>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' +
>>>>>>>>>>>>>   (p['fq'].empty? ?
>>>>>>>>>>>>>     '' : ('&fq=' + p['fq'].to_s)) + justq + rowStr +
>>>>>>>>>>>>>   facet + fields + termvectors + hl + hl_regex
>>>>>>>>>>>>>
>>>>>>>>>>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) +
>>>>>>>>>>>>>   '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> http://karussell.wordpress.com/
>>>>>>>>
>>>>>>>> --
>>>>>>>> Lance Norskog
>>>>>>>> goks...@gmail.com
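[Editor's note: the query assembly quoted in the thread concatenates many pre-escaped string fragments, which makes the escaping easy to get wrong. Below is a hedged tidy-up of the same idea: the same Solr parameters, built as a list of key/value pairs (a plain hash cannot hold the repeated facet.field keys) and escaped in one place. The helper name and structure are illustrative, not Peter's actual code; the regex fragmenter and termvector parameters are omitted for brevity.]

```ruby
require 'cgi'

# Build the /solr/select URL from key/value pairs so CGI.escape is
# applied uniformly to every value. facet.field appears three times,
# matching the thread's query; Solr accepts repeated parameters.
def build_solr_query(q, rows: 10, fq: nil)
  pairs = [
    ['timeAllowed', 5000],
    ['wt', 'ruby'],
    ['q', "body:#{q}"],
    ['rows', rows],
    ['facet', true], ['facet.limit', 10],
    ['facet.field', 'device'], ['facet.field', 'ckey'], ['facet.field', 'version'],
    ['fl', 'id,score,filename,version,device,first2md5,filesize,ckey'],
    ['hl', true], ['hl.fl', 'body'], ['hl.snippets', 1], ['hl.fragsize', 400]
  ]
  pairs << ['fq', fq] if fq   # filter query is optional, as in the thread
  '/solr/select?' + pairs.map { |k, v| "#{k}=#{CGI.escape(v.to_s)}" }.join('&')
end

url = build_solr_query('hello', fq: 'device:iphone')
```

Escaping the raw `body:hello` value once, at serialization time, replaces the thread's mix of pre-escaped fragments and `gsub`-based backslash escaping.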