So, I went through all the effort of breaking my documents into max 1 MB chunks, and searching for "hello" still takes over 40 seconds (searching across 7433 documents):
8 results (41980 ms)

What is going on??? (Scroll down for my config.)

-Peter

On Aug 16, 2010, at 3:59 PM, Markus Jelsma wrote:

> I've no idea if it's possible, but I'd at least try to return an ArrayList of
> rows instead of just a single row. And if it doesn't work, which is probably
> the case, how about filing an issue in Jira?
>
> Reading the docs on the matter, I think it should be made possible to
> return multiple rows in an ArrayList.
>
> -----Original message-----
> From: Peter Spam <ps...@mac.com>
> Sent: Tue 17-08-2010 00:47
> To: solr-user@lucene.apache.org
> Subject: Re: Solr searching performance issues, using large documents
>
> Still stuck on this - any hints on how to write the JavaScript to split a
> document? Thanks!
>
> -Pete
>
> On Aug 5, 2010, at 8:10 PM, Lance Norskog wrote:
>
>> You may have to write your own JavaScript to read in the giant field
>> and split it up.
>>
>> On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam <ps...@mac.com> wrote:
>>> I've read through the DataImportHandler page a few times, and still can't
>>> figure out how to separate a large document into smaller documents. Any
>>> hints? :-) Thanks!
>>>
>>> -Peter
>>>
>>> On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:
>>>
>>>> Spanning won't work - you would have to make overlapping mini-documents
>>>> if you want to support this.
>>>>
>>>> I don't know how big the chunks should be - you'll have to experiment.
>>>>
>>>> Lance
>>>>
>>>> On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam <ps...@mac.com> wrote:
>>>>> What would happen if the search query phrase spanned separate document
>>>>> chunks?
>>>>>
>>>>> Also, what would the optimal size of chunks be?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -Peter
>>>>>
>>>>> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
>>>>>
>>>>>> Not that I know of.
>>>>>>
>>>>>> The DataImportHandler has the ability to create multiple documents
>>>>>> from one input stream.
>>>>>> It is possible to create a DIH file that reads large log files and
>>>>>> splits each one into N documents, with the file name as a common
>>>>>> field. The DIH wiki page tells you in general how to make a DIH file:
>>>>>>
>>>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>>>
>>>>>> From this, you should be able to make a DIH file that puts log files
>>>>>> in as separate documents. As to splitting files up into
>>>>>> mini-documents, you might have to write a bit of JavaScript to achieve
>>>>>> this. There is no data structure or software that implements
>>>>>> structured documents.
>>>>>>
>>>>>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>>> Thanks for the pointer, Lance! Is there an example of this somewhere?
>>>>>>>
>>>>>>> -Peter
>>>>>>>
>>>>>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>>>>>>
>>>>>>>> Ah! You're not just highlighting, you're snippetizing. This makes it
>>>>>>>> easier.
>>>>>>>>
>>>>>>>> Highlighting does not stream - it pulls the entire stored contents
>>>>>>>> into one string and then pulls out the snippet. If you want this to
>>>>>>>> be fast, you have to split up the text into small pieces and only
>>>>>>>> snippetize from the most relevant text. So, use separate documents
>>>>>>>> with a common group id pointing back to the document each piece came
>>>>>>>> from. You might have to do 2 queries to achieve what you want, but
>>>>>>>> the second run of the same query will be blindingly fast. Often <1ms.
>>>>>>>>
>>>>>>>> Good luck!
>>>>>>>>
>>>>>>>> Lance
>>>>>>>>
>>>>>>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>>>>> However, I do need to search the entire document, or else the
>>>>>>>>> highlighting will sometimes be blank :-(
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> - Peter
>>>>>>>>>
>>>>>>>>> ps. Sorry for the many responses - I'm rushing around trying to get
>>>>>>>>> this working.
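[Editor's note: Lance's chunking advice above - index small chunk documents that share a common group field, so highlighting only scans a small stored body - can be sketched in Ruby. The chunk size, field names, and helper name below are illustrative assumptions, not the DIH/JavaScript approach the thread discusses; posting the resulting documents to Solr is left out.]

```ruby
CHUNK_SIZE = 1_000_000  # ~1 MB of text per chunk document (experiment, per Lance)

# Split one large log's text into chunk documents. Each chunk carries a
# common "filename" field (the group id) and a unique per-chunk "id", so
# results can be grouped back to the original file with a second query.
def chunk_documents(filename, text, chunk_size = CHUNK_SIZE)
  docs = []
  part = 0
  offset = 0
  while offset < text.length
    docs << {
      'id'       => "#{filename}-#{part}",  # unique per chunk
      'filename' => filename,               # common group field
      'body'     => text[offset, chunk_size]
    }
    offset += chunk_size
    part   += 1
  end
  docs
end

docs = chunk_documents('device.log', 'x' * 2_500_000)
puts docs.length  # => 3
```

Note that chunking on a fixed byte count can split a phrase across two chunks, which is exactly the span problem raised earlier in the thread; overlapping chunks would mitigate it at the cost of index size.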
>>>>>>>>>
>>>>>>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>>>>>>>
>>>>>>>>>> Correction - it went from 17 seconds to 10 seconds - I was changing
>>>>>>>>>> hl.regex.maxAnalyzedChars the first time.
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> -Peter
>>>>>>>>>>
>>>>>>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>>>>>>>
>>>>>>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>>>>>>>>>
>>>>>>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an
>>>>>>>>>>> impact (one search I just tried went from 17 seconds to 15.8
>>>>>>>>>>> seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for Java).
>>>>>>>>>>>
>>>>>>>>>>>> Also, regular-expression highlighting is more expensive, I think.
>>>>>>>>>>>> What does the 'fuzzy' variable mean? If you use it to query via
>>>>>>>>>>>> "~someTerm" instead of "someTerm", then you should try the trunk
>>>>>>>>>>>> of Solr, which is a lot faster for fuzzy and other wildcard
>>>>>>>>>>>> searches.
>>>>>>>>>>>
>>>>>>>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>>>>>>>
>>>>>>>>>>> - Peter
>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Peter.
>>>>>>>>>>>>
>>>>>>>>>>>>> Data set: About 4,000 log files (will eventually grow to
>>>>>>>>>>>>> millions). Average log file is 850k. Largest log file (so far)
>>>>>>>>>>>>> is about 70MB.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Problem: When I search for common terms, the query time goes from
>>>>>>>>>>>>> under 2-3 seconds to about 60 seconds. TermVectors etc. are
>>>>>>>>>>>>> enabled. When I disable highlighting, performance improves a
>>>>>>>>>>>>> lot, but is still slow for some queries (7 seconds). Thanks in
>>>>>>>>>>>>> advance for any ideas!
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> 4GB RAM server
>>>>>>>>>>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> schema.xml changes:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <fieldType name="text_pl" class="solr.TextField">
>>>>>>>>>>>>>   <analyzer>
>>>>>>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>>>>>>             generateWordParts="0" generateNumberParts="0"
>>>>>>>>>>>>>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>>>>>>>>>>>>>             splitOnCaseChange="0"/>
>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false" termVectors="true" termPositions="true"
>>>>>>>>>>>>>        termOffsets="true" />
>>>>>>>>>>>>> <field name="timestamp" type="date" indexed="true" stored="true"
>>>>>>>>>>>>>        default="NOW" multiValued="false"/>
>>>>>>>>>>>>> <field name="version" type="string" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>> <field name="device" type="string" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>> <field name="filename" type="string" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>> <field name="filesize" type="long" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>> <field name="pversion" type="int" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>> <field name="first2md5" type="string" indexed="false" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>> <field name="ckey" type="string" indexed="true" stored="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> <dynamicField name="*" type="ignored" multiValued="true" />
>>>>>>>>>>>>> <defaultSearchField>body</defaultSearchField>
>>>>>>>>>>>>> <solrQueryParser defaultOperator="AND"/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> solrconfig.xml changes:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <maxFieldLength>2147483647</maxFieldLength>
>>>>>>>>>>>>> <ramBufferSizeMB>128</ramBufferSizeMB>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> The query:
>>>>>>>>>>>>>
>>>>>>>>>>>>> rowStr = "&rows=10"
>>>>>>>>>>>>> facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>>>>>>>>>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>>>>>>>>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>>>>>>>>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>>>>>>>>>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>>>>>>>>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
>>>>>>>>>>>>>   "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>>>>>>>>>>>> justq = '&q=' + CGI::escape('body:' + fuzzy +
>>>>>>>>>>>>>   p['q'].to_s.gsub(/\\/, '').gsub(/([:~!<>="])/,'\\\\\1') + fuzzy +
>>>>>>>>>>>>>   minLogSizeStr)
>>>>>>>>>>>>>
>>>>>>>>>>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' +
>>>>>>>>>>>>>   (p['fq'].empty? ?
>>>>>>>>>>>>>     '' : ('&fq=' + p['fq'].to_s)) + justq + rowStr +
>>>>>>>>>>>>>   facet + fields + termvectors + hl + hl_regex
>>>>>>>>>>>>>
>>>>>>>>>>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) +
>>>>>>>>>>>>>   '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> http://karussell.wordpress.com/
>>>>>>>>
>>>>>>>> --
>>>>>>>> Lance Norskog
>>>>>>>> goks...@gmail.com
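[Editor's note: the query assembly quoted in the thread concatenates many pre-escaped string fragments, which makes the escaping easy to get wrong. Below is a hedged tidy-up of the same idea: the same Solr parameters, built as a list of key/value pairs (a plain hash cannot hold the repeated facet.field keys) and escaped in one place. The helper name and structure are illustrative, not Peter's actual code; the regex fragmenter and termvector parameters are omitted for brevity.]

```ruby
require 'cgi'

# Build the /solr/select URL from key/value pairs so CGI.escape is
# applied uniformly to every value. facet.field appears three times,
# matching the thread's query; Solr accepts repeated parameters.
def build_solr_query(q, rows: 10, fq: nil)
  pairs = [
    ['timeAllowed', 5000],
    ['wt', 'ruby'],
    ['q', "body:#{q}"],
    ['rows', rows],
    ['facet', true], ['facet.limit', 10],
    ['facet.field', 'device'], ['facet.field', 'ckey'], ['facet.field', 'version'],
    ['fl', 'id,score,filename,version,device,first2md5,filesize,ckey'],
    ['hl', true], ['hl.fl', 'body'], ['hl.snippets', 1], ['hl.fragsize', 400]
  ]
  pairs << ['fq', fq] if fq   # filter query is optional, as in the thread
  '/solr/select?' + pairs.map { |k, v| "#{k}=#{CGI.escape(v.to_s)}" }.join('&')
end

url = build_solr_query('hello', fq: 'device:iphone')
```

Escaping the raw `body:hello` value once, at serialization time, replaces the thread's mix of pre-escaped fragments and `gsub`-based backslash escaping.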