Still stuck on this - any hints on how to write the JavaScript to split a document? Thanks!

-Pete
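A minimal, untested sketch of the kind of DIH config Lance describes below: FileListEntityProcessor walks a log directory, PlainTextEntityProcessor reads each file into the implicit plainText field, and a script transformer breaks the text into overlapping chunks, each emitted as its own Solr document. The splitLog name, the path, and the chunk sizes here are made up, and this assumes a script transformer may return a java.util.ArrayList of row maps the way a Java Transformer can return a List<Map> - worth verifying against your Solr version before relying on it.

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <script><![CDATA[
    // Hypothetical splitter. Chunk size and overlap are guesses; Lance's
    // advice below is to experiment. The overlap exists so a phrase that
    // spans a chunk boundary can still match inside one chunk.
    function splitLog(row) {
      var text = row.get('plainText');
      if (text == null) return row;
      var chunkSize = 50000;   // characters per mini-document
      var overlap   = 1000;    // shared text between neighboring chunks
      var rows = new java.util.ArrayList();
      var i = 0;
      for (var start = 0; ; start += chunkSize - overlap) {
        var end = Math.min(start + chunkSize, text.length());
        var chunk = new java.util.HashMap(row);          // keep common fields
        chunk.put('id', row.get('filename') + '#' + i);  // unique key per chunk
        chunk.put('body', text.substring(start, end));
        rows.add(chunk);
        i++;
        if (end >= text.length()) break;
      }
      return rows;  // a List of Maps => one Solr document per chunk
    }
  ]]></script>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/logs" fileName=".*\.log" rootEntity="false">
      <!-- TemplateTransformer is listed first, so splitLog sees 'filename' -->
      <entity name="log" processor="PlainTextEntityProcessor"
              url="${files.fileAbsolutePath}"
              transformer="TemplateTransformer,script:splitLog">
        <field column="filename" template="${files.fileAbsolutePath}"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Because every chunk carries the file name as a common field, highlighting only ever analyzes one small body value, and all chunks of one file can be pulled back with a filename filter query.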
On Aug 5, 2010, at 8:10 PM, Lance Norskog wrote:

> You may have to write your own JavaScript to read in the giant field
> and split it up.
>
> On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam <ps...@mac.com> wrote:
>> I've read through the DataImportHandler page a few times, and still can't
>> figure out how to separate a large document into smaller documents. Any
>> hints? :-) Thanks!
>>
>> -Peter
>>
>> On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:
>>
>>> Spanning won't work - you would have to make overlapping mini-documents
>>> if you want to support this.
>>>
>>> I don't know how big the chunks should be - you'll have to experiment.
>>>
>>> Lance
>>>
>>> On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam <ps...@mac.com> wrote:
>>>> What would happen if the search query phrase spanned separate document
>>>> chunks?
>>>>
>>>> Also, what would the optimal size of chunks be?
>>>>
>>>> Thanks!
>>>>
>>>> -Peter
>>>>
>>>> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
>>>>
>>>>> Not that I know of.
>>>>>
>>>>> The DataImportHandler has the ability to create multiple documents
>>>>> from one input stream. It is possible to create a DIH file that reads
>>>>> large log files and splits each one into N documents, with the file
>>>>> name as a common field. The DIH wiki page tells you in general how to
>>>>> make a DIH file.
>>>>>
>>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>>
>>>>> From this, you should be able to make a DIH file that puts log files
>>>>> in as separate documents. As for splitting files up into
>>>>> mini-documents, you might have to write a bit of JavaScript to achieve
>>>>> this. There is no data structure or software that implements
>>>>> structured documents.
>>>>>
>>>>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>> Thanks for the pointer, Lance! Is there an example of this somewhere?
>>>>>>
>>>>>> -Peter
>>>>>>
>>>>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>>>>>
>>>>>>> Ah! You're not just highlighting, you're snippetizing. This makes it
>>>>>>> easier.
>>>>>>>
>>>>>>> Highlighting does not stream - it pulls the entire stored contents
>>>>>>> into one string and then pulls out the snippet. If you want this to
>>>>>>> be fast, you have to split up the text into small pieces and only
>>>>>>> snippetize from the most relevant text. So, create separate documents
>>>>>>> with a common group id for the document each came from. You might
>>>>>>> have to do 2 queries to achieve what you want, but the second query
>>>>>>> for the same terms will be blindingly fast - often <1ms.
>>>>>>>
>>>>>>> Good luck!
>>>>>>>
>>>>>>> Lance
>>>>>>>
>>>>>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>>>> However, I do need to search the entire document, or else the
>>>>>>>> highlighting will sometimes be blank :-(
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> - Peter
>>>>>>>>
>>>>>>>> ps. Sorry for the many responses - I'm rushing around trying to get
>>>>>>>> this working.
>>>>>>>>
>>>>>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>>>>>>
>>>>>>>>> Correction - it went from 17 seconds to 10 seconds - I was changing
>>>>>>>>> hl.regex.maxAnalyzedChars the first time.
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> -Peter
>>>>>>>>>
>>>>>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>>>>>>
>>>>>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>>>>>>
>>>>>>>>>>> Did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>>>>>>>>
>>>>>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an
>>>>>>>>>> impact (one search I just tried went from 17 seconds to 15.8
>>>>>>>>>> seconds, and this is on an 8-core Mac Pro with 6GB RAM - 4GB for
>>>>>>>>>> Java).
>>>>>>>>>>
>>>>>>>>>>> Also, regular expression highlighting is more expensive, I think.
>>>>>>>>>>> What does the 'fuzzy' variable mean? If you use it to query via
>>>>>>>>>>> "~someTerm" instead of "someTerm", then you should try the trunk
>>>>>>>>>>> of Solr, which is a lot faster for fuzzy and other wildcard
>>>>>>>>>>> searches.
>>>>>>>>>>
>>>>>>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>>>>>>
>>>>>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>>>>>>
>>>>>>>>>> - Peter
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Peter.
>>>>>>>>>>>
>>>>>>>>>>>> Data set: About 4,000 log files (will eventually grow to
>>>>>>>>>>>> millions). The average log file is 850KB; the largest (so far)
>>>>>>>>>>>> is about 70MB.
>>>>>>>>>>>>
>>>>>>>>>>>> Problem: When I search for common terms, the query time goes
>>>>>>>>>>>> from under 2-3 seconds to about 60 seconds. TermVectors etc. are
>>>>>>>>>>>> enabled. When I disable highlighting, performance improves a
>>>>>>>>>>>> lot, but is still slow for some queries (7 seconds). Thanks in
>>>>>>>>>>>> advance for any ideas!
>>>>>>>>>>>>
>>>>>>>>>>>> -Peter
>>>>>>>>>>>>
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> 4GB RAM server
>>>>>>>>>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>>>>>>>>>
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> schema.xml changes:
>>>>>>>>>>>>
>>>>>>>>>>>> <fieldType name="text_pl" class="solr.TextField">
>>>>>>>>>>>>   <analyzer>
>>>>>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>>>>>       generateWordParts="0" generateNumberParts="0"
>>>>>>>>>>>>       catenateWords="0" catenateNumbers="0" catenateAll="0"
>>>>>>>>>>>>       splitOnCaseChange="0"/>
>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>>>>>>>>>>   multiValued="false" termVectors="true" termPositions="true"
>>>>>>>>>>>>   termOffsets="true"/>
>>>>>>>>>>>> <field name="timestamp" type="date" indexed="true" stored="true"
>>>>>>>>>>>>   default="NOW" multiValued="false"/>
>>>>>>>>>>>> <field name="version" type="string" indexed="true" stored="true"
>>>>>>>>>>>>   multiValued="false"/>
>>>>>>>>>>>> <field name="device" type="string" indexed="true" stored="true"
>>>>>>>>>>>>   multiValued="false"/>
>>>>>>>>>>>> <field name="filename" type="string" indexed="true" stored="true"
>>>>>>>>>>>>   multiValued="false"/>
>>>>>>>>>>>> <field name="filesize" type="long" indexed="true" stored="true"
>>>>>>>>>>>>   multiValued="false"/>
>>>>>>>>>>>> <field name="pversion" type="int" indexed="true" stored="true"
>>>>>>>>>>>>   multiValued="false"/>
>>>>>>>>>>>> <field name="first2md5" type="string" indexed="false"
>>>>>>>>>>>>   stored="true" multiValued="false"/>
>>>>>>>>>>>> <field name="ckey" type="string" indexed="true" stored="true"
>>>>>>>>>>>>   multiValued="false"/>
>>>>>>>>>>>>
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>> <dynamicField name="*" type="ignored" multiValued="true"/>
>>>>>>>>>>>> <defaultSearchField>body</defaultSearchField>
>>>>>>>>>>>> <solrQueryParser defaultOperator="AND"/>
>>>>>>>>>>>>
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> solrconfig.xml changes:
>>>>>>>>>>>>
>>>>>>>>>>>> <maxFieldLength>2147483647</maxFieldLength>
>>>>>>>>>>>> <ramBufferSizeMB>128</ramBufferSizeMB>
>>>>>>>>>>>>
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> The query:
>>>>>>>>>>>>
>>>>>>>>>>>> rowStr = "&rows=10"
>>>>>>>>>>>> facet = "&facet=true&facet.limit=10&facet.field=device" +
>>>>>>>>>>>>   "&facet.field=ckey&facet.field=version"
>>>>>>>>>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>>>>>>>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>>>>>>>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>>>>>>>>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>>>>>>>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
>>>>>>>>>>>>   "&hl.regex.slop=1&hl.fragmenter=regex" +
>>>>>>>>>>>>   "&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>>>>>>>>>>> justq = '&q=' + CGI::escape('body:' + fuzzy +
>>>>>>>>>>>>   p['q'].to_s.gsub(/\\/, '').gsub(/([:~!<>="])/,'\\\\\1') + fuzzy +
>>>>>>>>>>>>   minLogSizeStr)
>>>>>>>>>>>>
>>>>>>>>>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' +
>>>>>>>>>>>>   (p['fq'].empty? ? '' : ('&fq=' + p['fq'].to_s)) + justq +
>>>>>>>>>>>>   rowStr + facet + fields + termvectors + hl + hl_regex
>>>>>>>>>>>>
>>>>>>>>>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) +
>>>>>>>>>>>>   '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> http://karussell.wordpress.com/
>>>>>>>
>>>>>>> --
>>>>>>> Lance Norskog
>>>>>>> goks...@gmail.com
>>>>>
>>>>> --
>>>>> Lance Norskog
>>>>> goks...@gmail.com
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>
> --
> Lance Norskog
> goks...@gmail.com
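For contrast with the single-big-document query above, here is a sketch of the two-query pattern Lance describes (Jul 31): query the mini-documents for snippets, then fetch all chunks of the chosen file via the shared filename field. This only builds URL strings; the parameter values mirror Peter's Ruby code above, and the endpoint and field names are assumptions carried over from the DIH sketch near the top of the thread.

// Sketch of the two-query approach, assuming chunked documents that
// share a 'filename' group field.
function buildQueries(userQuery) {
  var q = encodeURIComponent('body:' + userQuery);

  // Query 1: find the best-matching chunks; highlighting now only has to
  // analyze a small chunk body, never a 70MB stored field.
  var snippetQuery = '/solr/select?wt=json&q=' + q +
      '&rows=10&fl=id,score,filename' +
      '&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400';

  // Query 2: after the user picks a hit, pull every chunk of that file.
  // Same q as before, so caches make this repeat query very fast.
  function fileQuery(filename) {
    return '/solr/select?wt=json&q=' + q +
        '&fq=' + encodeURIComponent('filename:"' + filename + '"') +
        '&rows=1000&fl=id,body';
  }

  return { snippets: snippetQuery, forFile: fileQuery };
}

The fq clause keeps the second request cheap and cacheable, which is what makes Lance's "often <1ms" claim plausible for the follow-up query.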