Not that I know of. The DataImportHandler (DIH) can create multiple documents from one input stream, so it is possible to write a DIH config that reads large log files and splits each one into N documents, with the file name as a common field. The DIH wiki page explains in general how to make a DIH file: http://wiki.apache.org/solr/DataImportHandler
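
For example, a data-config.xml along these lines should get you close. This is an untested sketch - the directory, the field names, and the one-document-per-line granularity are assumptions on my part, not something you've specified:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Outer entity: one row per log file; rootEntity="false" so the
         files themselves do not become documents. -->
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="/var/log/myapp" fileName=".*\.log"
            recursive="false" rootEntity="false" dataSource="null">
      <!-- Inner entity: one document per line of the current file,
           with the file name stamped on each as the common field. -->
      <entity name="line" processor="LineEntityProcessor"
              url="${f.fileAbsolutePath}"
              transformer="TemplateTransformer">
        <field column="rawLine" name="body"/>
        <field column="filename" template="${f.file}"/>
      </entity>
    </entity>
  </document>
</dataConfig>

You would still need to generate a unique id for each mini-document (e.g. with another transformer), and if you want mini-documents larger than one line, a ScriptTransformer on the inner entity is where the "bit of JavaScript" mentioned below would go.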
From this, you should be able to make a DIH file that puts log files in
as separate documents. As for splitting each file up into mini-documents,
you might have to write a bit of JavaScript to achieve this. There is no
built-in data structure or feature that implements structured documents.

On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
> Thanks for the pointer, Lance! Is there an example of this somewhere?
>
> -Peter
>
> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>
>> Ah! You're not just highlighting, you're snippetizing. That makes it
>> easier.
>>
>> Highlighting does not stream - it pulls the entire stored contents into
>> one string and then pulls the snippet out of that. If you want this to
>> be fast, you have to split the text up into small pieces and only
>> snippetize from the most relevant ones. So: separate documents with a
>> common group id identifying the file each one came from. You might have
>> to do 2 queries to achieve what you want, but the second query, for the
>> same terms, will be blindingly fast. Often <1ms.
>>
>> Good luck!
>>
>> Lance
>>
>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
>>> However, I do need to search the entire document, or else the
>>> highlighting will sometimes be blank :-(
>>> Thanks!
>>>
>>> - Peter
>>>
>>> ps. Sorry for the many responses - I'm rushing around trying to get
>>> this working.
>>>
>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>
>>>> Correction - it went from 17 seconds to 10 seconds - I was changing
>>>> hl.regex.maxAnalyzedChars the first time.
>>>> Thanks!
>>>>
>>>> -Peter
>>>>
>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>
>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>
>>>>>> Did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>>>
>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an
>>>>> impact (one search I just tried went from 17 seconds to 15.8 seconds,
>>>>> and this is an 8-core Mac Pro with 6GB RAM - 4GB for Java).
>>>>>
>>>>>> Also, regular expression highlighting is more expensive, I think.
>>>>>> What does the 'fuzzy' variable mean? If you use it to query via
>>>>>> "~someTerm" instead of "someTerm", then you should try the trunk of
>>>>>> Solr, which is a lot faster for fuzzy and other wildcard searches.
>>>>>
>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>
>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>
>>>>> - Peter
>>>>>
>>>>>> Regards,
>>>>>> Peter.
>>>>>>
>>>>>>> Data set: About 4,000 log files (will eventually grow to millions).
>>>>>>> The average log file is 850KB; the largest (so far) is about 70MB.
>>>>>>>
>>>>>>> Problem: When I search for common terms, the query time goes from
>>>>>>> under 2-3 seconds to about 60 seconds. TermVectors etc. are enabled.
>>>>>>> When I disable highlighting, performance improves a lot, but is
>>>>>>> still slow for some queries (7 seconds). Thanks in advance for any
>>>>>>> ideas!
>>>>>>>
>>>>>>> -Peter
>>>>>>>
>>>>>>> -------------------------------------------------------------------
>>>>>>>
>>>>>>> 4GB RAM server
>>>>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>>>>
>>>>>>> -------------------------------------------------------------------
>>>>>>>
>>>>>>> schema.xml changes:
>>>>>>>
>>>>>>> <fieldType name="text_pl" class="solr.TextField">
>>>>>>>   <analyzer>
>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>             generateWordParts="0" generateNumberParts="0"
>>>>>>>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>>>>>>>             splitOnCaseChange="0"/>
>>>>>>>   </analyzer>
>>>>>>> </fieldType>
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>>>>>        multiValued="false" termVectors="true" termPositions="true"
>>>>>>>        termOffsets="true"/>
>>>>>>> <field name="timestamp" type="date" indexed="true" stored="true"
>>>>>>>        default="NOW" multiValued="false"/>
>>>>>>> <field name="version" type="string" indexed="true" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>> <field name="device" type="string" indexed="true" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>> <field name="filename" type="string" indexed="true" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>> <field name="filesize" type="long" indexed="true" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>> <field name="pversion" type="int" indexed="true" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>> <field name="first2md5" type="string" indexed="false" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>> <field name="ckey" type="string" indexed="true" stored="true"
>>>>>>>        multiValued="false"/>
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> <dynamicField name="*" type="ignored" multiValued="true"/>
>>>>>>> <defaultSearchField>body</defaultSearchField>
>>>>>>> <solrQueryParser defaultOperator="AND"/>
>>>>>>>
>>>>>>> -------------------------------------------------------------------
>>>>>>>
>>>>>>> solrconfig.xml changes:
>>>>>>>
>>>>>>> <maxFieldLength>2147483647</maxFieldLength>
>>>>>>> <ramBufferSizeMB>128</ramBufferSizeMB>
>>>>>>>
>>>>>>> -------------------------------------------------------------------
>>>>>>>
>>>>>>> The query:
>>>>>>>
>>>>>>> rowStr = "&rows=10"
>>>>>>> facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>>>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>>>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
>>>>>>>   "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>>>>>> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/,
>>>>>>>   '').gsub(/([:~!<>="])/, '\\\\\1') + fuzzy + minLogSizeStr)
>>>>>>>
>>>>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ?
>>>>>>>   '' : ('&fq=' + p['fq'].to_s)) + justq + rowStr + facet + fields +
>>>>>>>   termvectors + hl + hl_regex
>>>>>>>
>>>>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) +
>>>>>>>   '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> http://karussell.wordpress.com/
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>
>

--
Lance Norskog
goks...@gmail.com
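
For reference, the two-query pattern Lance describes above might look roughly like this once each log file is indexed as many small documents sharing a common "filename" field. The URLs and parameter choices here are illustrative assumptions, not something from the thread:

# 1) Find which files match; no highlighting, so this stays cheap:
/solr/select?q=body:someTerm&fl=filename,score&rows=10&facet=true&facet.field=filename

# 2) Re-run the same terms against one file's mini-documents and
#    snippetize only those few kilobytes:
/solr/select?q=body:someTerm&fq=filename:foo.log&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400&rows=10

Because the highlighter now pulls a small stored chunk into memory instead of a stored field of up to 70MB, the second query is the blindingly fast one.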