Re: Stored vs non-stored very large text fields

Jochen Barth Tue, 29 Apr 2014 03:22:27 -0700

BTW, regarding stored field compression:
are all stored fields of a document put into one compressed chunk,
or is each field compressed separately?


Kind regards,
J. Barth



> 
> Regards,
>    Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr 
> proficiency
> 
> 
> On Tue, Apr 29, 2014 at 3:28 PM, Jochen Barth
> <ba...@ub.uni-heidelberg.de> wrote:
>> Dear reader,
>>
>> I'm trying to use solr for a hierarchical search:
>> metadata from the higher-level elements is copied to the lower ones,
>> and each element has the complete ocr text that belongs to it.
>>
>> At volume level, of course, we will have the complete ocr text in one
>> <doc> and we need to store it for highlighting.
>>
>> My solr instance is configured like this:
>> java -Xms12000m -Xmx12000m -jar start.jar
>> [ imported with 4.7.0, performance tests with 4.8.0 ]
>>
>> Solr index files are of this size:
>>   0.013gb .tip The index into the Term Dictionary
>>   0.017gb .nvd Encodes length and boost factors for docs and fields
>>   0.546gb .tim The term dictionary, stores term info
>>   1.332gb .doc Contains the list of docs which contain each term along
>> with frequency
>>   4.943gb .pos Stores position information about where a term occurs in
>> the index
>>  12.743gb .tvd Contains information about each document that has term
>> vectors
>>  17.340gb .fdt The stored fields for documents "ocr"
>>
>> With the ocr field configured as non-stored, I get these performance
>> figures (see docs/s) after warmup:
>>
>> jb@serv7:~> perl solr-performance.pl zeit 6
>> http://127.0.0.1:58983/solr/collection1/select
>> ?wt=json
>> &q={%21q.op%3dAND}ocr%3A%28zeit%29
>> &fq=mashed_b%3Afalse
>> &fl=id
>> &sort=sort_name_s asc,id+asc
>> &rows=1000000
>> time: 3.96 s
>> bytes: 1.878 MB
>> 64768 docs found; got 64768 docs
>> 16353 docs/s; 0.474 MB/s
>>
>> ... and with ocr stored, even when _not_ requesting ocr via fl=..., with
>> <documentCache class="solr.LRUCache" ... /> disabled and
>> <enableLazyFieldLoading>false</enableLazyFieldLoading>
>> [ with <documentCache> and <enableLazyFieldLoading> enabled, results are even worse ]
>>
>> ... using solr-4.7.0 and ubuntu12.04 openjdk7 (...u51):
>> jb@serv7:~> perl solr-performance.pl zeit 6
>> http://127.0.0.1:58983/solr/collection1/select
>> ?wt=json
>> &q={%21q.op%3dAND}ocr%3A%28zeit%29
>> &fq=mashed_b%3Afalse
>> &fl=id
>> &sort=sort_name_s asc,id+asc
>> &rows=1000000
>> time: 61.58 s
>> bytes: 1.878 MB
>> 64768 docs found; got 64768 docs
>> 1052 docs/s; 0.030 MB/s
>>
>> ... using solr-4.8.0 and oracle-jdk1.7.0_55 :
>> jb@serv7:~> perl solr-performance.pl zeit 6
>> http://127.0.0.1:58983/solr/collection1/select
>> ?wt=json&q={%21q.op%3dAND}ocr%3A%28zeit%29
>> &fq=mashed_b%3Afalse
>> &fl=id
>> &sort=sort_name_s asc,id+asc
>> &rows=1000000
>> time: 58.80 s
>> bytes: 1.878 MB
>> 64768 docs found; got 64768 docs
>> 1102 docs/s; 0.032 MB/s
>>
>> Is there any reason why the stored configuration is about 16 times
>> slower than the non-stored one?
>> Is there a way to store the ocr field in a separate index or something
>> like this?
>>
>> Kind regards,
>> J. Barth
>>
>>
>>
>>
>> --
>> J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580
>>
>> pgp public key:
>> http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc

-- 
J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580

pgp public key:
http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc
