: We are using Solr 7.1.0 to index a database of addresses.  We have found 
: that our index size increases massively when we add one extra field to 
: the index, even though that field is stored and not indexed, and doesn’t 

what about docValues?

: When we run an index load without the problematic field present, the 
: Solr index size is 5.5GB.  When we add the field into the index, the 
: size grows to 13.3GB.  The field itself is a maximum of 46 characters in 
: length and on average is 19 characters. We have ~14,000,000 rows in 
: total to index of which only ~200,000 have this field present at all 
: (i.e. not null in database).  Given that we don’t want to index the 
: field, only store it I would have thought (perhaps naively) that the 
: storage increase would be approximately 200,000 * 19 = 3.8M bytes = 
: 3.6MB rather than the 7.5GB we are seeing.

If the field has docValues enabled, then there will be some overhead for 
every doc in the index -- even the ones that don't have a value in this 
field.  (Although I'd still be very surprised if it accounted for 7G.)

: - The problematic field is created through the API as follows:
: 
:   curl -X POST -H 'Content-type:application/json' --data-binary '{
:     "add-field":{
:       "name":"buildingName",
:       "type":"string",
:       "stored":true,
:       "indexed":false
:     }
:   }' http://localhost:8983/solr/address/schema

...that's going to cause the field to inherit any (non-overridden) 
settings from the fieldType "string" -- in the 7.1 _default configset, 
"string" is defined with docValues="true", so unless you changed that 
fieldType, your buildingName field has docValues enabled as well.
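
If docValues does turn out to be the culprit, one option is to override it 
explicitly on the field instead of inheriting it from the fieldType.  A 
minimal sketch, assuming the same collection ("address") and host as in 
your request; the variable and function names are made up for 
illustration, and I haven't run this against your setup:

```shell
# Same field definition as in the original request, plus an explicit
# docValues override so the "string" fieldType setting is not inherited.
payload='{
  "add-field":{
    "name":"buildingName",
    "type":"string",
    "stored":true,
    "indexed":false,
    "docValues":false
  }
}'

# Wrapped in a function so nothing is sent to Solr until you call it.
# Assumes Solr is listening on localhost:8983 with a collection "address".
add_building_name_field() {
  curl -X POST -H 'Content-type:application/json' \
       --data-binary "$payload" \
       http://localhost:8983/solr/address/schema
}
```

If the field already exists in the schema, the Schema API's "replace-field" 
command would be needed in place of "add-field", and the data has to be 
reindexed for the change to take effect.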

You can see *all* properties set on a field -- regardless of whether they 
are set on the fieldType, or are implicit hardcoded defaults in the 
implementation of the fieldType -- via the 'showDefaults=true' Schema API 
option.

Consider these API examples from the techproducts demo...

$ curl 'http://localhost:8983/solr/techproducts/schema/fields/cat'
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"cat",
    "type":"string",
    "multiValued":true,
    "indexed":true,
    "stored":true}}

$ curl 'http://localhost:8983/solr/techproducts/schema/fields/cat?showDefaults=true'
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"cat",
    "type":"string",
    "indexed":true,
    "stored":true,
    "docValues":false,
    "termVectors":false,
    "termPositions":false,
    "termOffsets":false,
    "termPayloads":false,
    "omitNorms":true,
    "omitTermFreqAndPositions":true,
    "omitPositions":false,
    "storeOffsetsWithPositions":false,
    "multiValued":true,
    "large":false,
    "sortMissingLast":true,
    "required":false,
    "tokenized":false,
    "useDocValuesAsStored":true}}
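
The same check can be pointed at your own field.  A rough sketch, assuming 
the "address" collection and "buildingName" field from your message; the 
helper names are hypothetical, and the grep pattern relies on the response 
being formatted like the output above (no space after the colon):

```shell
# Pull the docValues setting out of a Schema API field response on stdin.
extract_docvalues() {
  grep -o '"docValues":[a-z]*'
}

# Ask Solr for the effective settings of buildingName and print only the
# docValues line.  Assumes Solr on localhost:8983, collection "address".
show_building_name_docvalues() {
  curl -s 'http://localhost:8983/solr/address/schema/fields/buildingName?showDefaults=true' \
    | extract_docvalues
}
```

Calling show_building_name_docvalues should print either "docValues":true 
or "docValues":false, depending on what the field actually inherited.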


-Hoss
http://www.lucidworks.com/
