I think I have found something concrete. Reading up more on the .nvd file extension, I found that it is used to store length and boost factors for documents and fields. These are the normalization (norms) files. Normalization on a field is controlled by the omitNorms attribute: if omitNorms=true, the field will not be normalized. I explicitly added omitNorms=true to the field type text_general and re-indexed the data. Now my index size is much smaller. I haven't verified this with the complete data set yet, but I can already see that the index size is reduced. We have a large data set and it takes about 5-6 hours to index it completely, so I'll index the whole data set overnight to confirm the fix.
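For anyone following along, the change amounts to something like the sketch below. This is only an illustration of adding the attribute at the field type level; the analyzer chains are left out and are assumed to be unchanged from the text_general definition quoted further down in this thread.

```xml
<!-- Sketch only: the same text_general type, with norms disabled at the type level. -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" omitNorms="true">
  <!-- index and query analyzers assumed unchanged; omitted here for brevity -->
</fieldType>
```

As far as I understand, the same attribute can also be set on an individual field or dynamicField, where it overrides the value from the field type, and the data has to be re-indexed for the change to take effect.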
But now I am curious about the omitNorms attribute. What is the default value of omitNorms for the field type "text_general"? The documentation says that omitNorms=true for primitive field types like string, int etc., but I don't know what the default value is for "text_general". I never had omitNorms set explicitly on the text_general field type or on any of the fields of type text_general. Has the default value of omitNorms changed between solr 5.0.0 and 6.4.1? Any clarification on this would be really helpful.

I am posting some relevant links here for anyone who might face a similar issue in the future.

http://apprize.info/php/solr_4/2.html
http://stackoverflow.com/questions/18694242/what-is-omitnorms-and-version-field-in-solr-schema
https://lucidworks.com/2009/09/02/scaling-lucene-and-solr/#d0e71

Thanks,
Pratik

On Tue, Feb 21, 2017 at 12:03 PM, Pratik Patel <pra...@semandex.net> wrote:

> I am using the schema from solr 5 which does not have any field with
> docValues enabled. In fact to ensure that everything is same as solr 5
> (except the breaking changes) I am using the solrconfig.xml also from solr
> 5 with schemaFactory set as classicSchemaFactory to use schema.xml from
> solr 5.
>
>
> On Tue, Feb 21, 2017 at 11:33 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>
>> Did you reuse the schema or rebuilt it on top of the latest examples?
>> Because the latest example schema enabled docValues for strings on the
>> fieldType level.
>>
>> I would do a diff of the schemas to see what changed. If they look
>> very different and you are looking for tools to normalize/extract
>> elements from schemas, you may find my latest Revolution presentation
>> useful for that:
>> https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution-2016
>> (e.g. slide 20). There is also the video there at the end.
>>
>> Regards,
>> Alex.
>> ----
>> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>>
>>
>> On 21 February 2017 at 11:18, Mike Thomsen <mikerthom...@gmail.com> wrote:
>> > Correct me if I'm wrong, but heavy use of doc values should actually blow
>> > up the size of your index considerably if they are in fields that get sent
>> > a lot of data.
>> >
>> > On Tue, Feb 21, 2017 at 10:50 AM, Pratik Patel <pra...@semandex.net> wrote:
>> >
>> >> Thanks for the reply. I can see that in solr 6, more than 50% of the index
>> >> directory is occupied by ".nvd" file extension. It is something related to
>> >> norms and doc values.
>> >>
>> >> On Tue, Feb 21, 2017 at 10:27 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>> >>
>> >> > Did you look in the data directories to check what index file extensions
>> >> > contribute most to the difference? That could give a hint.
>> >> >
>> >> > Regards,
>> >> > Alex
>> >> >
>> >> > On 21 Feb 2017 9:47 AM, "Pratik Patel" <pra...@semandex.net> wrote:
>> >> >
>> >> > > Here is the same question in stackOverflow for better format.
>> >> > >
>> >> > > http://stackoverflow.com/questions/42370231/solr-dynamic-field-blowing-up-the-index-size
>> >> > >
>> >> > > Recently, I upgraded from solr 5.0 to solr 6.4.1. I can run my app fine but
>> >> > > the problem is that index size with solr 6 is way too large. In solr 5,
>> >> > > index size was about 15GB and in solr 6, for the same data, the index size
>> >> > > is 300GB! I am not able to understand what contributes to such huge
>> >> > > difference in solr 6.
>> >> > >
>> >> > > I have been able to identify a field which is blowing up the size of index.
>> >> > > It is as follows.
>> >> > >
>> >> > > <dynamicField name="*_note" type="text_general" indexed="true"
>> >> > >     stored="true" multiValued="true" />
>> >> > >
>> >> > > <field name="textproperty" type="text_general" indexed="true"
>> >> > >     stored="false" multiValued="true" />
>> >> > > <copyField source="*_note" dest="textproperty"/>
>> >> > >
>> >> > > When this field is commented out, the index size reduces to less than 10GB.
>> >> > >
>> >> > > This field is of type text_general. Following is the definition of this
>> >> > > type.
>> >> > >
>> >> > > <fieldType name="text_general" class="solr.TextField"
>> >> > >     positionIncrementGap="100">
>> >> > >   <analyzer type="index">
>> >> > >     <charFilter class="solr.HTMLStripCharFilterFactory" />
>> >> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
>> >> > >     <filter class="solr.LowerCaseFilterFactory"/>
>> >> > >     <charFilter class="solr.PatternReplaceCharFilterFactory"
>> >> > >         pattern="((?m)[a-z]+)'s" replacement="$1s" />
>> >> > >     <filter class="solr.WordDelimiterFilterFactory"
>> >> > >         protected="protwords.txt" generateWordParts="1"
>> >> > >         generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> >> > >         catenateAll="0" splitOnCaseChange="0"/>
>> >> > >     <filter class="solr.KStemFilterFactory" />
>> >> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >> > >         words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
>> >> > >         />
>> >> > >   </analyzer>
>> >> > >   <analyzer type="query">
>> >> > >     <charFilter class="solr.HTMLStripCharFilterFactory" />
>> >> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
>> >> > >     <filter class="solr.LowerCaseFilterFactory"/>
>> >> > >     <charFilter class="solr.PatternReplaceCharFilterFactory"
>> >> > >         pattern="((?m)[a-z]+)'s" replacement="$1s" />
>> >> > >     <filter class="solr.WordDelimiterFilterFactory"
>> >> > >         protected="protwords.txt" generateWordParts="1"
>> >> > >         generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> >> > >         catenateAll="0" splitOnCaseChange="0"/>
>> >> > >     <filter class="solr.KStemFilterFactory" />
>> >> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >> > >         words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
>> >> > >         />
>> >> > >   </analyzer>
>> >> > > </fieldType>
>> >> > >
>> >> > > Few things which I did to debug this issue:
>> >> > >
>> >> > >    - I have ensured that field type definition is same as what I was using
>> >> > >    in solr 5 and it is also valid in version 6. This field type considers a
>> >> > >    list of "stopwords" to be ignored during indexing. I have supplied the same
>> >> > >    list of stopwords which we were using in solr 5. I have verified that the path
>> >> > >    of this file is correct and it is being loaded fine in solr admin UI. When
>> >> > >    I analyse these fields using "Analysis" tab of the solr admin UI, I can see
>> >> > >    that stopwords are being filtered out. However, when I query with some of
>> >> > >    these stopwords, I do get the results back which makes me think that
>> >> > >    probably stopwords are being indexed.
>> >> > >
>> >> > > Any idea what could increase the size of index by so much in solr 6?