I think I have found something concrete. Reading up on the .nvd file
extension, I found that it is used to store length and boost factors for
documents and fields. These are the norms (normalization) files. Norms on a
field are controlled by the omitNorms attribute: if omitNorms=true, the
field will not be normalized. I explicitly added omitNorms=true to the
field type text_general and re-indexed the data. Now my index size is much
smaller. I haven't verified this with the complete data set yet, but I can
already see that the index size is reduced. We have a large data set and it
takes about 5-6 hours to index it completely, so I'll index the whole data
set overnight to confirm the fix.
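
For reference, the change amounts to a single attribute on the field type. A minimal sketch, assuming the analyzer chains stay exactly as in the definition quoted further down in this thread (they are replaced by a comment here):

```xml
<!-- omitNorms="true" disables norms for every field of this type,
     so no length-normalization data is written for them -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" omitNorms="true">
  <!-- index and query analyzers unchanged from the original definition -->
</fieldType>
```

Omitting norms also removes length normalization and index-time boosts from scoring on these fields, so relevance ranking may shift; that trade-off is worth checking before relying on the smaller index.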

But now I am curious about the omitNorms attribute. What is the default
value of omitNorms for the field type "text_general"? The documentation says
that omitNorms defaults to true for primitive field types like string, int,
etc., but I don't know what the default is for "text_general".

I never had omitNorms set explicitly on the text_general field type or on
any of the fields of type text_general. Has the default value of omitNorms
changed between solr 5.0.0 and 6.4.1?

Any clarification on this would be really helpful.

I am posting some relevant links here for anyone who might face a similar
issue in the future.

http://apprize.info/php/solr_4/2.html
http://stackoverflow.com/questions/18694242/what-is-omitnorms-and-version-field-in-solr-schema
https://lucidworks.com/2009/09/02/scaling-lucene-and-solr/#d0e71

Thanks,
Pratik

On Tue, Feb 21, 2017 at 12:03 PM, Pratik Patel <pra...@semandex.net> wrote:

> I am using the schema from solr 5, which does not have any field with
> docValues enabled. In fact, to ensure that everything is the same as in
> solr 5 (except the breaking changes), I am also using the solrconfig.xml
> from solr 5, with schemaFactory set to ClassicIndexSchemaFactory so that
> schema.xml from solr 5 is used.
>
>
> On Tue, Feb 21, 2017 at 11:33 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>
>> Did you reuse the schema or rebuilt it on top of the latest examples?
>> Because the latest example schema enabled docValues for strings on the
>> fieldType level.
>>
>> I would do a diff of the schemas to see what changed. If they look
>> very different and you are looking for tools to normalize/extract
>> elements from schemas, you may find my latest Revolution presentation
>> useful for that:
>> https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution-2016
>> (e.g. slide 20). There is also the video there at the end.
>>
>> Regards,
>>    Alex.
>> ----
>> http://www.solr-start.com/ - Resources for Solr users, new and
>> experienced
>>
>>
>> On 21 February 2017 at 11:18, Mike Thomsen <mikerthom...@gmail.com> wrote:
>> > Correct me if I'm wrong, but heavy use of doc values should actually blow
>> > up the size of your index considerably if they are in fields that get sent
>> > a lot of data.
>> >
>> > On Tue, Feb 21, 2017 at 10:50 AM, Pratik Patel <pra...@semandex.net> wrote:
>> >
>> >> Thanks for the reply. I can see that in solr 6, more than 50% of the index
>> >> directory is occupied by files with the ".nvd" extension. It is something
>> >> related to norms and doc values.
>> >>
>> >> On Tue, Feb 21, 2017 at 10:27 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>> >>
>> >> > Did you look in the data directories to check what index file extensions
>> >> > contribute most to the difference? That could give a hint.
>> >> >
>> >> > Regards,
>> >> >     Alex
>> >> >
>> >> > On 21 Feb 2017 9:47 AM, "Pratik Patel" <pra...@semandex.net> wrote:
>> >> >
>> >> > > Here is the same question on Stack Overflow, where it is better formatted.
>> >> > >
>> >> > > http://stackoverflow.com/questions/42370231/solr-dynamic-field-blowing-up-the-index-size
>> >> > >
>> >> > > Recently, I upgraded from solr 5.0 to solr 6.4.1. I can run my app fine, but
>> >> > > the problem is that the index size with solr 6 is way too large. In solr 5,
>> >> > > the index size was about 15GB and in solr 6, for the same data, the index size
>> >> > > is 300GB! I am not able to understand what contributes to such a huge
>> >> > > difference in solr 6.
>> >> > >
>> >> > > I have been able to identify a field which is blowing up the size of the index.
>> >> > > It is as follows.
>> >> > >
>> >> > > <dynamicField name="*_note" type="text_general" indexed="true"
>> >> > > stored="true" multiValued="true"  />
>> >> > >
>> >> > > <field name="textproperty" type="text_general" indexed="true"
>> >> > > stored="false" multiValued="true"  />
>> >> > > <copyField source="*_note" dest="textproperty"/>
>> >> > >
>> >> > > When this field is commented out, the index size reduces to less than 10GB.
>> >> > >
>> >> > > This field is of type text_general. Following is the definition of this
>> >> > > type.
>> >> > >
>> >> > > <fieldType name="text_general" class="solr.TextField"
>> >> > > positionIncrementGap="100">
>> >> > >       <analyzer type="index">
>> >> > >         <charFilter class="solr.HTMLStripCharFilterFactory" />
>> >> > >         <tokenizer class="solr.StandardTokenizerFactory"/>
>> >> > >         <filter class="solr.LowerCaseFilterFactory"/>
>> >> > >         <charFilter class="solr.PatternReplaceCharFilterFactory"
>> >> > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
>> >> > >         <filter class="solr.WordDelimiterFilterFactory"
>> >> > > protected="protwords.txt" generateWordParts="1"
>> >> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> >> > > catenateAll="0" splitOnCaseChange="0"/>
>> >> > >         <filter class="solr.KStemFilterFactory" />
>> >> > >         <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >> > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
>> >> > > />
>> >> > >       </analyzer>
>> >> > >       <analyzer type="query">
>> >> > >         <charFilter class="solr.HTMLStripCharFilterFactory" />
>> >> > >         <tokenizer class="solr.StandardTokenizerFactory"/>
>> >> > >         <filter class="solr.LowerCaseFilterFactory"/>
>> >> > >         <charFilter class="solr.PatternReplaceCharFilterFactory"
>> >> > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
>> >> > >         <filter class="solr.WordDelimiterFilterFactory"
>> >> > > protected="protwords.txt" generateWordParts="1"
>> >> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> >> > > catenateAll="0" splitOnCaseChange="0"/>
>> >> > >         <filter class="solr.KStemFilterFactory" />
>> >> > >         <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >> > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
>> >> > > />
>> >> > >       </analyzer>
>> >> > >   </fieldType>
>> >> > >
>> >> > > A few things I did to debug this issue:
>> >> > >
>> >> > >    - I have ensured that the field type definition is the same as what I was
>> >> > >    using in solr 5 and that it is also valid in version 6. This field type
>> >> > >    considers a list of "stopwords" to be ignored during indexing. I have
>> >> > >    supplied the same list of stopwords which we were using in solr 5. I have
>> >> > >    verified that the path of this file is correct and that it is being loaded
>> >> > >    fine in the solr admin UI. When I analyse these fields using the "Analysis"
>> >> > >    tab of the solr admin UI, I can see that stopwords are being filtered out.
>> >> > >    However, when I query with some of these stopwords, I do get results back,
>> >> > >    which makes me think that stopwords are probably being indexed.
>> >> > >
>> >> > > Any idea what could increase the size of the index by so much in solr 6?
>> >> > >
>> >> >
>> >>
>>
>
>
