Re: Fwd: Solr dynamic field blowing up the index size

Pratik Patel Tue, 21 Feb 2017 09:04:09 -0800

I am using the schema from Solr 5, which does not have any field with
docValues enabled. In fact, to ensure that everything is the same as in
Solr 5 (except for the breaking changes), I am also using the solrconfig.xml
from Solr 5, with schemaFactory set to ClassicIndexSchemaFactory so that it
uses the schema.xml from Solr 5.


On Tue, Feb 21, 2017 at 11:33 AM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> Did you reuse the schema or rebuild it on top of the latest examples?
> Because the latest example schema enables docValues for strings at the
> fieldType level.
>
> I would do a diff of the schemas to see what changed. If they look
> very different and you are looking for tools to normalize/extract
> elements from schemas, you may find my latest Revolution presentation
> useful for that:
> https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution-2016
> (e.g. slide 20). There is also the video there at the end.
>
> Regards,
>    Alex.
> ----
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 21 February 2017 at 11:18, Mike Thomsen <mikerthom...@gmail.com> wrote:
> > Correct me if I'm wrong, but heavy use of doc values should actually blow
> > up the size of your index considerably if they are in fields that get sent
> > a lot of data.
> >
> > On Tue, Feb 21, 2017 at 10:50 AM, Pratik Patel <pra...@semandex.net> wrote:
> >
> >> Thanks for the reply. I can see that in Solr 6, more than 50% of the
> >> index directory is occupied by files with the ".nvd" extension. It is
> >> something related to norms and doc values.
> >>
> >> On Tue, Feb 21, 2017 at 10:27 AM, Alexandre Rafalovitch <
> >> arafa...@gmail.com>
> >> wrote:
> >>
> >> > Did you look in the data directories to check what index file
> >> > extensions contribute most to the difference? That could give a hint.
> >> >
> >> > Regards,
> >> >     Alex
> >> >
> >> > On 21 Feb 2017 9:47 AM, "Pratik Patel" <pra...@semandex.net> wrote:
> >> >
> >> > > Here is the same question on Stack Overflow, with better formatting.
> >> > >
> >> > > http://stackoverflow.com/questions/42370231/solr-dynamic-field-blowing-up-the-index-size
> >> > >
> >> > > Recently, I upgraded from Solr 5.0 to Solr 6.4.1. My app runs fine,
> >> > > but the index with Solr 6 is far too large. In Solr 5, the index
> >> > > size was about 15GB; in Solr 6, for the same data, the index size
> >> > > is 300GB! I cannot understand what contributes to such a huge
> >> > > difference in Solr 6.
> >> > >
> >> > > I have been able to identify a field which is blowing up the size of
> >> > > the index. It is as follows.
> >> > >
> >> > > <dynamicField name="*_note" type="text_general" indexed="true"
> >> > > stored="true" multiValued="true"  />
> >> > >
> >> > > <field name="textproperty" type="text_general" indexed="true"
> >> > > stored="false" multiValued="true"  />
> >> > > <copyField source="*_note" dest="textproperty"/>
> >> > >
> >> > > When this field is commented out, the index size reduces to less
> >> > > than 10GB.
> >> > >
> >> > > This field is of type text_general. The definition of this type
> >> > > follows.
> >> > >
> >> > > <fieldType name="text_general" class="solr.TextField"
> >> > > positionIncrementGap="100">
> >> > >       <analyzer type="index">
> >> > >         <charFilter class="solr.HTMLStripCharFilterFactory" />
> >> > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> >> > >         <filter class="solr.LowerCaseFilterFactory"/>
> >> > >         <charFilter class="solr.PatternReplaceCharFilterFactory"
> >> > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
> >> > >         <filter class="solr.WordDelimiterFilterFactory"
> >> > > protected="protwords.txt" generateWordParts="1"
> >> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >> > > catenateAll="0" splitOnCaseChange="0"/>
> >> > >         <filter class="solr.KStemFilterFactory" />
> >> > >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
> >> > > />
> >> > >       </analyzer>
> >> > >       <analyzer type="query">
> >> > >         <charFilter class="solr.HTMLStripCharFilterFactory" />
> >> > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> >> > >         <filter class="solr.LowerCaseFilterFactory"/>
> >> > >         <charFilter class="solr.PatternReplaceCharFilterFactory"
> >> > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
> >> > >         <filter class="solr.WordDelimiterFilterFactory"
> >> > > protected="protwords.txt" generateWordParts="1"
> >> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >> > > catenateAll="0" splitOnCaseChange="0"/>
> >> > >         <filter class="solr.KStemFilterFactory" />
> >> > >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
> >> > > />
> >> > >       </analyzer>
> >> > >   </fieldType>
> >> > >
> >> > > A few things I did to debug this issue:
> >> > >
> >> > >    - I have ensured that the field type definition is the same as the
> >> > >    one I was using in Solr 5 and that it is also valid in version 6.
> >> > >    This field type takes a list of stopwords to be ignored during
> >> > >    indexing. I have supplied the same list of stopwords that we were
> >> > >    using in Solr 5. I have verified that the path of this file is
> >> > >    correct and that it loads fine in the Solr admin UI. When I analyse
> >> > >    these fields using the "Analysis" tab of the Solr admin UI, I can
> >> > >    see that stopwords are being filtered out. However, when I query
> >> > >    with some of these stopwords, I do get results back, which makes me
> >> > >    think that the stopwords are probably being indexed.
> >> > >
> >> > > Any idea what could increase the size of the index by so much in
> >> > > Solr 6?
> >> > >
> >> >
> >>
>
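
The check suggested in the thread, looking at which index file extensions
contribute most to the size difference, can be sketched as a small script.
This is a minimal sketch and was not posted by anyone in the thread: the
function names size_by_extension and report are hypothetical, and the index
directory path is an assumption that has to be supplied on the command line
(for a standalone core it is typically the data/index directory).

```python
import os
import sys
from collections import defaultdict


def size_by_extension(index_dir):
    """Return {extension: total_bytes} for files directly inside index_dir."""
    totals = defaultdict(int)
    for entry in os.scandir(index_dir):
        if entry.is_file():
            # Files without an extension (e.g. segments_N) are grouped together.
            ext = os.path.splitext(entry.name)[1] or "(none)"
            totals[ext] += entry.stat().st_size
    return dict(totals)


def report(totals):
    """Return formatted lines, largest extension first, with share of total."""
    grand_total = sum(totals.values()) or 1  # avoid division by zero
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [
        f"{ext:>8}  {size:>14,d} bytes  {100 * size / grand_total:5.1f}%"
        for ext, size in ordered
    ]


# Hypothetical usage: python index_sizes.py /path/to/core/data/index
if __name__ == "__main__" and len(sys.argv) > 1:
    for line in report(size_by_extension(sys.argv[1])):
        print(line)
```

Running it against both the Solr 5 and the Solr 6 index directories and
comparing the two listings shows which file types account for the growth.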
