OK, now the situation is clearer! Can you check the stored data size, as Daniel correctly suggested?
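One rough way to do that check is to total the files in the core's index directory by extension. The sketch below is only an illustration, not a Solr tool: the function names `size_by_extension` and `report` are made up for this mail, and the directory you pass in (usually the core's data/index folder) is an assumption about your install. Stored field data lives in the .fdt files, so that is the row to look at first.

```python
import os
import sys
from collections import defaultdict


def size_by_extension(index_dir):
    """Sum the sizes (in bytes) of the files in index_dir, grouped by extension.

    Files without an extension (e.g. segments_N) are keyed by their full name.
    Subdirectories are ignored.
    """
    totals = defaultdict(int)
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if os.path.isfile(path):
            ext = os.path.splitext(name)[1] or name
            totals[ext] += os.path.getsize(path)
    return dict(totals)


def report(index_dir):
    """Print one line per extension, largest first, with its share of the total."""
    totals = size_by_extension(index_dir)
    grand_total = sum(totals.values()) or 1  # avoid dividing by zero on an empty dir
    for ext, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        share = 100.0 * size / grand_total
        print(f"{ext:<14} {size / 1024 ** 2:12.1f} MB {share:6.1f}%")


if __name__ == "__main__" and len(sys.argv) > 1:
    # The index path is supplied by the caller; nothing is assumed about its location.
    report(sys.argv[1])
```

If .fdt dominates, the stored fields are the cost. If the term dictionary and postings files (.tim, .doc, .pos) are large instead, the indexed text fields are worth a look.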
You are using a recent version of Solr, so your stored data should be properly compressed.

Another idea that comes to mind is related to your merge policy. Are you merging segments often or not? Are many deletions happening? The space used by deleted documents is reclaimed only after a segment merge. Maybe you are in that situation.

You are not indexing many of your fields, so you are already not storing the inverted index data structures for those fields. You don't have termVectors either, which usually take more space.

Let's understand this better, because it will be quite interesting to find out why simply storing the data takes 3x your original document size.

Cheers

2015-07-22 12:47 GMT+01:00 Daniel Collins <danwcoll...@gmail.com>:

> Why are most of your fields stored but not indexed? That suggests to me
> that you are using Solr as your primary data store, not as an index (which
> is not Solr's ideal use case)
>
> Secondly, I think there is confusion around the term "segments". You have
> a field called segment in your schema, but segments in Lucene terms means
> parts of the index. So to clarify, when you say your "segments" size is
> 8.4Gb, I assume you mean the input data you are putting in the segments
> field?
>
> If you look at the files in your index, you can see the different elements
> that make up the index,
>
> https://lucene.apache.org/core/4_7_2/core/org/apache/lucene/codecs/lucene46/package-summary.html#package_description
> gives the full description of all the different elements for your version.
> As Alessandro says, based on your schema the field data (.fdt) files are
> probably the largest part of your index?
>
> You should be able to see how the index breaks down in terms of data, from
> there you can work out how to tweak your schema.
>
> Remember that all your fields are stored, so the index size will always be
> the size of all the stored data, plus all the indexes needed.
> Solr's efficiency is around the indexed data, and it does sometimes trade off more
> disk space for greater speed in reading, so you will have to bear that in
> mind.
>
>
> On 22 July 2015 at 12:29, Emir Arnautovic <emir.arnauto...@sematext.com>
> wrote:
>
> > Is this test index? Do you rewrite documents with same ids? Did you try to
> > optimize index?
> >
> > Emir
> >
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> >
> > On 22.07.2015 13:10, Daniel Holmes wrote:
> >
> >> Upayavira number of docs in that case is 140275. The solr memory is 30Gb.
> >>
> >> Yes Emir I need most of them to be saved.
> >>
> >> I don't know Alessandro is that usual to use disk for indexing more than 3x
> >> of document size and presumably it will grow up in continue of crawl
> >> exponentially... Its so suboptimal I think.
> >>
> >>
> >> On Wed, Jul 22, 2015 at 3:16 PM, Alessandro Benedetti <
> >> benedetti.ale...@gmail.com> wrote:
> >>
> >>> "In one case for instance my segments size is 8.4G while index size is
> >>> 28G!!! It seems unusual…"
> >>>
> >>> The index is a collection of index segments + few overhead .
> >>> So, do you simply mean you have 4 segments ?
> >>> Where is the problem anyway ?
> >>> You are also storing content which usually is a big part of the index.
> >>> As Upaya said, I am curious to know why you are so surprised !
> >>>
> >>> Cheers
> >>>
> >>> 2015-07-22 11:27 GMT+01:00 Daniel Holmes <noora.sa...@gmail.com>:
> >>>
> >>>> Hi All
> >>>> I have problem with index size in solr 4.7.2. My OS is Ubuntu 14.10
> >>>> 64-bit.
> >>>>
> >>>> my fields are :
> >>>>
> >>>> <field name="id" type="string" stored="true" indexed="true"/>
> >>>> <field name="segment" type="string" stored="true" indexed="false"/>
> >>>> <field name="url" type="url_text" stored="true" indexed="true" required="true"/>
> >>>> <field name="outlink" type="url_text" stored="true" indexed="true" required="true"/>
> >>>> <field name="content" type="text_general" stored="true" indexed="true"/>
> >>>> <field name="title" type="text_general" stored="true" indexed="true"/>
> >>>> <field name="host" type="url" stored="false" indexed="true"/>
> >>>> <field name="segment" type="string" stored="true" indexed="false"/>
> >>>> <field name="boost" type="float" stored="true" indexed="false"/>
> >>>> <field name="digest" type="string" stored="true" indexed="false"/>
> >>>> <field name="tstamp" type="date" stored="true" indexed="false"/>
> >>>>
> >>>> In one case for instance my segments size is 8.4G while index size is
> >>>> 28G!!! It seems unusual...
> >>>>
> >>>> What suggestions do you have to reduce index size?
> >>>> Is there any way to check disk usage details in cores? e.g. stop words,
> >>>> stored docs, etc.
> >>>>
> >>>>
> >>>
> >>> --
> >>> --------------------------
> >>>
> >>> Benedetti Alessandro
> >>> Visiting card - http://about.me/alessandro_benedetti
> >>> Blog - http://alexbenedetti.blogspot.co.uk
> >>>
> >>> "Tyger, tyger burning bright
> >>> In the forests of the night,
> >>> What immortal hand or eye
> >>> Could frame thy fearful symmetry?"
> >>>
> >>> William Blake - Songs of Experience -1794 England
> >>>
> >>>
>

--
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England