Segments are on a per-field basis... so doesn't it depend on how many fields
are merged in parallel? I mean, when most people say "index size" they are
referring to all fields collectively, not individual fields. I'm just
wondering how the number of processor cores might affect things (more cores
might make the worst-case scenario worse, since they would maximize the
amount of data processed at a given moment).
But I suppose that, in the final analysis, it may all average out. It may
not be exactly the worst case, but maybe close enough.
And all of this depends on which merge policy you choose. With the default
tiered merge policy, things shouldn't be as bad as the 3x worst case.
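For reference, here is a minimal sketch of the TieredMergePolicy knobs that
bound how much data a single merge touches (Lucene 4.x Java API; the values
are illustrative assumptions, not recommendations):

    import org.apache.lucene.index.TieredMergePolicy;

    // Sketch only: knobs that limit how much data one merge rewrites.
    public class MergePolicyExample {
        public static TieredMergePolicy illustrativePolicy() {
            TieredMergePolicy tmp = new TieredMergePolicy();
            tmp.setSegmentsPerTier(10.0);      // segments allowed per size tier
            tmp.setMaxMergeAtOnce(10);         // max segments merged in one go
            tmp.setMaxMergedSegmentMB(5120.0); // cap merged segments near 5 GB
            return tmp; // pass to IndexWriterConfig.setMergePolicy(...)
        }
    }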
-- Jack Krupansky
-----Original Message-----
From: Walter Underwood
Sent: Thursday, April 11, 2013 10:40 AM
To: solr-user@lucene.apache.org
Subject: Re: Approximately needed RAM for 5000 query/second at a Solr
machine?
Here is the situation where merging can require 3X space. It can only happen
if you force merge, then index with merging turned off, but we had Ultraseek
customers do that.
* All documents are merged into a single segment.
* Without a merge, all documents are replaced.
* This results in one segment of deleted documents and one of new documents
(2X).
* A merge takes place, creating a new segment of the same size, thus 3X.
For normal operation, 2X is plenty of room.
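To put rough numbers on the scenario above, a back-of-envelope sketch (the
10 GB starting size is just an illustrative assumption):

    // Illustrative walk-through of the forced-merge worst case.
    public class WorstCaseDisk {
        public static void main(String[] args) {
            long indexGb = 10;                           // assumed index size
            long afterOptimize = indexGb;                // one big segment: 10 GB
            long afterReindex = afterOptimize + indexGb; // deleted + new docs: 20 GB
            long duringMerge = afterReindex + indexGb;   // merge writes a copy: 30 GB
            System.out.println("Peak disk during merge: " + duringMerge + " GB (3x)");
        }
    }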
wunder
On Apr 11, 2013, at 6:46 AM, Michael Ryan wrote:
I've investigated this in the past. The worst case is 2*indexSize
additional disk space (3*indexSize total) during an optimize.
In our system, we use LogByteSizeMergePolicy, and used to have a
mergeFactor of 10. We would see the worst case happen when there were
exactly 20 segments (or some other multiple of 10, I believe) at the start
of the optimize. IIRC, it would merge those 20 segments down to 2
segments, and then merge those 2 segments down to 1 segment. 1*indexSize
space was used by the original index (because there is still a reader open
on it), 1*indexSize space was used by the 2 segments, and 1*indexSize space was
used by the 1 segment. This is the worst case because there are two full
additional copies of the index on disk. Normally, when the number of
segments is not a multiple of the mergeFactor, there will be some part of
the index that was not part of both merges (and the excluded part would
usually be the largest segments).
We worked around this by doing multiple optimize passes, where the first
pass merges down to between 2 and 2*mergeFactor-1 segments (based on a
great tip from Lance Norskog on the mailing list a couple of years ago).
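A sketch of that two-pass approach with the SolrJ 4.x client (the URL and
the intermediate segment count are illustrative assumptions):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    // Sketch (SolrJ 4.x): optimize in two passes so no segment has to be
    // rewritten twice within a single pass.
    public class TwoPassOptimize {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/core1");
            // Pass 1: merge down to between 2 and 2*mergeFactor-1 segments.
            solr.optimize(true, true, 11); // waitFlush, waitSearcher, maxSegments
            // Pass 2: merge the survivors into a single segment.
            solr.optimize(true, true, 1);
            solr.shutdown();
        }
    }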
I'm not sure if the current merge policy implementations still have this
issue.
-Michael
-----Original Message-----
From: Furkan KAMACI [mailto:furkankam...@gmail.com]
Sent: Thursday, April 11, 2013 2:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Approximately needed RAM for 5000 query/second at a Solr
machine?
Hi Walter;
Is there any documentation or other source that says the worst case is
three times the disk space? Twice versus three times makes a real
difference when we are talking about GBs of disk space.
2013/4/10 Walter Underwood <wun...@wunderwood.org>
Correct, except the worst case maximum for disk space is three times.
--wunder
On Apr 10, 2013, at 6:04 AM, Erick Erickson wrote:
You're mixing up disk and RAM requirements when you talk about
having twice the disk size. Solr does _NOT_ require twice the index
size in RAM to optimize; it requires twice the size on _DISK_.
In terms of RAM requirements, you need to create an index, run
realistic queries at the installation and measure.
Best
Erick
On Tue, Apr 9, 2013 at 10:32 PM, bigjust <bigj...@lambdaphil.es> wrote:
On 4/9/2013 7:03 PM, Furkan KAMACI wrote:
These are really good metrics for me:
You say that RAM size should be at least the index size, and that it
is better to have RAM twice the index size (because of the worst-case
scenario).
On the other hand, let's assume that I have more RAM than twice the
index size on the machine. Can Solr use that extra RAM, or is twice
the index size an approximate upper limit?
What we have been discussing is the OS cache, which is memory
that is not used by programs. The OS uses that memory to make
everything run faster. The OS will instantly give that memory up
if a program requests it.
Solr is a Java program, and Java uses memory a little
differently, so Solr most likely will NOT use more memory when it is
available.
In a "normal" directly executable program, memory can be
allocated at any time, and given back to the system at any time.
With Java, you tell it the maximum amount of memory the program
is ever allowed to use. Because of how memory is used inside
Java, most long-running Java programs (like Solr) will allocate
up to the configured maximum even if they don't really need that much
memory.
Most Java virtual machines will never give the memory back to the
system even if it is not required.
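A small illustration of that point (plain Java, nothing Solr-specific);
run it with different -Xmx values to see the fixed ceiling the JVM reports:

    // Illustration: the JVM has a fixed heap ceiling (-Xmx), and what it
    // has claimed from the OS is rarely given back.
    public class HeapCeiling {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            long mb = 1024 * 1024;
            System.out.println("max heap (-Xmx):  " + rt.maxMemory() / mb + " MB");
            System.out.println("claimed from OS:  " + rt.totalMemory() / mb + " MB");
            System.out.println("in use right now: "
                + (rt.totalMemory() - rt.freeMemory()) / mb + " MB");
        }
    }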
Thanks, Shawn
Furkan KAMACI <furkankam...@gmail.com> writes:
I am sorry but you said:
*you need enough free RAM for the OS to cache the maximum amount
of disk space all your indexes will ever use*
I have made an assumption about the indexes on my machine. Let's assume
they total 5 GB. So it is better to have at least 5 GB of RAM? OK,
Solr will use RAM up to however much I define for the Java process.
When we think about the indexes on disk and the OS caching them in
RAM, is this what you are talking about: having more than 5 GB - or -
10 GB of RAM for my machine?
2013/4/10 Shawn Heisey <s...@elyograg.org>
10 GB. Because when Solr shuffles the data around, it could use up
to twice the size of the index in order to optimize the index on disk.
-- Justin
--
Walter Underwood
wun...@wunderwood.org