On 07/28/2014 03:07 PM, Joe Landman wrote:
On 7/28/14, 2:55 PM, Prentice Bisbal wrote:
On 07/28/2014 01:29 PM, Jeff White wrote:
Power draw will vary greatly depending on many factors. Where I am
at we currently have 16 racks of HPC equipment (compute nodes,
storage, network gear, etc.) using about 140 kVA but can use up to
160 kVA. A single rack with 26 compute nodes each with 64 cores
worth of AMD 6276 (Supermicro boxes) is using about 18 kW across the
PDUs, 3 phase at 240 volts, with most of the nodes at 100% CPU usage.
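
As a rough sanity check on the figures above, the per-node and per-rack averages can be worked out directly. This is only a minimal sketch: the variable names are illustrative, and it compares kVA and kW loosely because the thread gives no power factor.

```python
# Figures quoted in the paragraph above; names are illustrative only.
rack_kw = 18.0         # measured draw of one dense compute rack near 100% CPU
nodes_per_rack = 26
total_kva = 140.0      # current draw across all racks
racks = 16

# Average draw per node in the dense rack, in watts.
watts_per_node = rack_kw * 1000 / nodes_per_rack

# Average draw per rack across the whole room (includes storage and network racks).
avg_kva_per_rack = total_kva / racks

print(f"{watts_per_node:.0f} W per node")
print(f"{avg_kva_per_rack:.2f} kVA per rack on average")
```

The dense compute rack draws roughly twice the room-wide average, which is one reason a room average is a poor guide for sizing a rack of compute nodes.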
Agreed there's a lot of variability. Since I don't know exactly what's
going in my new space yet, I'm looking for everyone's input to come
up with an average, or ballpark amount. The 5 - 10 kW one vendor
specified seems waaaay too low for a rack of high-density HPC nodes
running at or near 100% utilization.
Seriously, don't design for average, shoot for worst case scenario.
Nothing sucks so much as having too low of a power or cooling budget
and a big new shiny that can't be fully turned on thanks to that.
This is exactly what I'm trying to do. I assume HPL will provide a worst
case scenario, based on the average of everyone else's worst case
scenario. I know that doesn't make sense, but I need to eliminate
outliers that are extremely high density, like HP's new Apollo systems.
If my systems don't have enough power to run HPL, I can't even perform
acceptance testing!
I can't speak to what other vendors say/do in this regard, but I can
say that we try to make sure we never use more than 50% of the
capacity of any particular PDU, and that the PDUs have enough head
room to be able to handle sudden loads (say one of the PDUs falling
over).
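
The reasoning behind the 50% rule can be sketched in a few lines of Python. The function name and the 10 kW PDU rating are hypothetical, and the sketch assumes a simple A/B pair where the whole load lands on the surviving PDU.

```python
def surviving_pdu_load(load_a_kw, load_b_kw, capacity_kw):
    """Fraction of one PDU's capacity in use if its partner fails
    and the combined load lands on the survivor."""
    return (load_a_kw + load_b_kw) / capacity_kw

# Hypothetical A/B pair, each PDU rated for 10 kW.
at_50_percent = surviving_pdu_load(5.0, 5.0, 10.0)   # each PDU at 50%
at_60_percent = surviving_pdu_load(6.0, 6.0, 10.0)   # each PDU at 60%

print(f"50% each -> survivor at {at_50_percent:.0%}")
print(f"60% each -> survivor at {at_60_percent:.0%}")
```

At 50% each, the survivor ends up at its full rating with nothing to spare; anything above that overloads it when its partner drops.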
In engineering, they call this a safety factor. When I was in school, a
common safety factor was something like worst case scenario + 20%, but
extreme safety considerations, like bridges or amusement park rides, got
a much higher safety factor.
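
A minimal sketch of the "worst case + 20%" rule of thumb, in Python. The function name is made up for illustration, and the 18 kW input is simply the measured rack figure quoted earlier in the thread, used here as a stand-in for a worst case.

```python
def design_load_kw(worst_case_kw, safety_margin=0.20):
    """Worst-case load plus a fractional safety margin (0.20 means +20%)."""
    return worst_case_kw * (1 + safety_margin)

# Assumption: treat the 18 kW measured rack as the worst case.
print(f"{design_load_kw(18.0):.1f} kW per rack with a 20% margin")
```

A larger margin is just a different `safety_margin` value; which number is appropriate is a judgment call the sketch does not make.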
We've had a situation (years ago) where we were pressed not to
"over-spec" the power, and despite our protests, this is what was
installed. First time a PDU tripped a breaker (did I mention that
they overloaded our original design? No? Well ...), all the load hit
the second PDU, full force. This was not pretty.
The cost to "over spec" is in the noise relative to the opportunity
cost for under spec'ing, not to mention the "additional" cost of more
power (and cooling ... don't forget the cooling!).
I agree. If I overspec, no one will notice, except the accountants. If I
underspec, and we can't use the datacenter at its designed capacity,
everyone will notice, and it will be an embarrassment for our group.
You can set the maximum boundary on power pretty easily with maximum
draw per node and basic math. This ignores inrush current and power,
but let's assume you do a phased power on (1-3 second intervals between
nodes). If you want to hit all the power buttons at once, just make
sure you have enough headroom for that inrush.
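
The "maximum draw per node and basic math" bound, plus the time a phased power-on takes, can be sketched as below. The 900 W per-node maximum is a hypothetical nameplate figure, not something from this thread, and inrush is deliberately left out, as in the paragraph above.

```python
def rack_upper_bound_kw(nodes, max_watts_per_node):
    """Upper bound on steady-state rack draw: every node at its maximum."""
    return nodes * max_watts_per_node / 1000

def phased_power_on_seconds(nodes, interval_s):
    """Time until the last node is switched on when nodes are started
    one at a time; the first node starts at t=0, so there are nodes-1 gaps."""
    return (nodes - 1) * interval_s

# Hypothetical rack: 26 nodes, 900 W maximum each.
print(f"{rack_upper_bound_kw(26, 900):.1f} kW upper bound")
print(f"{phased_power_on_seconds(26, 1)} to {phased_power_on_seconds(26, 3)} s to power on")
```

Even at the slow end of the 1-3 second range, a full rack comes up in a little over a minute, so staggering costs very little time.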
It's not a dark art per se, but be quite aggressive in what you think
your power draws are going to be. Use that to set your upper bound,
and assume you don't want to run your PDUs to 75% capacity normally
(though under extreme load with half of your other PDUs offline, this
isn't a bad target).
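
One way to turn that 75% target into a PDU size, sketched in Python. It assumes an even number of PDUs per rack with the load shared equally, and that "half offline" means exactly half of them are lost; the function name and the 18 kW peak are illustrative.

```python
def min_pdu_capacity_kw(rack_peak_kw, pdus_per_rack, max_fraction=0.75):
    """Smallest per-PDU rating such that, with half the PDUs offline,
    the survivors carry the rack's peak load at no more than max_fraction."""
    if pdus_per_rack < 2 or pdus_per_rack % 2 != 0:
        raise ValueError("sketch assumes an even number of PDUs, at least 2")
    surviving = pdus_per_rack // 2
    return rack_peak_kw / surviving / max_fraction

# Assumption: 18 kW peak rack fed by one A/B pair of PDUs.
capacity = min_pdu_capacity_kw(18.0, 2)
normal_fraction = (18.0 / 2) / capacity   # load per PDU with both online

print(f"{capacity:.1f} kW per PDU, {normal_fraction:.1%} loaded in normal operation")
```

Sized this way, each PDU sits well under 50% in normal operation and reaches 75% only when its partner is down.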
I want to be very aggressive and allow excess capacity as a safety
margin and for future growth, but we are hitting our budget limits, and some
are trying to 'right size' our power and cooling, which I'm afraid could
be disastrous. Some involved in the discussion have stated only 5 - 10
kW per full rack, which is too small. Since I don't know exactly what
systems I'm going to get from my RFP, I can't do exact calculations
based on specific models. I could do a few different models, but that
can be time consuming, and it's not always easy to get all that
information from the vendors.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf