Re: [Beowulf] Power draw of cluster nodes under heavy load

Prentice Bisbal Mon, 28 Jul 2014 14:15:12 -0700


On 07/28/2014 03:07 PM, Joe Landman wrote:

On 7/28/14, 2:55 PM, Prentice Bisbal wrote:
On 07/28/2014 01:29 PM, Jeff White wrote:
Power draw will vary greatly depending on many factors. Where I am at we currently have 16 racks of HPC equipment (compute nodes, storage, network gear, etc.) using about 140 kVA but can use up to 160 kVA. A single rack with 26 compute nodes each with 64 cores worth of AMD 6276 (Supermicro boxes) is using about 18 kW across the PDUs, 3 phase at 240 volts, with most of the nodes at 100% CPU usage.
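As a rough back-of-the-envelope check on the figures above, the per-node and per-rack averages can be worked out as follows. This is only a sketch: it assumes the 18 kW is spread evenly over the 26 nodes, and it keeps the kVA and kW figures separate because no power factor is given.

```python
# Figures quoted above: one rack of 26 nodes drawing about 18 kW at ~100% CPU.
rack_kw = 18.0
nodes_per_rack = 26

# Assumes the load is spread evenly across the nodes in the rack.
watts_per_node = rack_kw * 1000 / nodes_per_rack

# Site-wide figure quoted above: 16 racks of mixed equipment at about 140 kVA.
site_kva = 140.0
racks = 16
avg_kva_per_rack = site_kva / racks

print(f"{watts_per_node:.0f} W per node in the loaded compute rack")
print(f"{avg_kva_per_rack:.2f} kVA per rack averaged over the whole site")
```

The site-wide average comes out well below the loaded compute rack because it includes storage and network racks, which is one reason an average is a poor design target.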
Agreed, there's a lot of variability. Since I don't know exactly what's going in my new space yet, I'm looking for everyone's input to come up with an average, or ballpark amount. The 5 - 10 kW one vendor specified seems waaaay too low for a rack of high-density HPC nodes running at or near 100% utilization.
Seriously, don't design for average, shoot for worst case scenario. Nothing sucks so much as having too low of a power or cooling budget and a big new shiny that can't be fully turned on thanks to that.

This is exactly what I'm trying to do. I assume HPL will provide a worst case scenario, based on the average of everyone else's worst case scenario. I know that doesn't make sense, but I need to eliminate outliers that are extremely high density, like HP's new Apollo systems. If my systems don't have enough power to run HPL, I can't even perform acceptance testing!

I can't speak to what other vendors say/do in this regard, but I can say that we try to make sure we never use more than 50% of the capacity of any particular PDU, and that the PDUs have enough headroom to be able to handle sudden loads (say one of the PDUs falling over).
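A minimal sketch of why the 50% rule matters, assuming a rack fed by two PDUs that share the load evenly. The 20 kW PDU capacity is a hypothetical figure chosen for illustration, and the function name is made up for this example; the 18 kW rack load is the figure quoted earlier in the thread.

```python
def pdu_utilization(rack_load_kw, pdu_capacity_kw, n_pdus, n_failed=0):
    """Fraction of capacity used on each surviving PDU (even load sharing assumed)."""
    surviving = n_pdus - n_failed
    if surviving <= 0:
        raise ValueError("no PDUs left to carry the load")
    return rack_load_kw / surviving / pdu_capacity_kw

RACK_LOAD_KW = 18.0     # rack figure quoted earlier in the thread
PDU_CAPACITY_KW = 20.0  # hypothetical capacity, for illustration only

normal = pdu_utilization(RACK_LOAD_KW, PDU_CAPACITY_KW, n_pdus=2)
failover = pdu_utilization(RACK_LOAD_KW, PDU_CAPACITY_KW, n_pdus=2, n_failed=1)
print(f"normal: {normal:.0%}, one PDU down: {failover:.0%}")
```

With both PDUs under 50% in normal operation, the survivor stays under 100% when its partner drops out. Anything above 50% per PDU in normal operation means the survivor is overloaded the moment the other one trips.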

In engineering, they call this a safety factor. When I was in school, a common safety factor was something like worst case scenario + 20%, but extreme safety considerations, like bridges or amusement park rides, got a much higher safety factor.

We've had a situation (years ago) where we were pressed not to "over-spec" the power, and despite our protests, this is what was installed. First time a PDU tripped a breaker (did I mention that they overloaded our original design? No? Well ...), all the load hit the second PDU, full force. This was not pretty.
The cost to "over spec" is in the noise relative to the opportunity cost for under spec'ing, not to mention the "additional" cost of more power (and cooling ... don't forget the cooling!).

I agree. If I overspec, no one will notice, except the accountants. If I underspec, and we can't use the datacenter at its designed capacity, everyone will notice, and it will be an embarrassment for our group.

You can set the maximum boundary on power pretty easily with maximum draw per node and basic math. This ignores inrush current and power, but let's assume you do a phased power on (1-3 second intervals between nodes). If you want to hit all the power buttons at once, just make sure you have enough headroom for that inrush.
It's not a dark art per se, but be quite aggressive in what you think your power draws are going to be. Use that to set your upper bound, and assume you don't want to run your PDUs to 75% capacity normally (though under extreme load with half of your other PDUs offline, this isn't a bad target).
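The "maximum draw per node and basic math" approach above might be sketched like this. The 750 W per-node maximum is a hypothetical figure for illustration only, and the function names are made up for this example. The 26 nodes per rack, the 75% target with half the PDUs offline, and the 1-3 second power-on interval come from the thread. Like the text, the sketch ignores inrush current.

```python
def rack_upper_bound_kw(max_node_watts, nodes_per_rack):
    """Upper bound on rack draw: every node at its maximum draw at once."""
    return max_node_watts * nodes_per_rack / 1000.0

def min_pdu_capacity_kw(rack_kw, n_pdus, n_offline, target_utilization):
    """Capacity each PDU needs so survivors stay at or below the target."""
    surviving = n_pdus - n_offline
    if surviving <= 0:
        raise ValueError("no PDUs left to carry the load")
    return rack_kw / surviving / target_utilization

def staggered_power_on_seconds(n_nodes, interval_s):
    """Time from the first node's power-on to the last node's power-on."""
    return (n_nodes - 1) * interval_s

MAX_NODE_WATTS = 750  # hypothetical per-node maximum, for illustration only
NODES_PER_RACK = 26   # rack size quoted earlier in the thread

bound_kw = rack_upper_bound_kw(MAX_NODE_WATTS, NODES_PER_RACK)
# Two PDUs, one offline, survivor held to 75% of its capacity.
pdu_kw = min_pdu_capacity_kw(bound_kw, n_pdus=2, n_offline=1,
                             target_utilization=0.75)
ramp_s = staggered_power_on_seconds(NODES_PER_RACK, interval_s=3)

print(f"rack upper bound: {bound_kw:.1f} kW")
print(f"minimum capacity per PDU: {pdu_kw:.1f} kW")
print(f"staggered power-on takes about {ramp_s} s")
```

Swapping in a vendor's real maximum draw per node gives the upper bound for a specific model; the rest of the arithmetic stays the same.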

I want to be very aggressive and allow excess capacity as a safety margin and for future growth, but we are hitting our budget limits, and some are trying to 'right size' our power and cooling, which I'm afraid could be disastrous. Some involved in the discussion have stated only 5 - 10 kW per full rack, which is too small. Since I don't know exactly what systems I'm going to get from my RFP, I can't do exact calculations based on specific models. I could do a few different models, but that can be time consuming, and it's not always easy to get all that information from the vendors.


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
