So, more or less, install day was a major success for us here at Purdue. The party got started at 8:00 AM local time.

We had around 40 people unboxing machines in the loading dock. After an hour, they had gone through nearly 20 pallets of boxes. (We asked them to take a break for some of the free breakfast we had.) The bottleneck in getting machines racked was the limited aisle space between the two rack rows and the need to get the rails in ahead of the actual machines.

Around 12:00 PM, enough was unpacked, racked, and cabled to begin software installation. By 1:00 PM, 500 nodes were up and jobs were running. By 4:00 PM, everything was going. This morning, we hit the last few mis-installs. Our DOA nodes were around 1% of the total order.

One of our nanotech researchers got in a hero run of his code and pronounced the cluster perfect early this morning. Not a bad turnaround, and a very happy customer.

We were blown away by how quickly the teams moved through their jobs. Of course, it wasn't all that surprising, since we pulled a lot of technical talent from IT shops all around the University to work in two-hour shifts. It was a great time to socialize and get to know the faces behind the emails. The massive preparation effort that took place beforehand brought the research computing group, the central networking group, and the data center folks together in ways that hadn't happened before.

The physical networking was done in a new way for us. We used a large Foundry switch and the MRJ21 cabling system for it. Each rack gets 24 nodes, a 24-port passive patch panel, and 4 MRJ21 cables that run back to the network switch. Then there are just short patch cables between the panel and each node in the rack (running through a side-mounted cable manager). Eventually, there will be a cheap 24-port 100 Mbps switch in each rack to provide dedicated out-of-band management to each node.
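To give a concrete picture of the per-rack layout, here's a quick sketch of the node-to-panel-to-trunk mapping. The rack/node names and the six-ports-per-MRJ21-trunk grouping are just assumptions for illustration, not our actual labeling:

#!/usr/bin/env python
"""Sketch of the per-rack port layout described above.

The rack/node naming and the six-ports-per-trunk grouping are
assumptions for illustration, not our actual labeling scheme.
"""

NODES_PER_RACK = 24
PORTS_PER_TRUNK = 6   # assumed: each MRJ21 trunk carries 6 panel ports


def cable_map(rack):
    """Yield (node, panel_port, trunk) for every node in a rack."""
    for port in range(1, NODES_PER_RACK + 1):
        node = "%s-n%02d" % (rack, port)
        trunk = "%s-mrj21-%d" % (rack, (port - 1) // PORTS_PER_TRUNK + 1)
        yield node, port, trunk


if __name__ == "__main__":
    for node, port, trunk in cable_map("rack01"):
        print("%-12s panel port %2d -> %s" % (node, port, trunk))

A map like that is handy to print out and hand to each cabling team as a per-rack checklist.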

Most of the cabling was done by two-person teams: one person unwrapping cables and the other running them in the rack. This process wasn't the speediest, but things certainly look nice on the back side.

The installation infrastructure was revamped for this install. We normally kickstart each node and then set up cfengine to run on first boot; cfengine then brings the node into the cluster. To support this new cluster, we took five Dell 1850s and turned them into an IPVS cluster: one acting as the manager, the others as the serving boxes. They ran cfengine and apache (providing both cfengine and the kickstart packages).
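For the curious, here's roughly what the director side of an IPVS setup like that looks like. The VIP and real-server addresses below are made up, and this is a bare-bones sketch (standard ipvsadm usage wrapped in Python), not a copy of our actual configuration:

#!/usr/bin/env python
"""Minimal sketch of configuring an IPVS director for the install service.

The VIP and real-server addresses are invented for the example; the
ipvsadm invocations (-C, -A, -a, -s rr, -g) are standard usage. Needs
to run as root on a box with ipvsadm installed.
"""
import subprocess

VIP = "10.0.0.10"      # assumed virtual IP the nodes kickstart against
PORT = 80              # apache serving kickstart trees + cfengine inputs
REAL_SERVERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14"]


def run(cmd):
    print(" ".join(cmd))
    subprocess.check_call(cmd)


def configure_director():
    service = "%s:%d" % (VIP, PORT)
    # Flush any existing virtual server table.
    run(["ipvsadm", "-C"])
    # Define the virtual service with simple round-robin scheduling.
    run(["ipvsadm", "-A", "-t", service, "-s", "rr"])
    # Add each serving box as a real server using direct routing.
    for rs in REAL_SERVERS:
        run(["ipvsadm", "-a", "-t", service, "-r", "%s:%d" % (rs, PORT), "-g"])


if __name__ == "__main__":
    configure_director()

Round-robin is plenty for this kind of load (mostly identical HTTP fetches), and with direct routing the real servers answer the nodes directly, so the director doesn't become a bandwidth bottleneck.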

Since we use Red Hat Enterprise Linux for the OS on each node, we upgraded the campus proxy server from a Dell 2650 to a beefy Sun x4200. To keep a lot of load off the proxy, we kickstarted using the latest release of RHEL4.
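As a rough illustration of the kickstart side, something like the following could stamp out per-node config files pointing at the install servers. The VIP, tree path, password hash, and the rc.local hook for kicking off cfengine on first boot are simplified assumptions, not our production setup:

#!/usr/bin/env python
"""Toy per-node kickstart generator.

Everything here (the 10.0.0.10 install VIP, the /rhel4 tree path, the
rc.local hook for cfengine) is a made-up example, not our real config.
"""

KICKSTART_TEMPLATE = """\
install
url --url http://%(vip)s/rhel4
lang en_US.UTF-8
keyboard us
rootpw --iscrypted %(rootpw_hash)s
clearpart --all --initlabel
autopart
reboot

%%post
# Have cfengine pull this node into the cluster on its first boot.
# (How cfagent actually gets invoked depends on your cfengine packaging.)
echo "/usr/sbin/cfagent" >> /etc/rc.d/rc.local
"""


def write_kickstart(node, vip="10.0.0.10", rootpw_hash="$1$changeme$"):
    filename = "%s.ks.cfg" % node
    with open(filename, "w") as ks:
        ks.write(KICKSTART_TEMPLATE % {"vip": vip,
                                       "rootpw_hash": rootpw_hash})
    return filename


if __name__ == "__main__":
    # One kickstart file per node in an (imaginary) 24-node rack.
    for i in range(1, 25):
        print(write_kickstart("rack01-n%02d" % i))

In practice you'd template a lot more (partitioning, packages, network), but the %post hook is the piece that hands each node off to cfengine.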

So, those are some of the nitty-gritty details of what it took to get this thing off the ground in just a few hours.

--
Alex Younts

Jim Lux wrote:
At 03:20 PM 5/6/2008, Mark Hahn wrote:
We have built out a beefy install infrastructure to support a lot of simultaneous installs...

I'm curious to hear about the infrastructure.

btw:
http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=207501882

Interesting...

1000 computers: assume it takes 30 seconds to remove each from the box and walk it to the rack. That's 30,000 seconds, or about 500 minutes... call it 8 hours. Assume you've got 10 racks and 10 people, so you get some parallelism... an hour to unpack and rack one pile.


What wasn't shown in the video.. all the plugging and routing of network cables?

Jim