In those famous words from "Cool Hand Luke," "What we have here is a failure to communicate." For my role in that failure I apologize.

Tony Travis wrote:


I think problems can occur when you enforce such a strict demarcation boundary between your role and the role of the scientists you support: If communication between you and the scientists breaks down and you do not understand what they want to do then you cannot support their work effectively. The bottom line for me is that it is the objective of the organisation as a whole to produce research, and your role within the organisation is to facilitate the work of the scientists who do it.

Agreed. The point that I'm making is that managing a resource for one individual or group is different from managing it for multiple individuals or groups. It is my role within the organization to support the work of ALL of the scientists. That means, for instance, that we use a batch system. It means that we have limits on the number of jobs anyone can submit. It does not mean that I don't listen to what other people need.


I'm a scientist with a Linux box at my house: I also built and manage a small (64-1p node) openMosix Beowulf cluster for bioinformatics work at RRI and for the bioinformatics/mathematical work of our sister organisation BioSS. I don't think I'm exceptional in doing this, but I do think that having a Linux box at home has been very useful to me in gaining the experience I needed to manage our Beowulf cluster.

Not 'everyone' like me is as stupid or naive as you imply. I have the support of an excellent IT department and an electronics workshop who talk to me and understand very well what I want to do with the Beowulf. We have about 400 user accounts, which are registered and managed by IT centrally. I just enable NIS. The IT department also manage the central filers where precious data files are stored. I manage 3.2 TB of local RAID on the Beowulf. In my opinion this type of cooperation is a lot more effective than strict job demarcation...

For the record, I implied no stupidity and no naivete. I don't manage a machine for an individual or even a department. I manage it for the institution. I work with individuals to meet their needs. Meeting these needs has helped us to grow our resources from tens of processors to hundreds. It has provided researchers the resources to win millions of dollars in grants that we couldn't have competed for without the cooperation that we've built.



Seems to me that it would be straightforward to know this if you use a package management system like apt or rpm, which keeps track of what's installed and what the dependencies are. However, I also think that it's quite right that you should know more about this than him. In an ideal world, you should both make the decision about what to do on a rational basis. I doubt that he asked you to do it for no reason at all.

I don't work with much scientific software that is available in rpm form. Some of it is binary. But most is compiled specifically for our machines using high performance compilers from Absoft, Intel, PathScale or the Portland Group. So apt and rpm don't solve the problem. After discussing the issue with the PI, we discovered that she didn't need the most recent version (as I think that I noted in my original reply). The version that was installed would work.



Most of the problems I've come across like this arise from a lack of communication. I believe it's quite important for you to know why he wanted to do the upgrade, and for you to inform him about any problems or conflicts of interest that would result from the upgrade. Presumably, that is exactly what you did. My only complaint here is the impression you give that scientists like me want to upgrade software just for the sake of doing it. Please ask yourself why did the upstream maintainers release a new version? Was it just for the sake of upgrading it?

The person who wanted the upgrade was not the PI; it was a member of our staff. When I asked him to research the needed changes and talked with the PI, it turned out that the upgrade was not necessary.

I do advocate upgrading unless there is a reason *not* to do it. You seem to recommend the opposite: not upgrading unless there *is* a reason to do it. I wonder which strategy results in less work?


Finally, in general, the uptime of my clusters is measured in years. My original cluster, purchased from Paralogic, ran with less than 3 hours of downtime in 5 years. I think that works out to about 99.993% uptime.
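
For anyone who wants to check the uptime arithmetic, here is a minimal sketch. It assumes an average year of 365.25 days and treats the full 3 hours as downtime (the actual figure was less than that); the function name `uptime_percent` is only illustrative.

```python
# Assumption: average year length of 365.25 days (accounts for leap years).
HOURS_PER_YEAR = 365.25 * 24


def uptime_percent(downtime_hours: float, years: float) -> float:
    """Return uptime as a percentage of the total elapsed hours."""
    total_hours = years * HOURS_PER_YEAR
    return 100.0 * (1.0 - downtime_hours / total_hours)


if __name__ == "__main__":
    # 3 hours of downtime over 5 years of operation
    print(f"{uptime_percent(3, 5):.4f}%")
```

With 3 hours out of roughly 43,830, this comes to a little over 99.99%, which is about "four nines" of availability.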
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
