On 11/09/2012 06:20 AM, Igor Kozin wrote:
> You nailed it! And not just the code, new codes appear all the time.
> bowtie, bwa, soap2, soap3, bowtie2, snap ..

This was one of the major issues we found when we were pitching accelerators to VCs in the early 2000s. There were, at the time, something on the order of 200 phylogenetic codes. Many alignment codes. Many of code type X. It seemed that every lab had/used something different, so an accelerator had to be able to work on a generic set of problems, without complex rewriting of code.

Moreover, and this might be my biases showing, the code quality wasn't ... erm ... high. Bad design patterns were in use, if any. We ran into one proteomic code whose authors/users claimed they had a great threading model, only for us to look on in abject horror at object factories deep in nested loops.

With pretty much the sole exception of the HMMer code by Sean Eddy, it was rare that people were focused on performance with good coding/design/implementation. We did some pretty simple recoding of elements of HMMer, and part of our larger group did some MPI and GPU work. Many of the concepts were folded into HMMer 3 (current version, should be the one people use). But this was done because he needed the tool to be faster, and he paid attention to the issues associated with this. This had not been true of (most of) the rest the last time I looked.

This is important, as one thing that would tremendously benefit this community is a set of libraries of routines that can be reviewed, collected, and used much as BLAS is. Then, rather than writing your own implementation of a particular algorithm, you leverage the basic working plumbing of these libraries to build your code (similar to LAPACK, etc.).

Unfortunately, I see a rather significant "not invented here" viewpoint in some groups, and it's not lacking here. Which means, in many cases, there are huge efforts expended to re-invent core algorithms. Less effort goes into building tools atop another set of tools.

As many have noted, the parallelism is expressed through a scheduler. I remember calling this style of computing: high throughput.
That is, more parallelism by wider distribution of computation, not maximization of single run performance across a larger machine. There's nothing intrinsically wrong with this, it's a different way to do things. It also breaks critical assumptions in many HPC areas.

I remember that, having written ctblastall in the 1999-2000 time frame to seamlessly distribute BLAST computations across a cluster, we started running into issues with job schedulers. ctblastall would, in some of the larger cases, divide up the large BLAST run into tens of thousands of smaller jobs, and submit them to a cluster. Job schedulers, back then, for the most part, couldn't handle it. Platform's LSF could. Got a bit sluggish, but it worked. I included a number of optimizations to try to keep the throughput high, including early submission of bigger jobs, among other things like data motion optimization.

Don't disparage this as not being high performance ... it is. It's just expressed/used differently.

As to the point of code being written in Python/R/... Mebbe that's not such a good idea (Python). R, Matlab, Octave, ... are interpreted as well. Langs compiled to a VM are ok (Java, Perl6, Julia), but the best performance is going to come from statically compiled code. This said, the Julia people are doing their absolute best to be on par with statically compiled code, and it's coming very close.

But in my mind, it's still a huge missed opportunity to not be using something akin to BLAS for bioinfo codes.

And somewhat worse, these groups often are ... er ... influenced by computer science fads ... and you see that in their code. Some of these are hard to unwind into good code. The object factory design pattern is a great example of this. I am not sure why they do this, other than that I see lots of academic collaboration between bioinfo folks and comp sci folks, and the comp sci folks need to publish papers, not write good code.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
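
For readers who want the high-throughput idea from the message above in concrete form, here is a minimal Python sketch of splitting one large query set into many independent jobs and ordering them biggest first. This is not the ctblastall code, which is not shown in the message. The names `Job`, `split_into_jobs`, and `submission_order` are hypothetical, the use of total residue count as a cost estimate is an assumption, and no real scheduler is called: the sketch only decides what would be submitted and in which order.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Job:
    """One independent unit of work: a batch of query sequences."""
    job_id: int
    queries: List[Tuple[str, str]]  # (name, sequence) pairs

    @property
    def cost(self) -> int:
        # Assumption: total residue count is a rough proxy for run time.
        return sum(len(seq) for _, seq in self.queries)


def split_into_jobs(queries: Sequence[Tuple[str, str]],
                    max_residues: int) -> List[Job]:
    """Greedily pack queries, in input order, into jobs of bounded size.

    A single query longer than max_residues gets a job to itself.
    """
    jobs: List[Job] = []
    batch: List[Tuple[str, str]] = []
    size = 0
    for name, seq in queries:
        # Close the current batch if adding this query would overflow it.
        if batch and size + len(seq) > max_residues:
            jobs.append(Job(len(jobs), batch))
            batch, size = [], 0
        batch.append((name, seq))
        size += len(seq)
    if batch:
        jobs.append(Job(len(jobs), batch))
    return jobs


def submission_order(jobs: List[Job]) -> List[Job]:
    """Biggest jobs first, so the long ones do not straggle at the end."""
    return sorted(jobs, key=lambda j: j.cost, reverse=True)


if __name__ == "__main__":
    queries = [("q1", "A" * 50), ("q2", "C" * 400), ("q3", "G" * 30),
               ("q4", "T" * 120), ("q5", "A" * 10)]
    for job in submission_order(split_into_jobs(queries, max_residues=150)):
        print(job.job_id, job.cost, [name for name, _ in job.queries])
```

Submitting the most expensive jobs early is the classic longest-job-first heuristic: it tends to shorten the tail of a run, because the last jobs to finish are the small ones. A real tool would also have to throttle submissions to whatever rate the scheduler can absorb, which this sketch leaves out.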