Hello, I'd like to announce the release of ParBASH 0.1. With ParBASH you can write bash scripts that are automatically parallelized on SMP, multicore, and distributed systems using Apache Hadoop.
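To give a feel for the programming model: a ParBASH script looks like an ordinary bash pipeline that reads and writes hdfs:/ paths, as in the full example below. Here is a minimal illustrative sketch (the hdfs:/logs input and hdfs:/error-counts output are made-up placeholder paths, not part of ParBASH itself):

  # Count occurrences of each matching line across all input files;
  # this is the kind of pipeline ParBASH aims to parallelize over Hadoop.
  cat hdfs:/logs/* | grep ERROR | sort | uniq -c > hdfs:/error-counts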
Here is an example script, run on Amazon EC2, that finds the top 10 external references for Barack Obama pages on Wikipedia:

wiki.sh:

cat hdfs:/wikipedia-out/* | grep Obama | \
  perl -ne 'while (/<link type="external" href="([^"]+)">/g) { print "$1\n"; }' | \
  perl -ne 'if (/http:\/\/([^\/]+)(\/|$)/) { print "$1\n"; }' | \
  perl -ne 'if    (/([^\.]+\.)+([^\.]+\.[a-zA-Z]{2,3}\.[^\.]+)$/) { print "$2\n"; }
            elsif (/([^\.]+\.[a-zA-Z]{2,3}\.[^\.]+)$/)            { print "$1\n"; }
            elsif (/([^\.]+\.)*([^\.]+\.[^\.]+)$/)                { print "$2\n"; }' | \
  sort | uniq -c > hdfs:/out

The pipeline extracts the external link URLs, reduces each URL to its host, normalizes the host to its registered domain, and counts the occurrences of each domain.

The how and why of wiki.sh and ParBASH are explained at http://cloud-dev.blogspot.com/2009/06/introduction-to-parbash.html

Source code and more examples: http://code.google.com/p/parbash

If you would like to compile the code and play around with it, please contact me; I can help you get started.

Thanks,
Milenko