-------------- Original message -------------- 
From: Ashley Pittman <[EMAIL PROTECTED]> 

> I saw a talk which said SMT was worth a maximum of 20% on power5 and 
> often performed worse than if it had been turned off. This correlates 
> well with my experience of it on Intel CPUs. 

As Joe Landman suggested, the notion of a thread (as a logical construct 
representing parallelizable work) can be reduced to a single instruction.  In 
this case, the logical distance between the two workloads is minimal and is 
managed by OoO hardware (or VLIW).  As the separation of the parallel workloads 
(threads) grows, we move to parallel threads within one code, defined by 
code blocks; then to workloads in different processes of the same MPI 
application; then to workloads separated by an even greater logical distance in 
different applications; and finally to thread groups virtualized across OS 
environments.  
As Mark H. points out, the functional units do not care who is parent to their 
work.  Still, the problem of shepherding each result back to its proper pen grows 
with the distance of logical separation.  Hardware resources and chip surface 
area are required to manage this.  That is one reason why Intel delayed SMT in 
Clovertown and Harpertown, and why many-core advocates think that threads are a 
waste of chip space, especially in a data-parallel universe.  
We wish, of course, to more fully utilize under-utilized functional unit 
resources in a core, but as Ashley P. intimated, as we pile threads 
disproportionately on top of ever-growing parallel hardware, the chance of a 
delaying collision grows in a non-linear way.  Thus, the expanded variance.  
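To make that non-linear growth concrete: if a sampling scheduler drops threads onto functional units more or less at random, the chance that at least two land on the same unit is just the birthday problem. A toy sketch (my own illustration, not anything from the talk or hardware counters):

```python
def collision_prob(threads, units):
    """Probability that at least two randomly placed threads
    land on the same functional unit (birthday-problem math)."""
    p_clear = 1.0
    for i in range(threads):
        p_clear *= (units - i) / units  # i-th thread misses the occupied units
    return 1.0 - p_clear

# Doubling threads AND units together does not hold collisions steady --
# the collision probability keeps climbing:
for t, u in [(4, 16), (8, 32), (16, 64)]:
    print(f"{t:2d} threads on {u:2d} units -> p(collision) = {collision_prob(t, u):.3f}")
```

Even with the thread-to-unit ratio fixed at 1:4, the collision probability roughly doubles at each step, which is the expanding gap between a random schedule and an optimal one.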
Put another way, the gap between trivial or random schedules through the 
hardware and optimal ones grows (like the distance between a pair and a 
straight flush in poker) as both workloads and resources scale up.  If we are 
allocating resources based on sampling, we then run into the problem of not 
being able to discover where the idle resources are.  This is visible in scaling 
tests of very fat server nodes using the VMmark benchmark: efficiency drops 
off even with benchmarks scaled weakly.  Are there better alternatives?  Well, 
at the instruction level, we have VLIW -- pre-packed workloads known not to 
interfere with each other.  What about at the level of schedulers, which as I 
understand it are all sampling based ... ??  There is the notion of resource-
requirement-aware scheduling, which intends to eliminate resource collisions in 
advance for virtualized workloads.  The Cray XMT uses hardware resources to 
insulate a large group of parallel workloads (at the expense of individual or 
related ones, sometimes) from interference that might otherwise idle usable 
resources when additional, more or less distant, work was not available.
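The resource-requirement-aware idea, in caricature: place work only where its declared requirements are known to fit, rather than sampling and hoping. A toy sketch (the `Host`/`Workload` names and capacity numbers are invented for illustration; this is not eXludus' or anyone's actual interface):

```python
class Host:
    """A node with declared free capacity."""
    def __init__(self, name, cores, mem_gb):
        self.name, self.free_cores, self.free_mem = name, cores, mem_gb
        self.placed = []

    def fits(self, wl):
        return wl.cores <= self.free_cores and wl.mem_gb <= self.free_mem

    def place(self, wl):
        self.free_cores -= wl.cores
        self.free_mem -= wl.mem_gb
        self.placed.append(wl.name)

class Workload:
    """Work that declares its resource requirements up front."""
    def __init__(self, name, cores, mem_gb):
        self.name, self.cores, self.mem_gb = name, cores, mem_gb

def schedule(workloads, hosts):
    """First-fit placement: a workload runs only where its declared
    needs are covered, so collisions are refused in advance rather
    than discovered after the fact.  Returns the held-back names."""
    pending = []
    for wl in workloads:
        host = next((h for h in hosts if h.fits(wl)), None)
        if host:
            host.place(wl)
        else:
            pending.append(wl.name)  # wait rather than oversubscribe
    return pending
```

The contrast with a sampling scheduler is the `fits` check: nothing is co-located unless the arithmetic says it cannot collide on the declared resources.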
This discussion invokes wild thoughts ... like the notion of compiling multiple 
applications together in a cluster, and running them together knowing that the 
compiler can shuffle the work together smartly, without the need for additional 
hardware resources to do it.
Are folks here familiar with eXludus' resource-requirement-aware scheduling 
technology?
Sorry about the length ... but it is an interesting topic.
Regards,
rbw
-- 

"Making predictions is hard, especially about the future." 

Niels Bohr 

-- 

Richard Walsh 
Thrashing River Consulting-- 
5605 Alameda St. 
Shoreview, MN 55126 
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf