Hi Damian ---

Not a dumb question at all. Depending on the problem size and the hardware
in use, different performance factors could dominate (memory traffic, the
ability to use vector operations, network communication on a cluster,
etc.). We're not trying to achieve a specific performance result, but
rather to explore how to *separate* the coding activity of
*hand-optimization* from the question of *getting the right result*. So,
our goal here is to write something that *uses* iterators and
classes/records to specify the result in a way that's fairly independent of
execution order and storage layout (the actual code is somewhat more
abstract than what I emailed), and then to express different optimizations
with various iterators and classes/records, in a way that allows automated
checking to confirm that the optimizations haven't corrupted the result.
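
To make the "automated checking" idea concrete, here is a rough sketch, not
the actual project code. It assumes the recurrence from the loop nest quoted
below; the names referenceRows and optimizedRows, the coefficient values,
and the test input are all made up for illustration. One routine states the
result directly on the arrays, the other carries the previous values in
scalars, and a reduction confirms that the two agree.

```chapel
// Sketch only: referenceRows, optimizedRows, the coefficient values, and
// the test input are hypothetical, not taken from the actual code.
config const w = 4, h = 8;
const a1 = 0.5, a2 = 0.25, b1 = 0.125, b2 = 0.0625;  // made-up coefficients

// Specification: states the recurrence directly on the arrays, treating
// out-of-range neighbors as 0.0.
proc referenceRows(imgIn: [?D] real) {
  var y1: [D] real;
  for i in 0..w-1 {
    for j in 0..h-1 {
      const xm1 = if j >= 1 then imgIn[i, j-1] else 0.0;
      const ym1 = if j >= 1 then y1[i, j-1] else 0.0;
      const ym2 = if j >= 2 then y1[i, j-2] else 0.0;
      y1[i, j] = a1*imgIn[i, j] + a2*xm1 + b1*ym1 + b2*ym2;
    }
  }
  return y1;
}

// Hand-optimized variant: rows run in parallel, and the previous input and
// the two previous outputs are carried in scalars instead of re-read.
proc optimizedRows(imgIn: [?D] real) {
  var y1: [D] real;
  forall i in 0..w-1 {
    var xm1 = 0.0, ym1 = 0.0, ym2 = 0.0;
    for j in 0..h-1 {
      const y = a1*imgIn[i, j] + a2*xm1 + b1*ym1 + b2*ym2;
      y1[i, j] = y;
      xm1 = imgIn[i, j];
      ym2 = ym1;
      ym1 = y;
    }
  }
  return y1;
}

// Arbitrary deterministic test input.
var imgIn: [0..w-1, 0..h-1] real;
forall (i, j) in imgIn.domain do
  imgIn[i, j] = ((i*7 + j*3) % 11): real / 11.0;

const expected = referenceRows(imgIn);
const actual = optimizedRows(imgIn);

// Automated check: the optimization must not change the result.
const maxDiff = max reduce abs(expected - actual);
assert(maxDiff <= 1e-12);
writeln("max difference: ", maxDiff);
```

The tolerance of 1e-12 is an arbitrary choice for this sketch; a real check
would pick one to suit the data and the optimizations being compared.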

So, the advantage of the loop nest I sent is that it leaves open a later
choice, under programmer control, of whether to use the scalars.

Dave W

On Sat, Mar 24, 2018 at 9:09 PM, Damian McGuckin <[email protected]> wrote:

> On Wed, 21 Mar 2018, David G. Wonnacott wrote:
>
>> for i in 0..w-1 {
>>   xm1 = 0.0;
>>   y1.resetScalars();
>>   for j in 0..h-1 {
>>     y1.set(i,j, a1*imgIn[i,j] + a2*xm1 + b1*y1.get_mRW() + b2*y1.get_pW());
>>     xm1 = imgIn[i,j];
>>   }
>> }
>>
>
> What is the advantage of using set/get as above instead of something like
>
>         // Assume w > 1 and h > 1
>
>         forall i in 0..w-1
>         {
>                 var xy3 = imgIn[i, 0];
>                 var xy2 = a1 * xy3;
>                 var xy1 = 0.0;
>                 var xy0 = xy1;
>
>                 // should this loop be unrolled
>
>                 for j in 1..h-1
>                 {
>                         y1[i, j-1] = xy2;
>
>                         xy0 = xy1;
>                         xy1 = xy2;
>                         xy2 = xy3;
>                         xy3 = imgIn[i, j];
>
>                         // the following computation should get
>                         // done in parallel with any branching??
>
>                         xy2 = a1 * xy3 + a2 * xy2 + b1 * xy1 + b2 * xy0;
>                 }
>                 y1[i, h-1] = xy2;
>         }
>
> Do you need to consider reimplementing the kernel
>
>         xy2 = a1 * xy3 + a2 * xy2 + b1 * xy1 + b2 * xy0;
>
> such that it gets compiled into vector instructions? That probably
> means a lot more memory traffic, which we probably do not want.
>
> Sorry if it is a dumb question.
>
> Regards - Damian
>
> Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
> Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted
> here
> Views & opinions here are mine and not those of any past or present
> employer
_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users