In other news, I tried switching it from "class" to "record", and the
notation seems the same :-)  and the performance as good as I had hoped :-)
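
Roughly, the change amounts to something like the following. This is a
trimmed-down sketch, not the actual benchmark code: the name 'ourScalars'
is made up for illustration, the array part is left out, and the 'ref'
this-intent on 'set' is an assumption that newer compilers may want for a
record method that modifies its fields.

----
// Hypothetical, trimmed-down stand-in for the scalar part of 'ourArray'.
record ourScalars {
  var mostRecentWrite: real;  // ym1
  var previousWrite: real;    // ym2

  // Shift the last written value down and remember the new one.
  inline proc ref set(value: real) {
    previousWrite = mostRecentWrite;
    mostRecentWrite = value;
  }

  inline proc get_mRW() { return mostRecentWrite; }
  inline proc get_pW()  { return previousWrite; }
}

var y: ourScalars;   // records are values, so no 'new' is needed here
y.set(1.5);
y.set(2.5);
writeln(y.get_mRW(), " ", y.get_pW());
----

The calling code keeps the same y.set(...) / y.get_mRW() notation as with
the class.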

Dave W

On Wed, Mar 21, 2018 at 5:17 PM, Brad Chamberlain <[email protected]> wrote:

>
> Hi David --
>
> Got it, thanks for the additional information.
>
> Yes, Chapel does have an 'inline' keyword that you can apply to procedures
> in order to cause them to be inlined at callsites (Section 13.2 of the
> language specification).  In a deeper class hierarchy where there may be
> dynamic dispatch, the ability of the compiler to inline may be limited, but
> for a simple case like yours, it should generate simpler code for the
> back-end C compiler when 'inline' is used.
>
> As an example, for this simple class:
>
> ----
> class C {
>   var x: int;
>
>   proc get() {
>     return x;
>   }
>
>   proc set(x) {
>     this.x = x;
>   }
> }
> ----
>
> The following calls:
>
> ----
> myC.set(42);
> writeln(myC.get());
> ----
>
> Translate into the following normally:
>
> ----
>   set_chpl(myC_chpl, INT64(42));
>   call_tmp_chpl = get_chpl(myC_chpl);
>   writeln_chpl2(call_tmp_chpl);
> ----
>
> (where I'm omitting the definitions of the set_chpl() / get_chpl()
> routines themselves).  Whereas, if I put an 'inline' keyword before each of
> the 'proc' keywords, I get:
>
> ----
>   (myC_chpl)->x_chpl = INT64(42);
>   writeln_chpl2((myC_chpl)->x_chpl);
> ----
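>
> For reference, a sketch of the class with 'inline' applied to both
> methods, plus a small driver.  The declaration of myC is an assumption
> added only to make the snippet complete, and the exact form of 'new' may
> differ between Chapel versions:
>
> ----
> class C {
>   var x: int;
>
>   // 'inline' asks the compiler to expand the body at each callsite
>   inline proc get() {
>     return x;
>   }
>
>   inline proc set(x) {
>     this.x = x;
>   }
> }
>
> var myC = new C();   // assumed setup, not part of the original example
> myC.set(42);
> writeln(myC.get());
> ----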
>
> Whether or not these result in a performance improvement depends heavily on
> how aggressively the back-end C compiler would've optimized the non-inlined
> version of the code...  I'll be curious to hear whether it helps with your
> case.
>
> -Brad
>
>
> On Wed, 21 Mar 2018, David G. Wonnacott wrote:
>
>> OK, this is a Chapel version of the Deriche image-processing kernel from
>> the PolyBench benchmark.  It involves some loops that look like this:
>>
>>
>>
>> for i in 0..w-1 {
>>  ym1 = 0.0;
>>  ym2 = 0.0;
>>  xm1 = 0.0;
>>  for j in 0..h-1 {
>>    y1[i,j] = a1*imgIn[i,j] + a2*xm1 + b1*ym1 + b2*ym2;
>>    xm1 = imgIn[i,j];
>>    ym2 = ym1;
>>    ym1 = y1[i,j];
>>  }
>> }
>>
>>
>> and which we've abstracted with a class that captures the idea of "keep
>> track of the previously-written value" like this:
>>
>>
>>
>> class ourArray {
>>  const W: int;
>>  const H: int;
>>  const dom: domain(2);//maybe shouldn't be a const
>>
>>  var Vals: [dom] real;
>>  var mostRecentWrite: real; //ym1
>>
>>  var previousWrite: real; //ym2
>>
>>
>>  proc derrayConcise2Triererer(width: int, height: int){
>>    W = width;    H = height;   dom = {0..W-1,0..H-1}; }
>>
>>   proc set(i: int, j: int, value: real) {
>>    previousWrite = mostRecentWrite;  // not used in try1 or try2
>>    mostRecentWrite = value;          // not used in try1 or try2
>>    Vals[i,j] = mostRecentWrite;      // set to 'value' in try1 and try2
>>  }
>>
>>  proc get(i: int, j: int) {  return Vals[i,j]; }
>>
>>  proc resetScalars()      {  mostRecentWrite = 0;  previousWrite = 0;  }
>>
>>  proc get_mRW()           {  return mostRecentWrite;  }
>>
>> // ...
>>
>> for i in 0..w-1 {
>>  xm1 = 0.0;
>>  y1.resetScalars();
>>  for j in 0..h-1 {
>>    y1.set(i,j, a1*imgIn[i,j] + a2*xm1 + b1*y1.get_mRW() + b2*y1.get_pW());
>>    xm1 = imgIn[i,j];
>>  }
>> }
>>
>>
>> When we leave the original scalars in place and use get/set on the
>> *array* elements, we don't lose much performance, even if we're setting
>> redundant scalar values in the class as well as the main loop. However,
>> if we use the 'get_mRW' and 'get_pW' methods to access the scalars, we
>> get a much more
>> significant drop. I can attach a large collection of files, if you need
>> more than just the basic idea.
>>
>> I've attached a graph of performance (higher=faster) as a function of data
>> set size (number of bytes per array), with various experiments. There are
>> 3 runs for each code, so there is also a minor x-offset so that things don't
>> end up on top of each other and you can see the data; in other words, what
>> look sort of like two columns on the left really all use the same problem
>> size. Original, try1, and try2 are just variants on how we access the
>> *array* data; try3 involves storing the scalars in the 'set' method, and
>> try4 and try5 and concise are various approaches to *using* the scalars
>> from the class... try4 is the code shown in the loop above.
>>
>> Dave W
>>
>>
>> On Wed, Mar 21, 2018 at 4:34 PM, Brad Chamberlain <[email protected]> wrote:
>>
>>
>>> Hi David --
>>>
>>> I think the tools you'd use to optimize cases like this depend heavily on
>>> the idioms in the code.  Can you share a simplified program that
>>> demonstrates the pattern you're wrestling with as a basis for further
>>> conversation?
>>>
>>> Thanks,
>>> -Brad
>>>
>>>
>>> On Wed, 21 Mar 2018, David G. Wonnacott wrote:
>>>
>>>> I've done some experiments and found that the performance of some code
>>>> I'm writing seems to be limited by the use of 'get' methods to access some
>>>> scalars. In C++, I'd use the 'inline' keyword to try to optimize these
>>>> ...
>>>> is there an equivalent for Chapel? Should I be changing the class to a
>>>> record? That would be slightly inconvenient but not really a problem.
>>>>
>>>> Dave W
>>>>
>>>>
>>>>
>>
_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users