I'm developing a solution to optimize away intermediate stores (loads) for 
static local variables which are assigned to before referenced on every path 
through a function.

Currently GCC eliminates all loads/stores in a straight line code but not in 
control structures. AFAIU, passes like sccvn, lim (store motion in particular) 
sink etc. try to propagate the assignment exprs through SSA vars instead of 
through memory thereby eliminating almost all loads and some stores but still 
not all.

For example:

void some_function()
{              
    static some_type sum;
    ...
    for( i = 0 ; i < n ; i++ )
         {
             sum = matrix [i][n] ; 

             for( j = 0 ; j < i ; j++ )
                 {
                     sum -= matrix [i][j] * matrix [j][n] ; 
                 }

            matrix [i][n] = sum ;
        }
    ...
}

Resulting SSA:

...
  <bb 14> [2.25%]:
  _10 = matrix[0][n_13(D)];
  sum = _10;
  goto <bb 10>; [100.00%]
...
  <bb 4> [12.75%]:
  _1 = matrix[i_16][n_13(D)];
  if (i_16 > 0)
    goto <bb 6>; [100.00%]
  else
    goto <bb 9>; [0.00%]
...
  <bb 6> [85.00%]:
  # j_24 = PHI <j_18(6), 0(4)>
  # sum_lsm.4_27 = PHI <_6(6), _1(4)>
  _3 = matrix[i_16][j_24];
  _4 = matrix[j_24][n_13(D)];
  _5 = _3 * _4;
  _6 = sum_lsm.4_27 - _5;
  j_18 = j_24 + 1;
  if (i_16 > j_18)
    goto <bb 6>; [85.00%]
  else
    goto <bb 7>; [15.00%]
...
  <bb 7> [12.75%]:
  # _40 = PHI <_6(6)>
  goto <bb 9>; [100.00%]

  <bb 9> [12.75%]:
  # sum_lsm.4_19 = PHI <_40(7), _1(4)>

  <bb 10> [15.00%]:
  # i_20 = PHI <i_16(9), 0(14)>
  # sum_lsm.4_8 = PHI <sum_lsm.4_19(9), _10(14)>
  # sum_lsm.5_22 = PHI <1(9), 0(14)>
  matrix[i_20][n_13(D)] = sum_lsm.4_8;
  i_16 = i_20 + 1;
  if (n_13(D) > i_16)
    goto <bb 4>; [85.00%]
  else
    goto <bb 11>; [15.00%]

  <bb 11> [2.25%]:
  # sum_lsm.4_39 = PHI <sum_lsm.4_8(10)>
  # sum_lsm.5_38 = PHI <sum_lsm.5_22(10)>
  if (sum_lsm.5_38 != 0)
    goto <bb 12>; [0.00%]
  else
    goto <bb 13>; [100.00%]

  <bb 12> [0.00%]:
  sum = sum_lsm.4_39;
...

Although all intermediate loads are eliminated here, store to sum before and 
after the 'j' loop (see bb 14 and bb 12) still remain which can be eliminated.

My solution would reuse a lot of data structures and routines from the sink 
pass. So my question is, can I add this optimization into the sink pass itself 
or would it require a new pass and if yes, what would be the ideal place for it.
Any other pointers are more than welcome.

Thanks,
Prachi

Reply via email to