[Bug c++/35117] New: Vectorization on power PC

2008-02-07 Thread eyal at geomage dot com
Hello,
  I am unable to see the expected performance gain from vectorization on
PowerPC running SUSE Linux.
  I've prepared a simple test and compiled it once with the vectorization flags
and once without them. I'd appreciate it if someone could point out what I'm
doing wrong here. Below are the results of the test runs:
   time ./TestNoVec 92200 8 89720 1000
   real    0m23.549s

   time ./TestVec 92200 8 89720 1000
   real    0m22.845s

Here is the code:
#include <stdlib.h>

typedef float ARRTYPE;
int main ( int argc, char *argv[] )
{
    int m_nSamples = atoi( argv[1] );
    int itBegin = atoi( argv[2] );
    int itEnd = atoi( argv[3] );
    int iSizeMain = atoi( argv[ 4 ] );
    ARRTYPE *pSum1 = new ARRTYPE[ 10 ];
    ARRTYPE *pSum = new ARRTYPE[ 10 ];
    for ( int it = 0; it < m_nSamples; it++ )
    {
        pSum[ it ] = it / itBegin;
        pSum1[ it ] = itBegin / ( it + 1 );
    }
    ARRTYPE *pVec1 = (ARRTYPE*) malloc (sizeof(ARRTYPE) *m_nSamples);
    ARRTYPE *pVec2 = (ARRTYPE*) malloc (sizeof(ARRTYPE) *m_nSamples);
    for ( int i = 0, j = 0; i < m_nSamples - 5; i++ )
    {
        for( int it = itBegin; it < itEnd; it++ )
            pVec1[ it ] += pSum[ it ] + pSum1[ it ];
    }
    free( pVec1 );
    free( pVec2 );
}

Compilation flag for No vectorization:
gcc  -DTIXML_USE_STL -I /home/build/build -I /home/build/build -I. -I
/usr/local/include -I /usr/include -O3 -fomit-frame-pointer -mtune=powerpc
-falign-functions=16 -fprefetch-loop-arrays -fpeel-loops -funswitch-loops 
-fPIC -mcpu=powerpc  -m64 -fargument-noalias -funroll-loops
-ftree-vectorizer-verbose=7 -fdump-tree-vect-details  -c -o Test.o Test.cpp
gcc -lpthread -lz -lm -lstdc++ -DTIXML_USE_STL -I /home/build/build -I
/home/build/build -I. -I /usr/local/include -I /usr/include -O3
-fomit-frame-pointer -mtune=powerpc -falign-functions=16 -fprefetch-loop-arrays
-fpeel-loops -funswitch-loops  -fPIC -mcpu=powerpc  -m64 -fargument-noalias
-funroll-loops -ftree-vectorizer-verbose=7 -fdump-tree-vect-details
-L/usr/local/lib64 -DTIXML_USE_STL -pthread -L. -L /home/build/build/lib64 -L
/home/build/build/lib64 -L /usr/lib64 -L /lib64 -L /opt/gnome/lib64 -o
TestNoVec Test.o

Compilation of vectorized code:
gcc  -DTIXML_USE_STL -I /home/build/build -I /home/build/build -I. -I
/usr/local/include -I /usr/include -O3 -fomit-frame-pointer -mtune=powerpc
-falign-functions=16 -fprefetch-loop-arrays -fpeel-loops -funswitch-loops
-ftree-vectorize -fPIC -mcpu=powerpc -maltivec -mabi=altivec -m64
-fargument-noalias -funroll-loops -ftree-vectorizer-verbose=7
-fdump-tree-vect-details  -c -o Test.o Test.cpp
gcc -lpthread -lz -lm -lstdc++ -DTIXML_USE_STL -I /home/build/build -I
/home/build/build -I. -I /usr/local/include -I /usr/include -O3
-fomit-frame-pointer -mtune=powerpc -falign-functions=16 -fprefetch-loop-arrays
-fpeel-loops -funswitch-loops -ftree-vectorize -fPIC -mcpu=powerpc -maltivec
-mabi=altivec -m64 -fargument-noalias -funroll-loops
-ftree-vectorizer-verbose=7 -fdump-tree-vect-details -L/usr/local/lib64
-DTIXML_USE_STL -pthread -L. -L /home/build/build/lib64 -L
/home/build/build/lib64 -L /usr/lib64 -L /lib64 -L /opt/gnome/lib64 -o TestVec
Test.o


-- 
           Summary: Vectorization on power PC
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: c++
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: eyal at geomage dot com
 GCC build triplet: gcc (GCC) 4.3.0 20071124 (experimental)
  GCC host triplet: PowerPC
GCC target triplet: PowerPC


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117



[Bug c++/35117] Vectorization on power PC

2008-02-07 Thread eyal at geomage dot com


--- Comment #2 from eyal at geomage dot com  2008-02-07 10:36 ---
Yes, the loop is vectorized. What do you mean by memory bound? Don't you think
that vectorization can help here? I see around a 20% performance gain in the
real application.

Below is the compiler output:
Eyal.cpp:34: note: dependence distance  = 0.
Eyal.cpp:34: note: accesses have the same alignment.
Eyal.cpp:34: note: dependence distance modulo vf == 0 between *D.22353_81 and
*D.22353_81
Eyal.cpp:34: note: versioning for alias required: can't determine dependence
between *D.22353_81 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22353_81 and
*D.22365_101
Eyal.cpp:34: note: versioning for alias required: can't determine dependence
between *D.22355_85 and *D.22353_81
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22355_85 and
*D.22353_81
Eyal.cpp:34: note: versioning for alias required: can't determine dependence
between *D.22355_85 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22355_85 and
*D.22365_101
Eyal.cpp:34: note: versioning for alias required: can't determine dependence
between *D.22361_92 and *D.22353_81
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22361_92 and
*D.22353_81
Eyal.cpp:34: note: versioning for alias required: can't determine dependence
between *D.22361_92 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22361_92 and
*D.22365_101
Eyal.cpp:34: note: versioning for alias required: can't determine dependence
between *D.22353_81 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22353_81 and
*D.22365_101
Eyal.cpp:34: note: versioning for alias required: can't determine dependence
between *D.22353_81 and *D.22367_105
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22353_81 and
*D.22367_105
Eyal.cpp:34: note: versioning for alias required: can't determine dependence
between *D.22353_81 and *D.22371_112
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22353_81 and
*D.22371_112
Eyal.cpp:34: note: versioning for alias required: can't determine dependence
between *D.22353_81 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22353_81 and
*D.22365_101
Eyal.cpp:34: note: dependence distance  = 0.
Eyal.cpp:34: note: accesses have the same alignment.
Eyal.cpp:34: note: dependence distance modulo vf == 0 between *D.22365_101 and
*D.22365_101
Eyal.cpp:34: note: versioning for alias required: can't determine dependence
between *D.22367_105 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22367_105 and
*D.22365_101
Eyal.cpp:34: note: versioning for alias required: can't determine dependence
between *D.22371_112 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22371_112 and
*D.22365_101
Eyal.cpp:34: note: found equal ranges *D.22353_81, *D.22365_101 and
*D.22353_81, *D.22365_101
Eyal.cpp:34: note: found equal ranges *D.22353_81, *D.22365_101 and
*D.22353_81, *D.22365_101
Eyal.cpp:34: note: === vect_analyze_slp ===
Eyal.cpp:34: note: === vect_make_slp_decision ===
Eyal.cpp:34: note: === vect_detect_hybrid_slp ===
Eyal.cpp:34: note: Alignment of access forced using versioning.
Eyal.cpp:34: note: Alignment of access forced using versioning.
Eyal.cpp:34: note: Vectorizing an unaligned access.
Eyal.cpp:34: note: Vectorizing an unaligned access.
Eyal.cpp:34: note: Vectorizing an unaligned access.
Eyal.cpp:34: note: Vectorizing an unaligned access.
Eyal.cpp:34: note: Vectorizing an unaligned access.
Eyal.cpp:34: note: Vectorizing an unaligned access.
Eyal.cpp:34: note: === vect_update_slp_costs_according_to_vf
===(analyze_scalar_evolution 
Eyal.cpp:34: note: create runtime check for data references *D.22353_81 and
*D.22365_101
Eyal.cpp:34: note: create runtime check for data references *D.22355_85 and
*D.22353_81
Eyal.cpp:34: note: create runtime check for data references *D.22355_85 and
*D.22365_101
Eyal.cpp:34: note: create runtime check for data references *D.22361_92 and
*D.22353_81
Eyal.cpp:34: note: create runtime check for data references *D.22361_92 and
*D.22365_101
Eyal.cpp:34: note: create runtime check for data references *D.22353_81 and
*D.22367_105
Eyal.cpp:34: note: create runtime check for data references *D.22353_81 and
*D.22371_112
Eyal.cpp:34: note: create runtime check for data references *D.22367_105 and
*D.22365_101
Eyal.cpp:34: note: create runtime check for data references *D.22371_112 and
*D.22365_101
Eyal.cpp:34: note: created 9 versioning for alias checks.
Eyal.cpp:34: note: LOOP VECTORIZED.(get_loop_exit_condition 





[Bug c++/35117] Vectorization on power PC

2008-02-07 Thread eyal at geomage dot com


--- Comment #5 from eyal at geomage dot com  2008-02-07 10:43 ---
(In reply to comment #3)
> I think this is a dup of another bug I filed with respect of the builtin
> operator new that getting the malloc attribute.

Are you referring to using malloc instead of new?
Using malloc didn't make any difference performance-wise.





[Bug c++/35117] Vectorization on power PC

2008-02-07 Thread eyal at geomage dot com


--- Comment #7 from eyal at geomage dot com  2008-02-07 11:06 ---
(In reply to comment #6)
> (In reply to comment #2)
> > Yes the loop is vectorized. 
> ...
> > Eyal.cpp:34: note: created 9 versioning for alias checks.
> > Eyal.cpp:34: note: LOOP VECTORIZED.(get_loop_exit_condition 
> The vectorizer created runtime checks to verify that there is no data
> dependence in the loop, i.e., if the data references do alias, the vector
> version is skipped and the scalar version of the loop is performed.

Hi,
 That is what I suspected. Is there any way I can identify from the log what
causes those runtime checks and resolve it in the code, so I can be 100% sure
that the code is fully vectorized?

thanks
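
For illustration, the run-time check described in the quoted reply can be
written out by hand. This is only a sketch of the idea, not the code GCC
actually generates; ranges_overlap and add_versioned are hypothetical names,
and the 4-wide loop merely stands in for the vector version.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if [a, a+n) and [b, b+n) share any byte.
static bool ranges_overlap(const float *a, const float *b, std::size_t n)
{
    const std::uintptr_t pa = reinterpret_cast<std::uintptr_t>(a);
    const std::uintptr_t pb = reinterpret_cast<std::uintptr_t>(b);
    const std::uintptr_t bytes = n * sizeof(float);
    return pa < pb + bytes && pb < pa + bytes;
}

// Returns true when the fast path was taken, false when the scalar
// fallback ran because the destination may alias one of the sources.
bool add_versioned(float *dst, const float *src1, const float *src2,
                   std::size_t n)
{
    if (!ranges_overlap(dst, src1, n) && !ranges_overlap(dst, src2, n)) {
        // Fast path: four elements per step, standing in for one
        // 4-float vector operation.
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            dst[i]     += src1[i]     + src2[i];
            dst[i + 1] += src1[i + 1] + src2[i + 1];
            dst[i + 2] += src1[i + 2] + src2[i + 2];
            dst[i + 3] += src1[i + 3] + src2[i + 3];
        }
        for (; i < n; ++i)              // leftover elements
            dst[i] += src1[i] + src2[i];
        return true;
    }
    // Possible aliasing: keep the original element-by-element order.
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src1[i] + src2[i];
    return false;
}
```

If the arrays really are distinct, the check passes and the fast path runs, so
the checks by themselves cost only a few comparisons before the loop starts.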





[Bug c++/35117] Vectorization on power PC

2008-02-07 Thread eyal at geomage dot com


--- Comment #8 from eyal at geomage dot com  2008-02-07 12:16 ---
Hi Ira,
  Here is the compiler output for the real code.
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references
*D.86651_134 and *D.8_160
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references
*D.86651_134 and *D.86669_168
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references
*D.86655_139 and *D.8_160
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references
*D.86655_139 and *D.86669_168
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references
*D.86658_145 and *D.8_160
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references
*D.86658_145 and *D.86669_168
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references
*D.86661_151 and *D.8_160
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references
*D.86661_151 and *D.86669_168
Crs/CEE_CRE_2DSearch.cpp:1285: note: created 8 versioning for alias checks.

I looked further in the output log and found the following:
D.8_160 = pTempSumPhase_Temp_cre_angle_27 + D.86665_159;
D.86669_168 = pTempSum2Phase_Temp_cre_angle_32 + D.86665_159;
D.86651_134 = pSum_78 + D.86650_133;
D.86655_139 = pSum_78 + D.86654_138;
D.86658_145 = pSum_G_106 + D.86650_133;
D.86661_151 = pSum_G_106 + D.86654_138;

D.86650_133 = D.86649_132 * 4
D.86649_132 = (long unsigned int) ittt_855;

D.86654_138 = D.86653_137 * 4;
D.86653_137 = (long unsigned int) ittt1_856;


It seems to complain about some relationship between
pTempSum2Phase_Temp_cre_angle_32, pTempSumPhase_Temp_cre_angle_27,
pSum_78 and pSum_G_106.
Those vectors have nothing in common in the code. How do I make the compiler
see that there is no relationship? Here's the C++ code:

void GCEE_CRE_2DSearch::Find( int i_rCee )
{
    float *pTempSumPhase_Temp_cre_angle =
        (float*) malloc (sizeof(float) * m_nSamples);
    float *pTempSum2Phase_Temp_cre_angle =
        (float*) malloc (sizeof(float) * m_nSamples);

    memset(pTempSumPhase_Temp_cre_angle, 0, sizeof(float) * m_nSamples);
    memset(pTempSum2Phase_Temp_cre_angle, 0, sizeof(float) * m_nSamples);

    float *pSum, *pSum_G;
    .
    .
    pSum   = m_hiSearchQueue[i_trace];
    pSum_G = m_hiSearchQueue[i_trace];
    .
    .
    for( int it = itBegin, ittt = itBegin + sample_int,
         ittt1 = itBegin + sample_int + 1; it < itEnd; it++, ittt++, ittt1++ )
    {
        float fSumValue = pSum[ ittt ] * w11;
        fSumValue += pSum[ ittt1 ] * w21;
        fSumValue += pSum_G[ ittt ] * w12;
        fSumValue += pSum_G[ ittt1 ] * w22;
        pTempSumPhase_Temp_cre_angle[ it ] += fSumValue;
        pTempSum2Phase_Temp_cre_angle[ it ] += fSumValue * fSumValue;
    }


Thanks
Eyal 





[Bug c++/35117] Vectorization on power PC

2008-02-07 Thread eyal at geomage dot com


--- Comment #10 from eyal at geomage dot com  2008-02-07 12:58 ---
(In reply to comment #9)
> (In reply to comment #8)
> > {
> > float *pTempSumPhase_Temp_cre_angle = (float*) malloc (sizeof(float)
> > *m_nSamples);
> > float *pTempSum2Phase_Temp_cre_angle = (float*) malloc 
> > (sizeof(float)
> > *m_nSamples);
> > 
> > memset(pTempSumPhase_Temp_cre_angle,0,sizeof(float)* m_nSamples);
> > memset(pTempSum2Phase_Temp_cre_angle,0,sizeof(float)* m_nSamples);
> Maybe the problem is that they escape (call to memset)...
> The alias analysis fails to distinguish between these two pointers and the
> vectorizer has to create runtime checks.

I've commented out the memset calls and still get the
"created 8 versioning for alias checks." message.

Is there some pragma or coding convention I can use to make the compiler
understand that those pointers have nothing to do with each other?





[Bug c++/35117] Vectorization on power PC

2008-02-07 Thread eyal at geomage dot com


--- Comment #12 from eyal at geomage dot com  2008-02-07 13:07 ---
(In reply to comment #11)
> (In reply to comment #10)
> > Is there some pragma or a coding convention I can use to make the compiler
> > understand those pointers have nothing to do with each other?
> There is __restrict__, but it is useful only for function arguments. 

Ira, any suggestions as to how to solve this issue? I'd really appreciate any
help here, as I'm lost and we're close to giving up on PPC and vectorization
altogether.

thanks
 eyal
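
For reference, a sketch of what using __restrict__ on function arguments could
look like for the loop from comment #8. AccumulatePhase is a hypothetical
name and the parameter list is an assumption based on the excerpt. Whether
this actually removes the run-time checks depends on the compiler version and
has to be confirmed in the -ftree-vectorizer-verbose output.

```cpp
#include <cstddef>

// Hypothetical free function holding the inner loop from comment #8.
// The __restrict__ qualifiers are a promise by the caller that the two
// output arrays overlap neither each other nor the inputs. The inputs are
// only read, so they may share storage. Breaking the promise is undefined
// behaviour.
void AccumulatePhase(float *__restrict__ pTempSum,
                     float *__restrict__ pTempSum2,
                     const float *pSum,
                     const float *pSum_G,
                     int itBegin, int itEnd, int sample_int,
                     float w11, float w21, float w12, float w22)
{
    for (int it = itBegin; it < itEnd; ++it) {
        const int ittt = it + sample_int;   // same offset as in the excerpt
        float fSumValue = pSum[ittt] * w11;
        fSumValue += pSum[ittt + 1] * w21;  // ittt1 == ittt + 1
        fSumValue += pSum_G[ittt] * w12;
        fSumValue += pSum_G[ittt + 1] * w22;
        pTempSum[it] += fSumValue;
        pTempSum2[it] += fSumValue * fSumValue;
    }
}
```

The caller would keep the malloc and memset calls in Find() and pass the
resulting pointers in.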





[Bug c++/35117] Vectorization on power PC

2008-02-08 Thread eyal at geomage dot com


--- Comment #17 from eyal at geomage dot com  2008-02-08 08:58 ---
> Using malloc instead of new does generate better code and improves performance
> slightly for me, admittedly not as much as we would like; the kernel becomes:
> (using only -O3 -S -m64 -maltivec)
> .L29:
> lvx 13,7,9
> lvx 12,3,9
> vperm 1,10,13,7
> vperm 11,9,12,8
> lvx 0,29,9
> vor 10,13,13
> vor 9,12,12
> vaddfp 1,1,11
> vaddfp 0,0,1
> stvx 0,29,9
> addi 9,9,16
> bdnz .L29
> which is as good as the vectorizer can get, iinm: peeling the loop to align
> the store (and the load from the same address), treating the other two loads
> as potentially unaligned.
> To further optimize this loop we would probably want to overlap the store with
> subsequent loads using -fmodulo-sched; perhaps the new export-ddg can help
> with that.

I was able to get about 20% more in one case with malloc.
I was expecting something like 2-4 times faster when vectorization is
enabled.





[Bug c++/35117] Vectorization on power PC

2008-02-08 Thread eyal at geomage dot com


--- Comment #16 from eyal at geomage dot com  2008-02-08 08:55 ---
Thanks a lot Ira, I appreciate it.
If you need the full test code with the .vect file and makefiles, please let me
know.
thanks,
eyal





[Bug c++/35117] Vectorization on power PC

2008-02-09 Thread eyal at geomage dot com


--- Comment #19 from eyal at geomage dot com  2008-02-10 07:42 ---
Hi,  
  This is the simplest test I have.

#include <stdlib.h>

typedef float ARRTYPE;

int main ( int argc, char *argv[] )
{
    int m_nSamples = atoi( argv[1] );
    int itBegin = atoi( argv[2] );
    int itEnd = atoi( argv[3] );
    int iSizeMain = atoi( argv[ 4 ] );
    ARRTYPE *pSum1 = new ARRTYPE[ 10 ];
    ARRTYPE *pSum = new ARRTYPE[ 10 ];
    for ( int it = 0; it < m_nSamples; it++ )
    {
        pSum[ it ] = it / itBegin;
        pSum1[ it ] = itBegin / ( it + 1 );
    }
    ARRTYPE *pVec1 = (ARRTYPE*) malloc (sizeof(ARRTYPE) *m_nSamples);
    ARRTYPE *pVec2 = (ARRTYPE*) malloc (sizeof(ARRTYPE) *m_nSamples);
    for ( int i = 0; i < m_nSamples - 5; i++ )
    {
        for( int it = itBegin; it < itEnd; it++ )
            pVec1[ it ] += pSum[ it ] + pSum1[ it ];
    }
    free( pVec1 );
    free( pVec2 );
}

// Test - Vectorized binary, TestNoVec - Non vectorized binary
time ./Test 9 1 89900 1
real    0m23.273s

time ./TestNoVec 9 1 89900 1
real    0m24.344s


This is the compiler output I found relevant, please let me know if you need
more information.

Test.cpp:24: note: dependence distance modulo vf == 0 between *D.22310_50 and
*D.22310_50
Test.cpp:24: note: versioning for alias required: can't determine dependence
between *D.22312_54 and *D.22310_50
Test.cpp:24: note: mark for run-time aliasing test between *D.22312_54 and
*D.22310_50
Test.cpp:24: note: versioning for alias required: can't determine dependence
between *D.22314_58 and *D.22310_50
Test.cpp:24: note: mark for run-time aliasing test between *D.22314_58 and
*D.22310_50
Test.cpp:24: note: create runtime check for data references *D.22312_54 and
*D.22310_50
Test.cpp:24: note: create runtime check for data references *D.22314_58 and
*D.22310_50
Test.cpp:24: note: created 2 versioning for alias checks.
Test.cpp:24: note: LOOP VECTORIZED.(get_loop_exit_condition


D.22310_50 = pVec1_37 + D.22309_49;
D.22312_54 = pSum_20 + D.22309_49;
D.22314_58 = pSum1_18 + D.22309_49;
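
A side note on the test as posted in this comment: pSum and pSum1 are
allocated with 10 elements but written up to m_nSamples, and pVec1 is
accumulated into without being initialised. A sketch with those two points
changed, assuming the arrays were meant to hold m_nSamples elements. RunTest
is a hypothetical name, and the integer divisions are kept as in the original.

```cpp
#include <cstddef>
#include <vector>

typedef float ARRTYPE;

// Hypothetical rework of the test. Assumes nSamples >= 0, itBegin > 0 and
// itBegin <= itEnd <= nSamples. Returns the accumulated vector.
std::vector<ARRTYPE> RunTest(int nSamples, int itBegin, int itEnd)
{
    // Size the inputs by nSamples instead of a fixed 10.
    std::vector<ARRTYPE> sum(nSamples), sum1(nSamples);
    for (int it = 0; it < nSamples; ++it) {
        sum[it]  = it / itBegin;        // integer division, as in the original
        sum1[it] = itBegin / (it + 1);  // integer division, as in the original
    }

    // Value-initialised to 0, so "+=" starts from a defined value.
    std::vector<ARRTYPE> vec1(nSamples);

    ARRTYPE *pVec1 = vec1.data();
    const ARRTYPE *pSum = sum.data();
    const ARRTYPE *pSum1 = sum1.data();
    for (int i = 0; i < nSamples - 5; ++i)
        for (int it = itBegin; it < itEnd; ++it)
            pVec1[it] += pSum[it] + pSum1[it];
    return vec1;
}
```

Returning the vector also gives the caller something to print, which keeps the
compiler from discarding the loop as dead code.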





[Bug c++/35117] Vectorization on power PC

2008-02-09 Thread eyal at geomage dot com


--- Comment #20 from eyal at geomage dot com  2008-02-10 07:56 ---
Hi,
  I've tried putting the loop to be vectorized in a different method and the
compiler output looks better, but the performance is still the same as the
non-vectorized code.

#include <stdlib.h>
#include <iostream>

typedef float ARRTYPE;

void Calc( ARRTYPE *pSum, ARRTYPE *pSum1, ARRTYPE *pVec1, ARRTYPE *pVec2,
           int m_nSamples, int itBegin, int itEnd );

int main ( int argc, char *argv[] )
{
    int m_nSamples = atoi( argv[1] );
    int itBegin = atoi( argv[2] );
    int itEnd = atoi( argv[3] );
    int iSizeMain = atoi( argv[ 4 ] );
    ARRTYPE *pSum1 = new ARRTYPE[ 10 ];
    ARRTYPE *pSum = new ARRTYPE[ 10 ];
    for ( int it = 0; it < m_nSamples; it++ )
    {
        pSum[ it ] = it / itBegin;
        pSum1[ it ] = itBegin / ( it + 1 );
    }
    ARRTYPE *pVec1 = NULL, *pVec2 = NULL;
    Calc( pSum, pSum1, pVec1, pVec2, m_nSamples, itBegin, itEnd );
    std::cout << "pVec1[10]  = " << pVec1[ 10 ] << std::endl;
    std::cout << "pVec1[102]  = " << pVec1[ 102 ] << std::endl;
    free( pVec1 );
    free( pVec2 );
}

void Calc( ARRTYPE *pSum, ARRTYPE *pSum1, ARRTYPE *pVec1, ARRTYPE *pVec2,
           int m_nSamples, int itBegin, int itEnd )
{
    pVec1 = (ARRTYPE*) malloc (sizeof(ARRTYPE) *m_nSamples);
    pVec2 = (ARRTYPE*) malloc (sizeof(ARRTYPE) *m_nSamples);
    for ( int i = 0; i < m_nSamples - 5; i++ )
    {
        for( int it = itBegin; it < itEnd; it++ )
            pVec1[ it ] += pSum[ it ] + pSum1[ it ];
    }
}




Eyal.cpp:36: note: dependence distance  = 0.
Eyal.cpp:36: note: accesses have the same alignment.
Eyal.cpp:36: note: dependence distance modulo vf == 0 between *D.22348_22 and
*D.22348_22
Eyal.cpp:36: note: === vect_analyze_slp ===
Eyal.cpp:36: note: === vect_make_slp_decision ===
Eyal.cpp:36: note: === vect_detect_hybrid_slp ===(analyze_scalar_evolution 
  (loop_nb = 2)
  (scalar = it_60)
(get_scalar_evolution 
  (scalar = it_60)
  (scalar_evolution = {itBegin_14(D), +, 1}_2))
(set_scalar_evolution 
  (scalar = it_60)
  (scalar_evolution = {itBegin_14(D), +, 1}_2))
)
(instantiate_parameters 
  (loop_nb = 2)
  (chrec = {itBegin_14(D), +, 1}_2)
  (res = {itBegin_14(D), +, 1}_2))
(get_loop_exit_condition 
  if (itEnd_16(D) > it_36))

Eyal.cpp:36: note: Alignment of access forced using peeling.
Eyal.cpp:36: note: Vectorizing an unaligned access.
Eyal.cpp:36: note: Vectorizing an unaligned access.
Eyal.cpp:36: note: === vect_update_slp_costs_according_to_vf
===(analyze_scalar_evolution 
  (loop_nb = 2)
  (scalar = it_60)
(get_scalar_evolution 
  (scalar = it_60)
  (scalar_evolution = {itBegin_14(D), +, 1}_2))
(set_scalar_evolution 
  (scalar = it_60)
  (scalar_evolution = {itBegin_14(D), +, 1}_2))
)
(instantiate_parameters 
  (loop_nb = 2)
  (chrec = {itBegin_14(D), +, 1}_2)
  (res = {itBegin_14(D), +, 1}_2))
(get_loop_exit_condition 
  if (itEnd_16(D) > it_36))
(get_loop_exit_condition 
  if (itEnd_16(D) > it_36))
(get_loop_exit_condition 
  if (itEnd_16(D) > it_84))
(get_loop_exit_condition 
  if (ivtmp.267_92 < prolog_loop_niters.266_70))

loop at Eyal.cpp:37: if (ivtmp.267_92 <
prolog_loop_niters.266_70)(get_loop_exit_condition 
  if (itEnd_16(D) > it_36))
(analyze_scalar_evolution 
  (loop_nb = 2)
  (scalar = it_60)
(get_scalar_evolution 
  (scalar = it_60)
  (scalar_evolution = ))
(analyze_initial_condition 
  (loop_phi_node = 
it_60 = PHI )
  (init_cond = it_86))
(analyze_evolution_in_loop 
  (loop_phi_node = it_60 = PHI )
(add_to_evolution 
  (loop_nb = 2)
  (chrec_before = it_86)
  (to_add = 1)
  (res = {it_86, +, 1}_2))
  (evolution_function = {it_86, +, 1}_2))
(set_scalar_evolution 
  (scalar = it_60)
  (scalar_evolution = {it_86, +, 1}_2))
)
(get_loop_exit_condition 
  if (itEnd_16(D) > it_36))
(get_loop_exit_condition 
  if (ivtmp.329_211 < bnd.269_99))

loop at Eyal.cpp:37: if (ivtmp.329_211 < bnd.269_99)

Registering new PHI nodes in block #0



Registering new PHI nodes in block #2

Updating SSA information for statement D.22335_6 = malloc (D.22334_5);

Updating SSA information for statement malloc (D.22334_5);



Registering new PHI nodes in block #3



Registering new PHI nodes in block #9



Registering new PHI nodes in block #7



Registering new PHI nodes in block #8



Registering new PHI nodes in block #10



Registering new PHI nodes in block #14



Registering new PHI nodes in block #12

Updating SSA information for statement D.22349_76 = *D.22348_75;

Updating SSA information for statement *D.22348_75 = D.22355_82;



Registering new PHI nodes in block #13



Registering new PHI nodes in block #16



Registering new PHI nodes in block #15



Registering new PHI nodes in block #21



Registering new PHI nodes in block #22



Registering new PHI nodes in block #19

Updating SSA informatio

[Bug c++/35117] Vectorization on power PC

2008-02-10 Thread eyal at geomage dot com


--- Comment #21 from eyal at geomage dot com  2008-02-10 13:48 ---
(In reply to comment #14)
> Giving it another thought, this is not necessarily an alias analysis issue,
> even though it fails to tell that the pointers do not alias. Since in this
> case the pointers do differ, the runtime test should take the flow to the
> vectorized loop. Maybe the test is too strict. I'll look into this on Sunday.

Hi,
 Any update on this matter?

thanks
eyal






[Bug c++/35117] Vectorization on power PC

2008-02-10 Thread eyal at geomage dot com


--- Comment #23 from eyal at geomage dot com  2008-02-10 15:47 ---
(In reply to comment #22)
> 1. It looks like the vectorizer was enabled in both cases, since -O3 enables
> the vectorizer by default. You need to add -fno-tree-vectorize to disable it
> explicitly.
> 2. To get better results from the vectorized version I would recommend
> allocating arrays at 16-byte-aligned boundaries and letting the compiler know
> this. You can do it by static allocation of arrays:
>   float pSum1[64000] __attribute__ ((__aligned__(16)));
>   float pSum[64000] __attribute__ ((__aligned__(16)));
>   float pVec1[64000] __attribute__ ((__aligned__(16)));
> 3. It would be better if "itBegin" started from 0 and were known at compile
> time. This and [2] would allow the vectorizer to avoid realigning loads.
> 4. For some strange reason the run time of this test can vary significantly
> (up to 50%) from run to run. So be sure to run it several times.
> -- Victor.

Hi,
  Item 2 is problematic, as the data can vary a lot and I can't use static
arrays.  I'm also willing to pay a "reasonable" price for the extra alignment
work.
  Item 3: I can't make itBegin start from zero, since this is how the formula
we're using works. It's calculated every time and can vary in value.
  Item 4: I saw consistent results every time I ran it.
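
On item 2, when the array size is only known at run time, one option is to
request 16-byte alignment from the allocator instead of using static arrays.
A minimal sketch, assuming a POSIX system that provides posix_memalign;
AllocAligned16 and IsAligned16 are hypothetical helper names.

```cpp
#include <stdlib.h>   // posix_memalign, free (POSIX)
#include <stdint.h>
#include <stddef.h>

// Hypothetical helper: allocate n floats on a 16-byte boundary, the size of
// an AltiVec vector. Returns NULL on failure; release with free().
float *AllocAligned16(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(float)) != 0)
        return NULL;
    return static_cast<float *>(p);
}

// True if p sits on a 16-byte boundary.
bool IsAligned16(const void *p)
{
    return reinterpret_cast<uintptr_t>(p) % 16 == 0;
}
```

Note that this only aligns the start of each array. The compiler is not told
about it, and because the loop starts at a varying itBegin, the first accessed
element may still be misaligned, so peeling or realigning loads can still be
needed.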





[Bug c++/35117] Vectorization on power PC

2008-02-11 Thread eyal at geomage dot com


--- Comment #27 from eyal at geomage dot com  2008-02-11 14:00 ---
Hi,
  I am a bit lost and would appreciate your guidance. Up till now, after all
these emails, I still have no clue as to why such a simple test case doesn't
work. As far as I understood, vectorization should have made it 2 to 4 times
faster. With all the suggestions here I still didn't get more than a 20-30%
performance gain.
  I would appreciate it if someone from the vectorization team could come up
with a detailed explanation of how to make the vectorization do what's promised.

  As for the last email, Victor:
  1. Using a smaller number of iterations doesn't help me. This is not what the
real world code runs.
  2. new/malloc made almost no difference, maybe a gain of 20%.
  3. The difference between 1.738sec and 0.781sec can either be a 2 times
performance gain or simply a 1 second gain that would remain 1 second for more
intensive calculations. Therefore I can't rely on the test you did.





[Bug c++/35117] Vectorization on power PC

2008-02-12 Thread eyal at geomage dot com


--- Comment #30 from eyal at geomage dot com  2008-02-12 08:43 ---
Hi,
  Thanks a lot for the input about a potential memory bottleneck. I was indeed
under the impression that once I got the loop vectorized, I'd immediately see a
performance boost.
  I would appreciate, however, a further explanation of this issue.
  After all, this is a very simple test case. I still don't understand why there
is such a huge difference between running:
  time ./TestNoVec 92200 8 89720 1000
   real    0m23.549s

   time ./TestVec 92200 8 89720 1000
   real    0m22.845s

and running:
[EMAIL PROTECTED]:~> time ./mnovec 40 1 29720 1000

real    0m24.493s
user    0m24.483s
sys     0m0.007s
[EMAIL PROTECTED]:~> time ./mvec 40 1 29720 1000

real    0m10.777s
user    0m10.771s
sys     0m0.005s


I can't see from the code how the difference in those parameters affects
performance so much. I'd appreciate your assistance again.

thanks
eyal





[Bug c++/35117] Vectorization on power PC

2008-02-12 Thread eyal at geomage dot com


--- Comment #32 from eyal at geomage dot com  2008-02-12 11:28 ---
(In reply to comment #31)
> > I would appreciate, however, a further explanation of this issue.
> The explanation has to do with CPU architecture and is not related to
> compilers.  In case of a cache miss, a memory load or store takes tens of CPU
> cycles instead of the few cycles taken in case of a cache hit.
> When we run:
> time ./mvec 40 1 29720 1000
> the program performs 40 iterations of the outer loop and 29720 iterations of
> the inner loop. The inner loop performs 3 load accesses and one store access
> per iteration. Starting from the second iteration of the outer loop, all 29720
> elements of arrays pSum, pSum1 and pVec1 will be in the cache, and from
> this point all accesses will be cache hits. (I assume that the data cache is
> big enough to contain all 29720*3 elements.)
> Let's look at the slow run:
> % time ./TestVec 92200 8 89720 1000
> Here the program performs (89720-8) iterations of the inner loop, so in order
> to have cache hits most of the time we need the cache to hold at least 89712*3
> elements.  Let's consider what will happen if the cache size is only half of
> the required amount.  After completion of the first iteration of the outer
> loop, the cache will be filled with the second half of the data from the
> arrays.  At the start of the second iteration of the outer loop, all elements
> from the first half will have been evicted from the cache, as most caches use
> an LRU policy to choose evicted elements.  Considering that the PPC970 is an
> out-of-order, multiple-issue architecture, we can guess why the CPU has enough
> time to perform the arithmetic operations even in scalar manner without adding
> any overhead relative to the vectorized version of the inner loop.


Thanks a lot for the detailed explanation, Victor. I'll try to see if I can
restructure the real code to be more memory friendly.
Again, thanks a lot guys.

eyal
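
To put rough numbers on the quoted explanation, the working set of one pass
over the inner loop can be estimated as three float arrays over
[itBegin, itEnd). A small sketch; WorkingSetBytes and FitsInCache are
hypothetical names, and the cache size is an input the caller has to supply,
since the thread does not state it for this machine.

```cpp
#include <cstddef>

// Bytes touched by one pass of the inner loop: three float arrays
// (pSum, pSum1, pVec1) over the range [itBegin, itEnd).
std::size_t WorkingSetBytes(int itBegin, int itEnd)
{
    const std::size_t elems = static_cast<std::size_t>(itEnd - itBegin);
    return 3 * elems * sizeof(float);
}

// cacheBytes is an assumption supplied by the caller, not a value taken
// from the thread.
bool FitsInCache(int itBegin, int itEnd, std::size_t cacheBytes)
{
    return WorkingSetBytes(itBegin, itEnd) <= cacheBytes;
}
```

With 4-byte floats, the fast run (1 to 29720) touches about 357 KB per pass
and the slow run (8 to 89720) about 1.08 MB, so with a cache of, say, 512 KiB
(an assumed figure) only the fast run would stay resident between outer
iterations.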





[Bug c++/35117] Vectorization on power PC

2008-02-13 Thread eyal at geomage dot com


--- Comment #33 from eyal at geomage dot com  2008-02-13 16:06 ---
Hi All,
  I've made some changes that hopefully prevent the memory from being a
performance bottleneck. I see a perf gain of ~10%. However, the compiler still
gives me the notes from comment #19:
Test.cpp:24: note: versioning for alias required: can't determine dependence
between *D.22312_54 and *D.22310_50
Test.cpp:24: note: mark for run-time aliasing test between *D.22312_54 and
*D.22310_50
Test.cpp:24: note: versioning for alias required: can't determine dependence
between *D.22314_58 and *D.22310_50
Test.cpp:24: note: mark for run-time aliasing test between *D.22314_58 and
*D.22310_50
Test.cpp:24: note: create runtime check for data references *D.22312_54 and
*D.22310_50
Test.cpp:24: note: create runtime check for data references *D.22314_58 and
*D.22310_50
Test.cpp:24: note: created 2 versioning for alias checks.
Test.cpp:24: note: LOOP VECTORIZED.(get_loop_exit_condition


How do I resolve those issues? Could they prevent the vectorized code from
running, and is that why I don't see a bigger performance improvement?
I'd appreciate any assistance...

thanks
eyal
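
For readers following along, one common way to make a loop nest like the test
kernel more cache friendly is blocking: finish all outer iterations for one
chunk of the arrays before moving to the next chunk. This is a sketch of that
general technique, not the change actually made here. AccumulateBlocked and
blockElems are hypothetical names, and the reordering is only valid because
the outer iterations of the test are independent of each other.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical blocked variant of the test kernel. Instead of sweeping the
// whole [itBegin, itEnd) range on every outer iteration, all outer
// iterations are applied to one block before moving on, so the block can
// stay in cache. blockElems (elements per block) is a tuning assumption.
void AccumulateBlocked(float *pVec1, const float *pSum, const float *pSum1,
                       int itBegin, int itEnd, int outerIters, int blockElems)
{
    if (blockElems < 1)
        blockElems = 1;                 // guard against a non-advancing loop
    for (int blockBegin = itBegin; blockBegin < itEnd;
         blockBegin += blockElems) {
        const int blockEnd = std::min(blockBegin + blockElems, itEnd);
        for (int i = 0; i < outerIters; ++i)
            for (int it = blockBegin; it < blockEnd; ++it)
                pVec1[it] += pSum[it] + pSum1[it];
    }
}
```

Each element still receives the same sequence of additions as in the original
loop order, so the results match. A block size that keeps three blocks within
the data cache would be the starting point for tuning.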

