[Bug c++/35117] New: Vectorization on power PC
Hello,

I am unable to see the expected performance gain from vectorization on PowerPC under SUSE Linux. I prepared a simple test and compiled it once with the vectorization flags and once without them. I'd appreciate it if someone could point out what I'm doing wrong here. Below are the results of the test runs:

time ./TestNoVec 92200 8 89720 1000
real    0m23.549s

time ./TestVec 92200 8 89720 1000
real    0m22.845s

Here is the code:

#include <cstdio>
#include <cstdlib>
#include <cstring>

typedef float ARRTYPE;

int main( int argc, char *argv[] )
{
    int m_nSamples = atoi( argv[1] );
    int itBegin    = atoi( argv[2] );
    int itEnd      = atoi( argv[3] );
    int iSizeMain  = atoi( argv[ 4 ] );

    ARRTYPE *pSum1 = new ARRTYPE[ 10 ];
    ARRTYPE *pSum  = new ARRTYPE[ 10 ];
    for ( int it = 0; it < m_nSamples; it++ )
    {
        pSum[ it ]  = it / itBegin;
        pSum1[ it ] = itBegin / ( it + 1 );
    }

    ARRTYPE *pVec1 = (ARRTYPE*) malloc( sizeof(ARRTYPE) * m_nSamples );
    ARRTYPE *pVec2 = (ARRTYPE*) malloc( sizeof(ARRTYPE) * m_nSamples );

    for ( int i = 0, j = 0; i < m_nSamples - 5; i++ )
    {
        for ( int it = itBegin; it < itEnd; it++ )
            pVec1[ it ] += pSum[ it ] + pSum1[ it ];
    }

    free( pVec1 );
    free( pVec2 );
}

Compilation flags for the non-vectorized build:

gcc -DTIXML_USE_STL -I /home/build/build -I /home/build/build -I. -I /usr/local/include -I /usr/include -O3 -fomit-frame-pointer -mtune=powerpc -falign-functions=16 -fprefetch-loop-arrays -fpeel-loops -funswitch-loops -fPIC -mcpu=powerpc -m64 -fargument-noalias -funroll-loops -ftree-vectorizer-verbose=7 -fdump-tree-vect-details -c -o Test.o Test.cpp

gcc -lpthread -lz -lm -lstdc++ -DTIXML_USE_STL -I /home/build/build -I /home/build/build -I. -I /usr/local/include -I /usr/include -O3 -fomit-frame-pointer -mtune=powerpc -falign-functions=16 -fprefetch-loop-arrays -fpeel-loops -funswitch-loops -fPIC -mcpu=powerpc -m64 -fargument-noalias -funroll-loops -ftree-vectorizer-verbose=7 -fdump-tree-vect-details -L/usr/local/lib64 -DTIXML_USE_STL -pthread -L. -L /home/build/build/lib64 -L /home/build/build/lib64 -L /usr/lib64 -L /lib64 -L /opt/gnome/lib64 -o TestNoVec Test.o

Compilation flags for the vectorized build:

gcc -DTIXML_USE_STL -I /home/build/build -I /home/build/build -I. -I /usr/local/include -I /usr/include -O3 -fomit-frame-pointer -mtune=powerpc -falign-functions=16 -fprefetch-loop-arrays -fpeel-loops -funswitch-loops -ftree-vectorize -fPIC -mcpu=powerpc -maltivec -mabi=altivec -m64 -fargument-noalias -funroll-loops -ftree-vectorizer-verbose=7 -fdump-tree-vect-details -c -o Test.o Test.cpp

gcc -lpthread -lz -lm -lstdc++ -DTIXML_USE_STL -I /home/build/build -I /home/build/build -I. -I /usr/local/include -I /usr/include -O3 -fomit-frame-pointer -mtune=powerpc -falign-functions=16 -fprefetch-loop-arrays -fpeel-loops -funswitch-loops -ftree-vectorize -fPIC -mcpu=powerpc -maltivec -mabi=altivec -m64 -fargument-noalias -funroll-loops -ftree-vectorizer-verbose=7 -fdump-tree-vect-details -L/usr/local/lib64 -DTIXML_USE_STL -pthread -L. -L /home/build/build/lib64 -L /home/build/build/lib64 -L /usr/lib64 -L /lib64 -L /opt/gnome/lib64 -o TestVec Test.o

--
           Summary: Vectorization on power PC
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: c++
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: eyal at geomage dot com
 GCC build triplet: gcc (GCC) 4.3.0 20071124 (experimental)
  GCC host triplet: PowerPC
GCC target triplet: PowerPC

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
[Bug c++/35117] Vectorization on power PC
--- Comment #2 from eyal at geomage dot com 2008-02-07 10:36 ---
Yes, the loop is vectorized. What do you mean by memory bound? Don't you think
that vectorization can help here? I see around a 20% performance gain in the
real application. Below is the compiler output:

Eyal.cpp:34: note: dependence distance = 0.
Eyal.cpp:34: note: accesses have the same alignment.
Eyal.cpp:34: note: dependence distance modulo vf == 0 between *D.22353_81 and *D.22353_81
Eyal.cpp:34: note: versioning for alias required: can't determine dependence between *D.22353_81 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22353_81 and *D.22365_101
Eyal.cpp:34: note: versioning for alias required: can't determine dependence between *D.22355_85 and *D.22353_81
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22355_85 and *D.22353_81
Eyal.cpp:34: note: versioning for alias required: can't determine dependence between *D.22355_85 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22355_85 and *D.22365_101
Eyal.cpp:34: note: versioning for alias required: can't determine dependence between *D.22361_92 and *D.22353_81
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22361_92 and *D.22353_81
Eyal.cpp:34: note: versioning for alias required: can't determine dependence between *D.22361_92 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22361_92 and *D.22365_101
Eyal.cpp:34: note: versioning for alias required: can't determine dependence between *D.22353_81 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22353_81 and *D.22365_101
Eyal.cpp:34: note: versioning for alias required: can't determine dependence between *D.22353_81 and *D.22367_105
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22353_81 and *D.22367_105
Eyal.cpp:34: note: versioning for alias required: can't determine dependence between *D.22353_81 and *D.22371_112
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22353_81 and *D.22371_112
Eyal.cpp:34: note: versioning for alias required: can't determine dependence between *D.22353_81 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22353_81 and *D.22365_101
Eyal.cpp:34: note: dependence distance = 0.
Eyal.cpp:34: note: accesses have the same alignment.
Eyal.cpp:34: note: dependence distance modulo vf == 0 between *D.22365_101 and *D.22365_101
Eyal.cpp:34: note: versioning for alias required: can't determine dependence between *D.22367_105 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22367_105 and *D.22365_101
Eyal.cpp:34: note: versioning for alias required: can't determine dependence between *D.22371_112 and *D.22365_101
Eyal.cpp:34: note: mark for run-time aliasing test between *D.22371_112 and *D.22365_101
Eyal.cpp:34: note: found equal ranges *D.22353_81, *D.22365_101 and *D.22353_81, *D.22365_101
Eyal.cpp:34: note: found equal ranges *D.22353_81, *D.22365_101 and *D.22353_81, *D.22365_101
Eyal.cpp:34: note: === vect_analyze_slp ===
Eyal.cpp:34: note: === vect_make_slp_decision ===
Eyal.cpp:34: note: === vect_detect_hybrid_slp ===
Eyal.cpp:34: note: Alignment of access forced using versioning.
Eyal.cpp:34: note: Alignment of access forced using versioning.
Eyal.cpp:34: note: Vectorizing an unaligned access.
Eyal.cpp:34: note: Vectorizing an unaligned access.
Eyal.cpp:34: note: Vectorizing an unaligned access.
Eyal.cpp:34: note: Vectorizing an unaligned access.
Eyal.cpp:34: note: Vectorizing an unaligned access.
Eyal.cpp:34: note: Vectorizing an unaligned access.
Eyal.cpp:34: note: === vect_update_slp_costs_according_to_vf ===(analyze_scalar_evolution
Eyal.cpp:34: note: create runtime check for data references *D.22353_81 and *D.22365_101
Eyal.cpp:34: note: create runtime check for data references *D.22355_85 and *D.22353_81
Eyal.cpp:34: note: create runtime check for data references *D.22355_85 and *D.22365_101
Eyal.cpp:34: note: create runtime check for data references *D.22361_92 and *D.22353_81
Eyal.cpp:34: note: create runtime check for data references *D.22361_92 and *D.22365_101
Eyal.cpp:34: note: create runtime check for data references *D.22353_81 and *D.22367_105
Eyal.cpp:34: note: create runtime check for data references *D.22353_81 and *D.22371_112
Eyal.cpp:34: note: create runtime check for data references *D.22367_105 and *D.22365_101
Eyal.cpp:34: note: create runtime check for data references *D.22371_112 and *D.22365_101
Eyal.cpp:34: note: created 9 versioning for alias checks.
Eyal.cpp:34: note: LOOP VECTORIZED.(get_loop_exit_condition

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
[Bug c++/35117] Vectorization on power PC
--- Comment #5 from eyal at geomage dot com 2008-02-07 10:43 ---
(In reply to comment #3)
> I think this is a dup of another bug I filed with respect of the builtin
> operator new that getting the malloc attribute.

Are you referring to using malloc instead of new? Using malloc didn't make any
difference performance-wise.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
[Bug c++/35117] Vectorization on power PC
--- Comment #7 from eyal at geomage dot com 2008-02-07 11:06 ---
(In reply to comment #6)
> (In reply to comment #2)
> > Yes the loop is vectorized.
> ...
> > Eyal.cpp:34: note: created 9 versioning for alias checks.
> > Eyal.cpp:34: note: LOOP VECTORIZED.(get_loop_exit_condition
>
> The vectorizer created runtime checks to verify that there is no data
> dependence in the loop, i.e., if the data references do alias, the vector
> version is skipped and the scalar version of the loop is performed.

Hi, that is what I suspected. Is there any way I can identify from the log what
causes those runtime checks, and resolve it in the code, so I can be 100% sure
that the code is fully vectorized?

Thanks

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
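To make the effect of those runtime checks concrete, here is a rough, hand-written sketch of what a loop versioned for alias conceptually looks like for the simple kernel from the original test case. This is not GCC output; the function name and the no_overlap condition are illustrative approximations of the address-range checks listed in the dump, not the exact code the compiler emits.

typedef float ARRTYPE;

// Illustration only: shape of the loop after versioning for alias.
// no_overlap stands in for the runtime checks the vectorizer inserts.
void kernel( ARRTYPE *pVec1, ARRTYPE *pSum, ARRTYPE *pSum1,
             int itBegin, int itEnd )
{
    bool no_overlap =
        ( pVec1 + itEnd <= pSum  || pSum  + itEnd <= pVec1 ) &&
        ( pVec1 + itEnd <= pSum1 || pSum1 + itEnd <= pVec1 );

    if ( no_overlap )
    {
        for ( int it = itBegin; it < itEnd; it++ )   // vector version
            pVec1[ it ] += pSum[ it ] + pSum1[ it ];
    }
    else
    {
        for ( int it = itBegin; it < itEnd; it++ )   // scalar fallback
            pVec1[ it ] += pSum[ it ] + pSum1[ it ];
    }
}

So even when the dump says LOOP VECTORIZED, the check itself costs a few instructions per call, and only the first branch actually runs AltiVec code.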
[Bug c++/35117] Vectorization on power PC
--- Comment #8 from eyal at geomage dot com 2008-02-07 12:16 ---
Hi Ira,

Here is the compiler output for the real code:

Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references *D.86651_134 and *D.8_160
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references *D.86651_134 and *D.86669_168
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references *D.86655_139 and *D.8_160
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references *D.86655_139 and *D.86669_168
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references *D.86658_145 and *D.8_160
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references *D.86658_145 and *D.86669_168
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references *D.86661_151 and *D.8_160
Crs/CEE_CRE_2DSearch.cpp:1285: note: create runtime check for data references *D.86661_151 and *D.86669_168
Crs/CEE_CRE_2DSearch.cpp:1285: note: created 8 versioning for alias checks.

I looked further in the output log and found the following:

D.8_160 = pTempSumPhase_Temp_cre_angle_27 + D.86665_159;
D.86669_168 = pTempSum2Phase_Temp_cre_angle_32 + D.86665_159;
D.86651_134 = pSum_78 + D.86650_133;
D.86655_139 = pSum_78 + D.86654_138;
D.86658_145 = pSum_G_106 + D.86650_133;
D.86661_151 = pSum_G_106 + D.86654_138;
D.86650_133 = D.86649_132 * 4
D.86649_132 = (long unsigned int) ittt_855;
D.86654_138 = D.86653_137 * 4;
D.86653_137 = (long unsigned int) ittt1_856;

It seems to complain about some relationship between
pTempSum2Phase_Temp_cre_angle_32, pTempSumPhase_Temp_cre_angle_27, pSum_78 and
pSum_G_106. Those vectors have nothing in common in the code. How do I make the
compiler see that there is no relationship?

Here's the C++ code:

void GCEE_CRE_2DSearch::Find( int i_rCee )
{
    float *pTempSumPhase_Temp_cre_angle  = (float*) malloc( sizeof(float) * m_nSamples );
    float *pTempSum2Phase_Temp_cre_angle = (float*) malloc( sizeof(float) * m_nSamples );

    memset( pTempSumPhase_Temp_cre_angle,  0, sizeof(float) * m_nSamples );
    memset( pTempSum2Phase_Temp_cre_angle, 0, sizeof(float) * m_nSamples );

    float *pSum, *pSum_G;
    .
    .
    pSum   = m_hiSearchQueue[ i_trace ];
    pSum_G = m_hiSearchQueue[ i_trace ];
    .
    .
    for( int it = itBegin, ittt = itBegin + sample_int, ittt1 = itBegin + sample_int + 1;
         it < itEnd;
         it++, ittt++, ittt1++ )
    {
        float fSumValue = pSum[ ittt ] * w11;
        fSumValue += pSum[ ittt1 ] * w21;
        fSumValue += pSum_G[ ittt ] * w12;
        fSumValue += pSum_G[ ittt1 ] * w22;

        pTempSumPhase_Temp_cre_angle[ it ]  += fSumValue;
        pTempSum2Phase_Temp_cre_angle[ it ] += fSumValue * fSumValue;
    }

Thanks,
Eyal

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
[Bug c++/35117] Vectorization on power PC
--- Comment #10 from eyal at geomage dot com 2008-02-07 12:58 ---
(In reply to comment #9)
> (In reply to comment #8)
> > {
> >     float *pTempSumPhase_Temp_cre_angle  = (float*) malloc( sizeof(float) * m_nSamples );
> >     float *pTempSum2Phase_Temp_cre_angle = (float*) malloc( sizeof(float) * m_nSamples );
> >
> >     memset( pTempSumPhase_Temp_cre_angle,  0, sizeof(float) * m_nSamples );
> >     memset( pTempSum2Phase_Temp_cre_angle, 0, sizeof(float) * m_nSamples );
>
> Maybe the problem is that they escape (call to memset)...
> The alias analysis fails to distinguish between these two pointers and the
> vectorizer has to create runtime checks.

I've commented out the memset calls and still get the "created 8 versioning for
alias checks." message. Is there some pragma or coding convention I can use to
make the compiler understand that those pointers have nothing to do with each
other?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
[Bug c++/35117] Vectorization on power PC
--- Comment #12 from eyal at geomage dot com 2008-02-07 13:07 ---
(In reply to comment #11)
> (In reply to comment #10)
> > Is there some pragma or coding convention I can use to make the compiler
> > understand that those pointers have nothing to do with each other?
>
> There is __restrict__, but it is useful only for function arguments.

Ira, any suggestions as to how to solve this issue? I'd really appreciate any
help here, as I'm lost and we're close to giving up on PPC and vectorization
altogether.

Thanks,
Eyal

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
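Following up on the __restrict__ hint from comment #11, a minimal sketch of how the hot loop could be moved into a helper whose pointer parameters are __restrict__-qualified. The function name AccumulateSums and its signature are made up for illustration; __restrict__ is a promise by the caller that the arrays never overlap (violating it is undefined behaviour), and whether the GCC 4.3 vectorizer actually drops the runtime alias checks as a result still has to be confirmed from the -ftree-vectorizer-verbose output.

typedef float ARRTYPE;

// Sketch: __restrict__ on the parameters tells GCC the three arrays
// do not overlap, which is what the runtime checks try to establish.
void AccumulateSums( ARRTYPE *__restrict__ pVec1,
                     const ARRTYPE *__restrict__ pSum,
                     const ARRTYPE *__restrict__ pSum1,
                     int itBegin, int itEnd )
{
    for ( int it = itBegin; it < itEnd; it++ )
        pVec1[ it ] += pSum[ it ] + pSum1[ it ];
}

// Hypothetical call site, mirroring the test case:
//     AccumulateSums( pVec1, pSum, pSum1, itBegin, itEnd );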
[Bug c++/35117] Vectorization on power PC
--- Comment #17 from eyal at geomage dot com 2008-02-08 08:58 ---
> Using malloc instead of new does generate better code and improves
> performance slightly for me, admittedly not as much as we would like; the
> kernel becomes (using only -O3 -S -m64 -maltivec):
>
> .L29:
>         lvx 13,7,9
>         lvx 12,3,9
>         vperm 1,10,13,7
>         vperm 11,9,12,8
>         lvx 0,29,9
>         vor 10,13,13
>         vor 9,12,12
>         vaddfp 1,1,11
>         vaddfp 0,0,1
>         stvx 0,29,9
>         addi 9,9,16
>         bdnz .L29
>
> which is as good as the vectorizer can get, iinm: peeling the loop to align
> the store (and the load from the same address), treating the other two loads
> as potentially unaligned.
> To further optimize this loop we would probably want to overlap the store
> with subsequent loads using -fmodulo-sched; perhaps the new export-ddg can
> help with that.

I was able to get about 20% more in one case with malloc. I was expecting
something like 2-4 times faster with vectorization enabled.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
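One way to try the -fmodulo-sched suggestion quoted above is simply to add it to the vectorized build; the command line below is only a sketch (the -I/-L paths from the original build commands are omitted here, and whether modulo scheduling actually helps this kernel on PPC970 is untested):

gcc -O3 -m64 -mcpu=powerpc -maltivec -mabi=altivec -ftree-vectorize -fmodulo-sched -ftree-vectorizer-verbose=7 -S Test.cpp

Comparing the generated .s file with and without -fmodulo-sched would show whether the store in the kernel gets overlapped with the following loads.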
[Bug c++/35117] Vectorization on power PC
--- Comment #16 from eyal at geomage dot com 2008-02-08 08:55 ---
Thanks a lot, Ira, I appreciate it. If you need the full test code with the
.vect file and makefiles, please let me know.

Thanks,
Eyal

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
[Bug c++/35117] Vectorization on power PC
--- Comment #19 from eyal at geomage dot com 2008-02-10 07:42 ---
Hi,

This is the simplest test I have:

#include <cstdio>
#include <cstdlib>
#include <cstring>

typedef float ARRTYPE;

int main( int argc, char *argv[] )
{
    int m_nSamples = atoi( argv[1] );
    int itBegin    = atoi( argv[2] );
    int itEnd      = atoi( argv[3] );
    int iSizeMain  = atoi( argv[ 4 ] );

    ARRTYPE *pSum1 = new ARRTYPE[ 10 ];
    ARRTYPE *pSum  = new ARRTYPE[ 10 ];
    for ( int it = 0; it < m_nSamples; it++ )
    {
        pSum[ it ]  = it / itBegin;
        pSum1[ it ] = itBegin / ( it + 1 );
    }

    ARRTYPE *pVec1 = (ARRTYPE*) malloc( sizeof(ARRTYPE) * m_nSamples );
    ARRTYPE *pVec2 = (ARRTYPE*) malloc( sizeof(ARRTYPE) * m_nSamples );

    for ( int i = 0; i < m_nSamples - 5; i++ )
    {
        for ( int it = itBegin; it < itEnd; it++ )
            pVec1[ it ] += pSum[ it ] + pSum1[ it ];
    }

    free( pVec1 );
    free( pVec2 );
}

// Test - Vectorized binary, TestNoVec - Non vectorized binary

time ./Test 9 1 89900 1
real    0m23.273s

time ./TestNoVec 9 1 89900 1
real    0m24.344s

This is the compiler output I found relevant; please let me know if you need
more information:

Test.cpp:24: note: dependence distance modulo vf == 0 between *D.22310_50 and *D.22310_50
Test.cpp:24: note: versioning for alias required: can't determine dependence between *D.22312_54 and *D.22310_50
Test.cpp:24: note: mark for run-time aliasing test between *D.22312_54 and *D.22310_50
Test.cpp:24: note: versioning for alias required: can't determine dependence between *D.22314_58 and *D.22310_50
Test.cpp:24: note: mark for run-time aliasing test between *D.22314_58 and *D.22310_50
Test.cpp:24: note: create runtime check for data references *D.22312_54 and *D.22310_50
Test.cpp:24: note: create runtime check for data references *D.22314_58 and *D.22310_50
Test.cpp:24: note: created 2 versioning for alias checks.
Test.cpp:24: note: LOOP VECTORIZED.(get_loop_exit_condition

D.22310_50 = pVec1_37 + D.22309_49;
D.22312_54 = pSum_20 + D.22309_49;
D.22314_58 = pSum1_18 + D.22309_49;

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
[Bug c++/35117] Vectorization on power PC
--- Comment #20 from eyal at geomage dot com 2008-02-10 07:56 ---
Hi,

I've tried putting the loop to be vectorized in a separate method. The compiler
output looks better, but the performance is still the same as that of the
non-vectorized code.

#include <iostream>
#include <cstdlib>
#include <cstring>

typedef float ARRTYPE;

void Calc( ARRTYPE *pSum, ARRTYPE *pSum1, ARRTYPE *pVec1, ARRTYPE *pVec2,
           int m_nSamples, int itBegin, int itEnd );

int main( int argc, char *argv[] )
{
    int m_nSamples = atoi( argv[1] );
    int itBegin    = atoi( argv[2] );
    int itEnd      = atoi( argv[3] );
    int iSizeMain  = atoi( argv[ 4 ] );

    ARRTYPE *pSum1 = new ARRTYPE[ 10 ];
    ARRTYPE *pSum  = new ARRTYPE[ 10 ];
    for ( int it = 0; it < m_nSamples; it++ )
    {
        pSum[ it ]  = it / itBegin;
        pSum1[ it ] = itBegin / ( it + 1 );
    }

    ARRTYPE *pVec1 = NULL, *pVec2 = NULL;
    Calc( pSum, pSum1, pVec1, pVec2, m_nSamples, itBegin, itEnd );

    std::cout << "pVec1[10] = " << pVec1[ 10 ] << std::endl;
    std::cout << "pVec1[102] = " << pVec1[ 102 ] << std::endl;

    free( pVec1 );
    free( pVec2 );
}

void Calc( ARRTYPE *pSum, ARRTYPE *pSum1, ARRTYPE *pVec1, ARRTYPE *pVec2,
           int m_nSamples, int itBegin, int itEnd )
{
    pVec1 = (ARRTYPE*) malloc( sizeof(ARRTYPE) * m_nSamples );
    pVec2 = (ARRTYPE*) malloc( sizeof(ARRTYPE) * m_nSamples );

    for ( int i = 0; i < m_nSamples - 5; i++ )
    {
        for ( int it = itBegin; it < itEnd; it++ )
            pVec1[ it ] += pSum[ it ] + pSum1[ it ];
    }
}

Eyal.cpp:36: note: dependence distance = 0.
Eyal.cpp:36: note: accesses have the same alignment.
Eyal.cpp:36: note: dependence distance modulo vf == 0 between *D.22348_22 and *D.22348_22
Eyal.cpp:36: note: === vect_analyze_slp ===
Eyal.cpp:36: note: === vect_make_slp_decision ===
Eyal.cpp:36: note: === vect_detect_hybrid_slp ===
(analyze_scalar_evolution (loop_nb = 2) (scalar = it_60) (get_scalar_evolution (scalar = it_60) (scalar_evolution = {itBegin_14(D), +, 1}_2)) (set_scalar_evolution (scalar = it_60) (scalar_evolution = {itBegin_14(D), +, 1}_2)) )
(instantiate_parameters (loop_nb = 2) (chrec = {itBegin_14(D), +, 1}_2) (res = {itBegin_14(D), +, 1}_2))
(get_loop_exit_condition if (itEnd_16(D) > it_36))
Eyal.cpp:36: note: Alignment of access forced using peeling.
Eyal.cpp:36: note: Vectorizing an unaligned access.
Eyal.cpp:36: note: Vectorizing an unaligned access.
Eyal.cpp:36: note: === vect_update_slp_costs_according_to_vf ===
(analyze_scalar_evolution (loop_nb = 2) (scalar = it_60) (get_scalar_evolution (scalar = it_60) (scalar_evolution = {itBegin_14(D), +, 1}_2)) (set_scalar_evolution (scalar = it_60) (scalar_evolution = {itBegin_14(D), +, 1}_2)) )
(instantiate_parameters (loop_nb = 2) (chrec = {itBegin_14(D), +, 1}_2) (res = {itBegin_14(D), +, 1}_2))
(get_loop_exit_condition if (itEnd_16(D) > it_36))
(get_loop_exit_condition if (itEnd_16(D) > it_36))
(get_loop_exit_condition if (itEnd_16(D) > it_84))
(get_loop_exit_condition if (ivtmp.267_92 < prolog_loop_niters.266_70))
loop at Eyal.cpp:37: if (ivtmp.267_92 < prolog_loop_niters.266_70)
(get_loop_exit_condition if (itEnd_16(D) > it_36))
(analyze_scalar_evolution (loop_nb = 2) (scalar = it_60) (get_scalar_evolution (scalar = it_60) (scalar_evolution = )) (analyze_initial_condition (loop_phi_node = it_60 = PHI ) (init_cond = it_86)) (analyze_evolution_in_loop (loop_phi_node = it_60 = PHI ) (add_to_evolution (loop_nb = 2) (chrec_before = it_86) (to_add = 1) (res = {it_86, +, 1}_2)) (evolution_function = {it_86, +, 1}_2)) (set_scalar_evolution (scalar = it_60) (scalar_evolution = {it_86, +, 1}_2)) )
(get_loop_exit_condition if (itEnd_16(D) > it_36))
(get_loop_exit_condition if (ivtmp.329_211 < bnd.269_99))
loop at Eyal.cpp:37: if (ivtmp.329_211 < bnd.269_99)
Registering new PHI nodes in block #0
Registering new PHI nodes in block #2
Updating SSA information for statement D.22335_6 = malloc (D.22334_5);
Updating SSA information for statement malloc (D.22334_5);
Registering new PHI nodes in block #3
Registering new PHI nodes in block #9
Registering new PHI nodes in block #7
Registering new PHI nodes in block #8
Registering new PHI nodes in block #10
Registering new PHI nodes in block #14
Registering new PHI nodes in block #12
Updating SSA information for statement D.22349_76 = *D.22348_75;
Updating SSA information for statement *D.22348_75 = D.22355_82;
Registering new PHI nodes in block #13
Registering new PHI nodes in block #16
Registering new PHI nodes in block #15
Registering new PHI nodes in block #21
Registering new PHI nodes in block #22
Registering new PHI nodes in block #19
Updating SSA informatio
[Bug c++/35117] Vectorization on power PC
--- Comment #21 from eyal at geomage dot com 2008-02-10 13:48 ---
(In reply to comment #14)
> Giving it another thought, this is not necessary an alias analysis issue, even
> that it fails to tell that the pointers not alias. Since in this case the
> pointers do differ, the runtime test should take the flow to the vectorized
> loop. Maybe the test is too strict. I'll look into this on Sunday.

Hi, any update on this matter?

Thanks,
Eyal

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
[Bug c++/35117] Vectorization on power PC
--- Comment #23 from eyal at geomage dot com 2008-02-10 15:47 ---
(In reply to comment #22)
> 1. It looks like vectorizer was enabled in both cases, since -O3 enables the
> vectorizer by the default. You need to add -fno-tree-vectorize to disable it
> explicitly.
> 2. To get better results from vectorized version I would recommend to allocate
> arrays at boundaries aligned to 16 byte and let to the compiler to know this.
> You can do it by static allocation of arrays:
> float pSum1[64000] __attribute__ ((__aligned__(16)));
> float pSum[64000] __attribute__ ((__aligned__(16)));
> float pVec1[64000] __attribute__ ((__aligned__(16)));
> 3. It would be better if "itBegin" will start from 0 and be known at compile
> time. This and [2] will allow to vectorizer to save realigning loads.
> 4. For some strange reason the run time of this test can vary significantly
> (up to 50%) from run to run. So be sure to run it several times.
> -- Victor.

Hi,

Item 2 is problematic, as the data can vary a lot and I can't use static
arrays. I am also willing to pay a "reasonable" price for the extra alignment
work.

Item 3: I can't make itBegin start from zero, since that is how the formula
we're using works; it is calculated every time and can vary in value.

Item 4: I saw consistent results every time I ran it.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
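If static arrays are not an option, one possible middle ground (a sketch only, not something suggested in the thread) is to keep dynamic allocation but request 16-byte alignment explicitly with posix_memalign. The helper name alloc_aligned16 is made up; whether GCC 4.3 can then drop the realignment code still depends on it being able to prove the alignment at the use site, so the -ftree-vectorizer-verbose output would need to be rechecked.

#include <cstdlib>

typedef float ARRTYPE;

// Sketch: allocate n floats on a 16-byte boundary (the AltiVec vector size).
// posix_memalign is POSIX and available on Linux; it returns 0 on success.
static ARRTYPE *alloc_aligned16( size_t n )
{
    void *p = 0;
    if ( posix_memalign( &p, 16, n * sizeof(ARRTYPE) ) != 0 )
        return 0;
    return static_cast<ARRTYPE*>( p );
}

// Hypothetical usage, mirroring the test case:
//     ARRTYPE *pVec1 = alloc_aligned16( m_nSamples );
//     ...
//     free( pVec1 );   // memory from posix_memalign is released with free()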
[Bug c++/35117] Vectorization on power PC
--- Comment #27 from eyal at geomage dot com 2008-02-11 14:00 ---
Hi,

I am a bit lost and would appreciate your guidance. Up till now, after all
these emails, I still have no clue as to why such a simple test case doesn't
work. As far as I understood, vectorization should have made it between 2 and
4 times faster. With all the suggestions here I still didn't get more than a
20-30% performance gain. I would appreciate it if someone from the
vectorization team could give a detailed explanation of how to make
vectorization deliver what is promised.

As for the last email, Victor:
1. Using a smaller number of iterations doesn't help me; that is not what the
real-world code runs.
2. new vs. malloc made almost no difference, maybe a gain of 20%.
3. The difference between 1.738 s and 0.781 s can either be a 2x performance
gain or simply a 1-second gain that would remain 1 second for more intensive
calculations. Therefore I can't rely on the test you did.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
[Bug c++/35117] Vectorization on power PC
--- Comment #30 from eyal at geomage dot com 2008-02-12 08:43 ---
Hi,

Thanks a lot for the input about a potential memory bottleneck. I was indeed
under the impression that once I got the loop vectorized, I'd immediately see
a performance boost.

I would appreciate, however, a further explanation of this issue. After all,
this is a very simple test case. I still don't understand the huge difference
between this run:

time ./TestNoVec 92200 8 89720 1000
real    0m23.549s

time ./TestVec 92200 8 89720 1000
real    0m22.845s

and this run:

[EMAIL PROTECTED]:~> time ./mnovec 40 1 29720 1000
real    0m24.493s
user    0m24.483s
sys     0m0.007s

[EMAIL PROTECTED]:~> time ./mvec 40 1 29720 1000
real    0m10.777s
user    0m10.771s
sys     0m0.005s

I can't see from the code how that difference in parameters affects the
performance so much. I'd appreciate your assistance again.

Thanks,
Eyal

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
[Bug c++/35117] Vectorization on power PC
--- Comment #32 from eyal at geomage dot com 2008-02-12 11:28 ---
(In reply to comment #31)
> > I would appreciate, however, a further explanation of this issue.
>
> The explanation has to deal with CPU architecture and is not related to
> compilers. In case of cache miss the memory load and store take tens of cpu
> cycles instead of few cycles in case of cache hit.
> When we run:
> time ./mvec 40 1 29720 1000
> The program perform 40 iterations of outer loop and 29720 iterations in
> internal loop. The internal loop performs 3 load accesses and one store access
> per iteration. Starting from second iteration of outer loop, all 29720
> elements of arrays pSum, pSum1 and pVec1 will be placed into cache and from
> this point all accesses will be cache hits. (I assume that data cache is big
> enough to contain all 29720*3 elements).
> Lets look at the slow run:
> % time ./TestVec 92200 8 89720 1000
> Here the program perform (89720-8) iterations in internal loop, so in order to
> have cache hits most of the time we need the cache to be at least 89712*3 in
> size. Lets consider what will happen if cache size is only half of required
> amount. After completion of first iteration of the outer loop, the cache will
> be filled with second half of data from arrays. At start of second iteration
> of outer loop, all elements from first half will be evicted from the cache as
> most caches use LRU policy to choose evicted elements. Considering that PPC970
> is out-of-order, multiple-issue architecture we can guess why CPU have enough
> time to perform arithmetic operations even in scalar manner without adding any
> overhead relatively to vectorized version of internal loop.

Thanks a lot for the detailed explanation, Victor. I'll try to see if I can
restructure the real code to be more memory-friendly.

Again, thanks a lot, guys.
Eyal

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117
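A quick back-of-the-envelope check of the working-set sizes makes the cache argument above concrete. This is only illustrative arithmetic: the 512 KB L2 size is an assumption about a typical PPC970 part, not something stated in the thread.

#include <cstdio>

// Rough working-set sizes for the two runs discussed above
// (4-byte floats, 3 arrays touched per inner-loop iteration).
int main()
{
    const long fast = 29720L * 3 * sizeof(float);   // ./mvec    40    1 29720 1000
    const long slow = 89712L * 3 * sizeof(float);   // ./TestVec 92200 8 89720 1000
    std::printf( "fast run working set: %ld bytes (~%ld KB)\n", fast, fast / 1024 );
    std::printf( "slow run working set: %ld bytes (~%ld KB)\n", slow, slow / 1024 );
    // fast: ~348 KB, which fits in an assumed 512 KB L2, so accesses mostly hit
    // slow: ~1051 KB, which does not fit, so every outer iteration misses again
    return 0;
}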
[Bug c++/35117] Vectorization on power PC
--- Comment #33 from eyal at geomage dot com 2008-02-13 16:06 ---
Hi all,

I've made some changes that hopefully keep memory from being the performance
bottleneck. I now see a performance gain of ~10%. However, the compiler still
gives me the warnings from comment #19:

Test.cpp:24: note: versioning for alias required: can't determine dependence between *D.22312_54 and *D.22310_50
Test.cpp:24: note: mark for run-time aliasing test between *D.22312_54 and *D.22310_50
Test.cpp:24: note: versioning for alias required: can't determine dependence between *D.22314_58 and *D.22310_50
Test.cpp:24: note: mark for run-time aliasing test between *D.22314_58 and *D.22310_50
Test.cpp:24: note: create runtime check for data references *D.22312_54 and *D.22310_50
Test.cpp:24: note: create runtime check for data references *D.22314_58 and *D.22310_50
Test.cpp:24: note: created 2 versioning for alias checks.
Test.cpp:24: note: LOOP VECTORIZED.(get_loop_exit_condition

How do I resolve these issues, which might prevent the vectorized version of
the loop from running and so keep me from seeing a bigger performance
improvement? I'd appreciate any assistance.

Thanks,
Eyal

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117