On 10/30/14 23:36, Bin.Cheng wrote:
#2 would be the best solution for the case I was pondering, but I don't
think solving that case is terribly important given the processors for which
it was profitable haven't been made for a very long time.
I am wondering whether it's possible to introduce pattern-directed
fusion, something like define_fusion, and adapt the haifa scheduler for
it. I agree there are two kinds of fusion (related and unrelated insns),
and it's not trivial to support both in one scheme. Do you have a
specific example that I can try?
I kicked around using reorg to do stuff like this in the past
(combination of unrelated insns). But ultimately I think the way to go
is have it happen when insns are on the ready list in the scheduler.
For fusion of related insns like the load/store pairing, I think your
approach should work pretty well.
As to specific examples of independent insn fusion, the ones I'm most
familiar with are from the older PA chips. I wouldn't recommend
building something for those processors simply because they're so dated
that I don't believe anyone uses them anymore.
However, if you have cases (arm shift insns?), building for those is
fine. If you just want examples, the ones we tried to exploit on the PA
were fmpyadd/fmpysub, movb,tr and addb,tr.
fmpyadd/fmpysub combined an independent floating point multiply with an FP
add or sub insn. There are many conditions, but if you want a simple
example to play with, the attached file with -O2 -mschedule=7100LC ought
to generate one of these insns via pa_reorg.
addb,tr can combine an unconditional branch with a reg+reg or reg+imm5
addition operation. movb,tr combines an unconditional branch with a
reg-reg copy or load of a 5 bit immediate value into a general register.
I don't happen to have examples handy, but compiling integer code with
-O2 -mschedule=7100LC ought to trigger some.
The code in pa_reorg is O(n^2) or worse. It predates the hooks to allow
the target to reorder the ready queue. It would probably be relatively
easy to have that code run via those hooks and just look at the ready
queue. So it'd still be O(n^2), but the n would be *much* smaller. But
again, I don't think anyone has used PA7xxxx processors for over
a decade, so it hasn't seemed worth the effort to change.
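A minimal sketch of the ready-list idea, as a toy model in plain C. It does
not use GCC's scheduler hooks or RTL types: the toy_* names, the register
fields and the "front of the array issues first" convention are all
assumptions made for illustration, and the only fusion condition modelled is
"one multiply, one add/sub, no register overlap", which is far weaker than
the real fmpyadd/fmpysub conditions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction kinds; a real scheduler works on RTL insns. */
enum toy_kind { TOY_FMPY, TOY_FADD, TOY_FSUB, TOY_OTHER };

struct toy_insn {
    enum toy_kind kind;
    int dest;   /* destination register number */
    int src1;   /* first source register number */
    int src2;   /* second source register number */
};

/* True if neither insn reads or writes the other's destination. */
bool toy_independent(const struct toy_insn *a, const struct toy_insn *b)
{
    return a->dest != b->dest
        && a->dest != b->src1 && a->dest != b->src2
        && b->dest != a->src1 && b->dest != a->src2;
}

/* True if the pair is one multiply plus one add/sub, with no overlap. */
bool toy_fusible(const struct toy_insn *a, const struct toy_insn *b)
{
    bool mpy_addsub = a->kind == TOY_FMPY
        && (b->kind == TOY_FADD || b->kind == TOY_FSUB);
    bool addsub_mpy = b->kind == TOY_FMPY
        && (a->kind == TOY_FADD || a->kind == TOY_FSUB);
    return (mpy_addsub || addsub_mpy) && toy_independent(a, b);
}

void toy_swap(struct toy_insn *a, struct toy_insn *b)
{
    struct toy_insn tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Pairwise scan of the ready list only.  On the first fusible pair,
   move it to slots 0 and 1 and return true.  Still O(n^2), but n is
   the length of the ready list rather than of the whole function. */
bool toy_pair_ready(struct toy_insn *ready, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (toy_fusible(&ready[i], &ready[j])) {
                toy_swap(&ready[0], &ready[i]);
                toy_swap(&ready[1], &ready[j]);
                return true;
            }
    return false;
}
```

Calling toy_pair_ready on each cycle's ready list would leave a fusible pair
adjacent so that a later step could emit the combined insn; how that step
would look in a real backend is outside this sketch.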
Cheers,
Jeff
*> \brief \b CLARSCL2 performs reciprocal diagonal scaling on a vector.
*
* =========== DOCUMENTATION ===========
*
* Online html documentation available at
* http://www.netlib.org/lapack/explore-html/
*
*> \htmlonly
*> Download CLARSCL2 + dependencies
*> <a href="http://www.netlib.org/cgi-bin/netlibfiles.tgz?format=tgz&filename=/lapack/lapack_routine/clarscl2.f">
*> [TGZ]</a>
*> <a href="http://www.netlib.org/cgi-bin/netlibfiles.zip?format=zip&filename=/lapack/lapack_routine/clarscl2.f">
*> [ZIP]</a>
*> <a href="http://www.netlib.org/cgi-bin/netlibfiles.txt?format=txt&filename=/lapack/lapack_routine/clarscl2.f">
*> [TXT]</a>
*> \endhtmlonly
*
* Definition:
* ===========
*
* SUBROUTINE CLARSCL2 ( M, N, D, X, LDX )
*
* .. Scalar Arguments ..
* INTEGER M, N, LDX
* ..
* .. Array Arguments ..
* COMPLEX X( LDX, * )
* REAL D( * )
* ..
*
*
*> \par Purpose:
* =============
*>
*> \verbatim
*>
*> CLARSCL2 performs a reciprocal diagonal scaling on a vector:
*> x <-- inv(D) * x
*> where the REAL diagonal matrix D is stored as a vector.
*>
*> Eventually to be replaced by BLAS_cge_diag_scale in the new BLAS
*> standard.
*> \endverbatim
*
* Arguments:
* ==========
*
*> \param[in] M
*> \verbatim
*> M is INTEGER
*> The number of rows of D and X. M >= 0.
*> \endverbatim
*>
*> \param[in] N
*> \verbatim
*> N is INTEGER
*> The number of columns of X. N >= 0.
*> \endverbatim
*>
*> \param[in] D
*> \verbatim
*> D is REAL array, length M
*> Diagonal matrix D, stored as a vector of length M.
*> \endverbatim
*>
*> \param[in,out] X
*> \verbatim
*> X is COMPLEX array, dimension (LDX,N)
*> On entry, the vector X to be scaled by D.
*> On exit, the scaled vector.
*> \endverbatim
*>
*> \param[in] LDX
*> \verbatim
*> LDX is INTEGER
*> The leading dimension of the vector X. LDX >= M.
*> \endverbatim
*
* Authors:
* ========
*
*> \author Univ. of Tennessee
*> \author Univ. of California Berkeley
*> \author Univ. of Colorado Denver
*> \author NAG Ltd.
*
*> \date September 2012
*
*> \ingroup complexOTHERcomputational
*
* =====================================================================
      SUBROUTINE CLARSCL2 ( M, N, D, X, LDX )
*
* -- LAPACK computational routine (version 3.4.2) --
* -- LAPACK is a software package provided by Univ. of Tennessee, --
* -- Univ. of California Berkeley, Univ. of Colorado Denver and NAG Ltd..--
* September 2012
*
* .. Scalar Arguments ..
      INTEGER            M, N, LDX
* ..
* .. Array Arguments ..
      COMPLEX            X( LDX, * )
      REAL               D( * )
* ..
*
* =====================================================================
*
* .. Local Scalars ..
      INTEGER            I, J
* ..
* .. Executable Statements ..
*
      DO J = 1, N
         DO I = 1, M
            X( I, J ) = X( I, J ) / D( I )
         END DO
      END DO
      RETURN
      END
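
For comparison, the same loop nest can be sketched in C. This is an
illustrative equivalent only, not part of LAPACK: the name clarscl2_sketch is
hypothetical, indices are zero-based, and X is assumed to be stored
column-major so that element (i, j) lives at x[i + j*ldx], as in the Fortran
routine.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical C counterpart of the loop nest in CLARSCL2.
   Divides row i of the m-by-n column-major array x by d[i], in place.
   Rows m..ldx-1 of each column are padding and are left untouched. */
void clarscl2_sketch(int m, int n, const float *d, float complex *x, int ldx)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            x[(size_t)j * (size_t)ldx + (size_t)i] /= d[i];
}
```

As in the Fortran code, no check is made for zero entries in d, so a zero on
the diagonal produces infinities or NaNs in the corresponding row.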