Subversion access via http proxy

2005-10-21 Thread Andrew STUBBS

Hi,

I used to be able to access the svn.toolchain.org test repository 
through our http proxy (the firewall will not permit svn or even svn+ssh 
directly).


That repository had a published username and password for anonymous 
access. The instructions are still in the wiki, although you have to go 
back a few versions now.


Is there any equivalent for the new gcc.gnu.org repository? I tried 
anoncvs and anonsvn, among others. If so, could somebody put it on the wiki?


Thanks.

Andrew Stubbs

P.S. The wiki page seems to have got a little confused. It used to have 
three different examples of how to check out, but now has three very 
similar examples (some broken). Now that there is a read-only, non-secure 
service set up, the changes to these examples can probably be largely 
reverted.


Re: git conversion in progress

2020-01-14 Thread Andrew Stubbs

On 14/01/2020 13:00, Jonathan Wakely wrote:

On Tue, 14 Jan 2020 at 11:37, Georg-Johann Lay  wrote:


Am 14.01.20 um 12:34 schrieb Andreas Schwab:

On Jan 14 2020, Georg-Johann Lay wrote:


git clone --reference original-gcc ...


Don't use --reference.  It is too easy to lose work if you don't know
what you are doing.

Andreas.


Well, then it should not be proposed in git.html?


It's a work in progress. I've already suggested that worktrees are a
better solution for people who want to work on multiple branches but
save disk space.


Worktrees are better, for most purposes, but they still have the same 
risk as reference repos: "I don't need that old tree any more."


Both are best left for the advanced tricks section.

Andrew


Re: Wrong GCC PR2020 annotated for "[committed, libgomp,amdgcn] Fix plugin-gcn.c bug"

2020-01-23 Thread Andrew Stubbs

On 23/01/2020 16:46, Joseph Myers wrote:

On Thu, 23 Jan 2020, Richard Earnshaw (lists) wrote:


Perhaps we should restrict that to a single line, ie only horizontal white
space.  Our commit style isn't really that free-form when citing bugs.  Or
perhaps require [:.]?\w after the number (ie an optional period or colon and
then some white space).


I presume it's intended to handle a paragraph where the reference to bug
2020 happens to have a line break in the middle of it.

This is code maintained by overseers
(/sourceware/infra/bin/email-to-bugzilla, version-controlled in CVS,
shared by all sourceware projects) which hasn't changed since 2014.  The
only GCC-specific piece is a very thin wrapper
(~gccadmin/hooks-bin/email-to-bugzilla-filtered) based on that used by
binutils-gdb.


Indeed, PR2019 has a number of unrelated commits referenced, and PR2018 
has one too. The years before that appear to have escaped the problem.


Andrew


Re: [PATCH, v3] wwwdocs: e-mail subject lines for contributions

2020-02-04 Thread Andrew Stubbs

On 03/02/2020 18:09, Michael Matz wrote:

But suggesting that using the subject line for tagging is recommended can
lead to subjects like

  [PATCH][GCC][Foo][component] Fix foo component bootstrap failure

in an e-mail directed to gcc-patc...@gcc.gnu.org (from somewhen last year,
where Foo/foo was an architecture; I'm really not trying to single out the
author).  That is, _none_ of the four tags carried any informational
content.


I partially disagree with this. Certainly there's pointless redundancy 
in this example, but I'd rather have the tags with a meaningful subject 
than a generic subject with no tags.


gcc-patches is a high-volume list in which most of the content is 
outside my range of interest and/or understanding. If I stay on top of 
it then I can read all the subject lines, at least, and probably select 
a few threads to learn about something new, but if I let the list get 
away from me for even a short while then it's too much to handle.


I do have filters set up to highlight subjects for which I should pay 
attention and if people are in the habit of tagging subjects then that 
becomes much more reliable.


Conversely, the tags help me quickly decide what not to read.

I see that some people are using a "[email tag] git tag: msg" format, 
and I quite like that.


Andrew


Re: Branch instructions that depend on target distance

2020-02-24 Thread Andrew Stubbs

On 24/02/2020 11:05, Petr Tesarik wrote:

Hi all,

I'm looking into reviving the efforts to port gcc to VideoCore IV [1].
One issue I've run into is the need to find out the target branch distance
at compile time. I looked around, and it's not the first architecture
with such a requirement, but AFAICS it has never been solved properly.

For example, AVR tracks instruction length. Later, ret_cond_branch()
selects between a branch instruction and an inverted branch followed by
an unconditional jump based on these calculated lengths.

This works great ... until there's some inline asm() statement, for
which gcc cannot keep track of the length attribute, so it is probably
taken as zero. Linker then fails with a cryptic message:


relocation truncated to fit: R_AVR_7_PCREL against `no symbol'


You can probably fix this by implementing the ADJUST_INSN_LENGTH macro 
and recognising the inline assembler. See the internals manual.
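
For illustration, a target header could pessimize lengths for inline asm 
along these lines (a sketch only; the 64-byte figure and the exact test 
are just placeholders, not anything from a real port):

#define ADJUST_INSN_LENGTH(INSN, LENGTH)                               \
  do                                                                   \
    {                                                                  \
      /* Treat inline asm as "long" so branch-range calculations       \
         stay conservative.  */                                        \
      if (asm_noperands (PATTERN (INSN)) >= 0)                         \
        (LENGTH) = MAX ((LENGTH), 64);                                 \
    }                                                                  \
  while (0)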


We encountered similar issues with the recent GCN port, and the correct 
solution was to add the length attribute everywhere. The attributes are 
often conservative estimates (rather than having extra alternatives for 
every possible encoding), so the asm problem is mitigated somewhat, at 
the cost of a few "far" branches where they're not strictly necessary.


There were also additional problems because "far" branches clobber the 
condition register, and "near" branches do not, but that's another story.


Andrew


Masked vector deficiencies

2020-03-03 Thread Andrew Stubbs

Hi all,

Up until now the AMD GCN port has been using exclusively 64-lane vectors 
with masking for smaller sizes.


This works quite well, where it works, but there remain many test cases 
(and no doubt some real code) that refuse to vectorize because the 
number of iterations (or SLP equivalent) is smaller than the 
vectorization factor.


My question is: are there any plans to fill in these missing cases? Or, 
is relying on masking alone just not feasible?


I've dabbled in the vectorizer code, of course, but I can't claim to 
have much of a feel for it as a whole. I may be able to help with the 
effort in future, but for now I'm struggling to judge what's even needed.


For GCN the vectorization is quite important as scalar code is slow, and 
adding vectorization is usually cheap. The architecture can do any 
vector size between 1 and 64 lanes (not just powers of two), so being 
smaller than the vectorization factor really ought not to be a problem.


To fix this, I've been considering adding extra vector sizes (probably 
2, 4, 8, 16, 32) where the backend would take care of the masking. 
Aside from reductions and permutations the changes would be somewhat 
trivial, but the explosion in the number of generated patterns would be 
enormous, and it still won't allow arbitrary-size vectors.


Thank you for your time; I'm trying to decide where my efforts should lie.

Andrew


Re: Masked vector deficiencies

2020-03-03 Thread Andrew Stubbs

On 03/03/2020 15:57, Richard Sandiford wrote:

Andrew Stubbs  writes:

Hi all,

Up until now the AMD GCN port has been using exclusively 64-lane vectors
with masking for smaller sizes.

This works quite well, where it works, but there remain many test cases
(and no doubt some real code) that refuse to vectorize because the
number of iterations (or SLP equivalent) are smaller than the
vectorization factor.

My question is: are there any plans to fill in these missing cases? Or,
is relying on masking alone just not feasible?


This is supported for loop vectorisation.  E.g.:

   void f (short *x) { for (int i = 0; i < 7; ++i) x[i] += 1; }

generates:

 ptrue   p0.h, vl7
 ld1h    z0.h, p0/z, [x0]
 add     z0.h, z0.h, #1
 st1h    z0.h, p0, [x0]
 ret


Yes, this works on GCN, albeit not quite so prettily:

 s_mov_b64   exec, -1
 v_mov_b32   v0, 0
 s_mov_b64   exec, 127
 flat_load_ushort v0, v[4:5]
 s_waitcnt   0
 s_mov_b64   exec, -1
 v_add_u32   v0, vcc, 1, v0
 s_mov_b64   exec, 127
 flat_store_short v[4:5], v0
 s_setpc_b64 s[18:19]


for SVE.  BB SLP is on the wish-list for GCC 11, but no promises. :-)

Early peeling/complete unrolling can cause loops to be straight-line
code by the time the vectoriser sees them.  E.g. the loop above doesn't
use masked SVE for "i < 3".

Which kind of cases fail for GCN?


Certainly SLP accounts for many of them; gfortran.dg/vect/pr62283-2.f 
says "unsupported data-type real(kind=4)", which I think is another way 
of saying it wants a vector of precisely 4 elements.


For loops, examples are gcc.dg/vect/vect-reduc-1char.c and its 
relations. The "big-array" variants of the same tests vectorize just fine.


Andrew


Re: Blog post about static analyzer in GCC 10

2020-03-31 Thread Andrew Stubbs

On 26/03/2020 22:30, David Malcolm via Gcc wrote:

I wrote a blog post "Static analysis in GCC 10" giving an idea of the
current status of the -fanalyzer feature:
https://developers.redhat.com/blog/2020/03/26/static-analysis-in-gcc-10/

At some point I'll write up the material for our changes.html page.


This is some very cool stuff! :-)

It's too bad that tmux has been hiding this from me up to now. Fingers 
crossed for an implementation soon.


Just one thing that ought to be fixed before GCC 10 though: the URL for 
the -Wanalyzer-double-free option points to the wrong documentation 
page. It points to Warning-Options.html when it should be 
Static-Analyzer-Options.html.


I expect you knew that already. I'll shut up now.

Andrew


PCH test errors

2020-05-27 Thread Andrew Stubbs
I'm testing amdgcn-amdhsa, and I get a lot of PCH test failures with 
errors like this:


gcc.dg/pch/common-1.c:1:22: error: one or more PCH files were found, but 
they were invalid

gcc.dg/pch/common-1.c:1:22: error: use -Winvalid-pch for more information
gcc.dg/pch/common-1.c:1:10: fatal error: common-1.h: No such file or 
directory


It may affect other targets, but I've not tested those. Has anybody else 
seen this issue?


I've done a bisect, and found it first starts with this:

   1dedc12d1: revamp dump and aux output names

I've also seen others complaining about other issues with this commit, 
so this may be a duplicate issue, somehow.


Andrew


Re: PCH test errors

2020-05-28 Thread Andrew Stubbs

On 27/05/2020 15:46, Andrew Stubbs wrote:
I'm testing amdgcn-amdhsa, and I get a lot of PCH test failures with 
errors like this:


gcc.dg/pch/common-1.c:1:22: error: one or more PCH files were found, but 
they were invalid

gcc.dg/pch/common-1.c:1:22: error: use -Winvalid-pch for more information
gcc.dg/pch/common-1.c:1:10: fatal error: common-1.h: No such file or 
directory


Hi Alexandre,

I've created a test toolchain for you on the GCC compile farm: 
gcc14.fsffrance.org


Please see /home/ams_cs/gccobj and /home/amd_cs/install.

I ran "make check RUNTESTFLAGS=pch.exp" already, so you should be able 
to see the errors in gcc/testsuite/gcc/gcc.log.


The sources and objects are set world writeable, so you should be able 
to experiment (please set "umask 0" so I can clean up after).


Andrew


Re: PCH test errors

2020-05-29 Thread Andrew Stubbs

On 29/05/2020 01:00, Alexandre Oliva wrote:

I understand the problem, and I'm tempted to say it was a latent
preexisting problem.

gcn-hsa.h defines -mlocal-symbol-id=%b in CC1_SPEC.

This is a target option not marked as pch_ignore, so
option_affects_pch_p returns true for it, and default_pch_valid_p in
targhooks.c compares the saved option in the PCH file with the active
option in the current compilation.


Thank you for the careful analysis. I would have struggled to get there 
myself.


That option is the vestige of a horrible, ugly workaround for an ELF 
loader bug in the GPU drivers, in which it would refuse to load a binary 
that had the same local symbol defined multiple times (e.g. from linking 
together .o files each containing the same local name).


That bug has been fixed in the driver, and I don't think the name 
mangling was ever committed to the upstream toolchain, but the (now 
inactive) option remains. I don't recall why; it may have been for 
backward compatibility, or it may have been an unintentional omission.


Either way, I don't think we need it in GCC 11, so I can just rip it out.

Thanks again

Andrew


Re: GCC 10.1.0 HELP

2020-06-11 Thread Andrew Stubbs

On 11/06/2020 08:40, MAHDI LOTFI via Gcc wrote:

[AMD/ATI] Topaz XT [Radeon R7 M260/M265 / M340/M360 / M440/M445]
[1002:6900] (rev 81)

I think my GPU is older than fiji and Vega AMDs.
Can GCC 10 support my GPU Model?


According to Wikipedia 
(https://en.wikipedia.org/wiki/Radeon_Rx_200_series), Topaz is GCN 1, 
and therefore too old.


GCC 10 supports only the specific devices listed.

Andrew



Re: Please put vim swap files into gitignore

2020-06-18 Thread Andrew Stubbs

On 18/06/2020 19:20, Thomas Koenig via Gcc wrote:

Hi,

I just found a few unversioned files called .intrinsic.c.swp and
similar in my "git status" output.

Could somebody please put .*.swp into .gitignore?  I'm sure this
would save at least 10 reverts :-)


I have this in my .vimrc to keep such junk out of my working directories.

" Keep .swp files all in one place
set dir=~/tmp
" same for undo files
set undodir=~/tmp
set undofile

Andrew


DWARF subregs

2020-06-26 Thread Andrew Stubbs

Hi all,

I'm trying to implement DWARF output for the AMD GCN target, and I've 
run into trouble; -O0 debug works pretty well, but there are some 
problems accessing variables in registers.


Problem 1 

The proposed DWARF specification for the target doesn't specify separate 
DWARF registers for the high and low parts of certain 64 bit registers 
(specifically EXEC and VCC), even though the hardware does.


Instead, one is expected to specify that EXEC_HI and VCC_HI are parts of 
EXEC and VCC, but I'm pretty sure GCC can't do that.


How can I express that in DWARF, and how should I go about implementing 
it in GCC? I think dwarf2out.c will need patching, but some clues about 
where would be welcome.
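
To be concrete about what I think is needed, the location for a value in 
EXEC_HI would presumably be the upper 32 bits of the DWARF register used 
for EXEC, something like this in dwarf2out terms (a rough sketch only; 
EXEC_DW_REGNUM is a made-up name):

  /* Describe EXEC_HI as bits 32..63 of the DWARF register for EXEC.  */
  dw_loc_descr_ref loc = new_loc_descr (DW_OP_regx, EXEC_DW_REGNUM, 0);
  add_loc_descr (&loc, new_loc_descr (DW_OP_bit_piece, 32, 32));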


Problem 2 

The GCN architecture makes it common to have scalars located in vector 
registers (these are used with the other lanes masked off).


I have no problem expressing, in the DWARF, which register holds the 
variable, but rocgdb still wants to treat the value as a vector, which 
doesn't work so well in complex DWARF expressions.


The proposed DWARF specification includes a new directive 
"DW_OP_LLVM_push_lane" to handle this, but of course GCC does not 
support this yet.


How can I best implement this new feature, both in dwarf2out and in the 
target hooks?


The proposed standard changes are here:

http://llvm.org/docs/AMDGPUUsage.html#dwarf-debug-information
http://llvm.org/docs/AMDGPUDwarfProposalForHeterogeneousDebugging.html

Thanks in advance

Andrew


Re: How to refine autovectorized loop

2020-07-15 Thread Andrew Stubbs

On 15/07/2020 03:39, 夏 晋 via Gcc wrote:

Hi everyone,
   I'm trying to autovectorize a loop, and thanks to the omnipotent 
macros everything goes alright. But recently I needed to optimize the 
loop further, and I had some problems.
   As our vector instruction can process 16 numbers at the same time, if 
the for-loop counter is equal to or larger than 16, the loop will be 
autovectorized. For example:
   for (int i = 0; i < 16; i++) c[i] = a[i] + b[i];
   will become:
   vld v0, a0
   vld v1, a1
   vadd v0,v0,v1
   vfst v0, a2
   But if I write code like: for (int i = 0; i < 15; i++) c[i] = a[i] + b[i]; 
the autovectorization will miss it. However, we have an instruction 
"vlen", which can change the length of the vector operation, and I wish 
to generate assembler like this when the loop counter is 15:
   vlen 15
   vld v0, a0
   vld v1, a1
   vadd v0,v0,v1
   vfst v0, a2
   What should I do to achieve this goal? I've tried to "define 
TARGET_HAVE_DOLOOP_BEGIN" and define_expand "doloop_begin", but 
"doloop_begin" won't be called. Is there any other way? And if the loop 
counter is bigger than 16, like 30, 31, or just a variable, what should 
I do with "vlen"? Any hint would be helpful. Thank you very much.



We have had similar issues with the AMD GCN port, in which the vector 
length is 64 and many smaller vectorizable cases get missed.


There are two solutions (that I know of):

1. Implement "masked" vectors. GCC will then use just a portion of the 
total vector in some cases. I don't know if your architecture can cope 
with arbitrary masks, but you can probably simulate them using vector 
conditionals, and still win (maybe). You can certainly recognise 
constant masks that mearly change the length. Probably the vectorizer 
code could be modified, via a new hook, to only generate masks that work 
for you (masks generated via WHILE_ULT would be fine, for example).


2. Add extra, smaller vector modes that work the same, but your backend 
inserts vlen adjustments as necessary (in the md_reorg pass, perhaps). 
You might have V2, V4, V8, and V16, for example.


Or both: for GCN, arbitrary masks work fine, but not all of GCC can take 
advantage of them, so I've been experimenting with adding multiple 
vector length modes to make up the difference.
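
To illustrate option 1 concretely: a masked store can be simulated with 
a full-width load, a lane-wise select, and a full-width store. A rough 
plain-C sketch, for a hypothetical 16-lane machine:

/* Each line of the loop body corresponds to one full-width vector
   operation; masked-off lanes keep their old values via the select.  */
void
masked_store_sim (int *dst, const int *src, const int *mask)
{
  for (int i = 0; i < 16; i++)
    dst[i] = mask[i] ? src[i] : dst[i];
}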


Andrew


TImode for BITS_PER_WORD=32 targets

2020-07-24 Thread Andrew Stubbs

Hi all,

I want amdgcn to be able to support int128 types, partly because they 
might come up in code offloaded from x86_64, and partly because 
libgomp now requires at least some support (amdgcn builds have been 
failing since yesterday).


But, amdgcn has 32-bit registers, and therefore defines BITS_PER_WORD to 
32, which means that TImode doesn't Just Work, at least not for all 
operators. It already has TImode moves, for internal uses, so I can 
enable TImode and fix the libgomp build, but now libgfortran tries to 
use operators that don't exist, so I'm no better off.


The expand pass won't emit libgcc calls, like it does for DImode, and 
libgcc doesn't have the routines for it anyway. Neither does it support 
synthesized shifts or rotates for more than double-word types. 
(Multiple-word add and subtract appear to work fine, however.)


What would be the best (least effort) way to implement this?

I think I need shift, rotate, multiply, divide, and modulus, but there's 
probably more.
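
For a sense of scale, an out-of-line TImode shift is not much code; a 
rough sketch (not libgcc code, just two 64-bit halves with hard-coded 
types) would be:

typedef unsigned long long u64;
typedef struct { u64 lo, hi; } ti_t;   /* stand-in for TImode */

static ti_t
ashl_ti (ti_t x, unsigned int n)       /* valid for 0 <= n < 128 */
{
  ti_t r;
  if (n == 0)
    r = x;
  else if (n < 64)
    {
      r.hi = (x.hi << n) | (x.lo >> (64 - n));
      r.lo = x.lo << n;
    }
  else
    {
      r.hi = x.lo << (n - 64);
      r.lo = 0;
    }
  return r;
}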


Thanks, any advice will be appreciated.

Andrew


Re: Clobber REG_CC only for some constraint alternatives?

2020-08-20 Thread Andrew Stubbs

On 20/08/2020 06:40, Senthil Kumar Selvaraj via Gcc wrote:

What I didn't understand was the (set-attr "cc")
part - as far I can tell, this results in (set_attr "cc_enabled" ...) in
all of the three substituted patterns, so I wondered why not just have
(set_attr "cc_enabled" ...) in the original define_insn instead.

I now realize that with (set-attr "cc"), the original
unsubstituted pattern will have only a (set_attr "cc" ...) and would
therefore not match the attr check for "enabled" - correctly so, as the
original insn pattern clobbers CRIS_CC0_REGNUM. Did I get that right?


The best (only?) way to understand define_subst is to read the expanded 
machine description. This is not written anywhere, by default, but 
there's a way to get it.


  cd <build-dir>/gcc
  make mddump
  less tmp-mddump.md

Not only are all the define_subst expanded, but so are all the other 
iterators.


HTH

Andrew


Import license issue

2020-09-14 Thread Andrew Stubbs

Hi All,

I need to update include/hsa.h to access some newer APIs. The existing 
file was created by copying from the user manual, thus side-stepping 
licensing issues, but the updated user manual omits some important 
details from the APIs I need (mostly the contents of structs and value 
of enums). Of course, I can go see those details in the source, but 
that's not the same thing.


So, what I would like to do is import the header files I need into the 
GCC sources; there's precedent for importing (unmodified) copyrighted 
files for libffi etc., AFAICT, but of course the license needs to be 
acceptable.


The relevant files are here:

https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/inc/hsa.h
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/inc/hsa_ext_amd.h
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/inc/hsa_ext_image.h

When I previously enquired about this on IRC I was advised that the 
Illinois license would be unacceptable because it contains an 
attribution clause that would require all binary distributors to credit 
AMD in their documentation, which seems like a reasonable position. I've 
requested that AMD provide a copy of these specific files with a more 
acceptable license, and I may yet be successful, but it's not that simple.


The problem is that GCC already has this exact same license in 
libsanitizer/LICENSE.TXT so, again reasonably, AMD want to know why that 
licence is acceptable and their license is not.


Looking at the files myself, there appears to be some kind of dual 
license thing going on, and the word "Illinois" doesn't actually appear 
in any libsanitizer source file (many of which contain an Apache license 
header). Does this mean that the Illinois license is not actually active 
here? Or is it that it is active and binary distributors really should 
be obeying this attribution clause already?


Can anybody help me untangle this, please?

Are the files acceptable, and if not, how is this different from the 
other cases?


Thanks very much

Andrew


Re: Import license issue

2020-09-21 Thread Andrew Stubbs

Ping.





Re: Import license issue

2020-09-21 Thread Andrew Stubbs

On 21/09/2020 12:31, Richard Biener wrote:

On Mon, Sep 21, 2020 at 10:55 AM Andrew Stubbs  wrote:


Ping.


Sorry, but you won't get any help resolving license issues from the
mailing list.
Instead you should eventually ask the SC to "resolve" this issue with the FSF.


Agreed, I don't really expect legal advice, but I am hoping somebody on 
the list has some historical details that might help me.


Thanks

Andrew


32-bit build failure

2016-11-09 Thread Andrew Stubbs

Hi Martin,

It looks like your change r242000 broke builds on 32-bit hosts:

fold-const-call.c:1541:36: error: cannot convert 'size_t* {aka unsigned 
int*}' to 'long long unsigned int*' for argument '2' to 'const char* 
c_getstr(tree, long long unsigned int*)'


Basically, the code only works where HOST_WIDE_INT == size_t.
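
The usual fix for this kind of host-dependence is to give c_getstr the 
unsigned HOST_WIDE_INT it expects and convert afterwards, something like 
this (a sketch only, not the committed change; "arg" stands for whatever 
tree is being queried):

  unsigned HOST_WIDE_INT wlen;            /* matches c_getstr's parameter */
  const char *p = c_getstr (arg, &wlen);  /* no size_t* passed directly */
  size_t len = wlen;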

Andrew


Help with reload bug, please

2015-01-23 Thread Andrew Stubbs
How does reload ensure that an SImode value (re)loaded into an FP 
register has a valid stack index?


The FP load instruction allows a smaller index range than the integer 
equivalent, but nothing checks the destination register, only the source 
mode.


I'm trying to solve a problem in which GCC 4.1 gets this wrong, but 
AFAICT this code works exactly the same now as then (although I don't 
have a testcase). IOW, unless I'm missing something, the only reason 
this doesn't fail all the time is that it's quite rare for the register 
pressure to cause just this value to spill in a function that has a 
stack frame >1KB and the index happens to be beyond the limit.


My target is ARMv7a with VFP. The code is trying to cast an int to a 
float. The machine description is such that the preferred route is to 
load the value first into a general register, transfer it to the VFP 
register, and then convert. It's only possible to get it to load 
directly to the VFP register if all the general registers are in use. 
This makes it very hard to write a synthetic testcase.


I can "fix" the problem by rewriting arm_legitimate_index_p such that it 
assumes all SImode access might be VFP loads, but that seems suboptimal.
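
The shape of that workaround would be to accept only SImode indexes that 
VLDR/VSTR can also encode, i.e. a multiple of 4 within +/-1020. A sketch 
(not the real arm_legitimate_index_p code):

/* Assume any SImode access could end up as a VFP load/store, so only
   accept offsets that VLDR/VSTR can encode.  */
static bool
simode_index_ok_for_vfp_too (HOST_WIDE_INT offset)
{
  return (offset % 4) == 0 && offset >= -1020 && offset <= 1020;
}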


Any suggestions would be appreciated!

Thanks

Andrew


Re: Help with reload bug, please

2015-01-23 Thread Andrew Stubbs

On 23/01/15 16:34, Jeff Law wrote:

Just for reference, the PA allows a 14 bit displacement in memory
addresses which use integer registers, but only a 5 bit displacement for
FP registers.  Other than the displacement amounts, I suspect this is
the same core problem you have on your port.


Yes, that seems similar.


Ultimately all I could do was layer hack on hack.  I can't remember them
all.  The most significant ones were to first reject the larger offsets
for FP modes in GO_IF_LEGITIMATE_ADDRESS.  While it's still valid (and
relatively common on the PA) to access integer registers in FP modes or
vice-versa, this change was a huge help.


This is already the case; it does the right thing when the mode is SFmode.


Secondary reloads are critical.  When you detect a situation that won't
work, you have to allocate a secondary reload register so that you can
build up the address as well as all the reload_in/reload_out handling.
This is how you ensure that if the compiler did something like try to
load from memory using an integer mode into an FP register you've got a
scratch register for reloading the address if it is an out-of-range reg+
address.


SECONDARY_INPUT_RELOAD_CLASS is another missed opportunity. Just like 
the legitimate address stuff, this has checks for the various VFP 
classes, but reload detects the class in the same flawed way, so an 
integer reload gives GENERAL_REGS even when the destination is VFP. 
Within the macro there's no way to see the whole insn.



We may have used special constraints as well to allow loads/stores of
integer registers in FP modes to use the larger offset.


Do you have an example?

Thanks

Andrew



Should "can_create_pseudo_p" check "lra_in_progress"?

2018-10-05 Thread Andrew Stubbs
I just tracked down a "reload" bug and was very surprised to find that 
can_create_pseudo_p doesn't return false during register allocation when 
using LRA.


It's still defined like this:

#define can_create_pseudo_p() (!reload_in_progress && !reload_completed)

Is it deliberate that it doesn't check lra_in_progress?
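
For comparison, the variant I would have expected looks something like 
this (just a sketch of the idea):

#define can_create_pseudo_p() \
  (!lra_in_progress && !reload_in_progress && !reload_completed)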

Thanks

Andrew


define_subst question

2018-11-27 Thread Andrew Stubbs

I want to use define_subst like this:

(define_subst "vec_merge_with_vcc"
  [(set (match_operand 0)
(match_operand 1))
   (set (match_operand 2)
(match_operand 3))]
  ""
  [(parallel
 [(set (match_dup 0)
   (vec_merge
 (match_dup 1)
 (match_operand 4 "" "0")
 (match_operand 5 "" "e")))
  (set (match_dup 2)
   (and (match_dup 3)
(match_operand 5 "" "e")))])])

(Predicates and modes removed for brevity.)

This works perfectly except for operand 5. :-(

The problem is that I can't find a way to make them "match". As it is 
the substitution creates two operands, with different numbers, which 
obviously is not what I want.


The manual says that I can't do (match_dup 5), and indeed trying that 
results in a segmentation fault.


The "e" register class actually contains only one register, so I tried 
using (reg:DI EXEC_REG) for the second instance, but that caused an ICE 
elsewhere.


I could use an unspec to make the second reference implicit, or just lie 
to the compiler and pretend it's not there, but ...


How can I do this properly, please?

Thanks

Andrew


New jump threading issue

2018-12-07 Thread Andrew Stubbs
Since the postreload_jump pass was added I'm having trouble with the AMD 
GCN port.


I have the following, after reload (RTL slightly simplified!):

(insn (set (reg scc) (gtu (reg s26) (reg s25))))
(jump_insn (set pc (if_then_else (ne scc 0) (label_ref 46) pc)))
.
(insn (set (reg scc) (eq (reg s18) (const_int 0))))
(jump_insn (set pc (if_then_else (ne scc 0) (label_ref 56) pc)))
(code_label 46)
(insn (set (reg scc) (ge s25 0)))
(jump_insn (set pc (if_then_else (ne scc 0) (label_ref 48) pc)))
.

and it's being transformed into this:

(insn (set (reg scc) (gtu (reg s26) (reg s25))))
(jump_insn (set pc (if_then_else (ne scc 0) (label_ref 48) pc)))
.
(insn (set (reg scc) (eq (reg s18) (const_int 0))))
(jump_insn (set pc (if_then_else (ne scc 0) (label_ref 56) pc)))
[2 insns deleted]
.

So, basically, it seems to have decided that the final jump is always 
taken in one case, and never taken in the other, and I can't see any 
reason for either conclusion. s25 is an input parameter to the function.


It seems unlikely that the jump threading is that buggy, given it's an 
existing pass run again (right?), but I'm a bit confused about what I 
could be doing wrong here?


The affected function, udivsi3, is in libgcc, and every test case that 
calls it exits with the wrong output.


Any help appreciated. Full dumps attached.

Thanks

Andrew
;; Function __udivsi3 (__udivsi3, funcdef_no=3, decl_uid=1439, cgraph_uid=4, 
symbol_order=3)

  Creating newreg=467
Removing SCRATCH in insn #17 (nop 3)
rescanning insn with uid = 17.
  Creating newreg=468
Removing SCRATCH in insn #22 (nop 3)
rescanning insn with uid = 22.
  Creating newreg=469
Removing SCRATCH in insn #23 (nop 3)
rescanning insn with uid = 23.
  Creating newreg=470
Removing SCRATCH in insn #38 (nop 3)
rescanning insn with uid = 38.
  Creating newreg=471
Removing SCRATCH in insn #43 (nop 3)
rescanning insn with uid = 43.
  Creating newreg=472
Removing SCRATCH in insn #45 (nop 3)
rescanning insn with uid = 45.
  Creating newreg=473
Removing SCRATCH in insn #52 (nop 3)
rescanning insn with uid = 52.
  Creating newreg=474
Removing SCRATCH in insn #59 (nop 3)
rescanning insn with uid = 59.
  Creating newreg=475
Removing SCRATCH in insn #64 (nop 3)
rescanning insn with uid = 64.
  Creating newreg=476
Removing SCRATCH in insn #66 (nop 3)
  Creating newreg=477
Removing SCRATCH in insn #66 (nop 4)
rescanning insn with uid = 66.
  Creating newreg=478
Removing SCRATCH in insn #67 (nop 3)
rescanning insn with uid = 67.
  Creating newreg=479
Removing SCRATCH in insn #70 (nop 3)
rescanning insn with uid = 70.
  Creating newreg=480
Removing SCRATCH in insn #71 (nop 3)
rescanning insn with uid = 71.
  Creating newreg=481
Removing SCRATCH in insn #74 (nop 3)
rescanning insn with uid = 74.

** Local #1: **

   Spilling non-eliminable hard regs: 16 17
New elimination table:
Can eliminate 416 to 16 (offset=-8, prev_offset=0)
Can eliminate 416 to 14 (offset=-8, prev_offset=0)
Can eliminate 418 to 16 (offset=0, prev_offset=0)
Can eliminate 418 to 14 (offset=0, prev_offset=0)
  alt=0,overall=0,losers=0,rld_nregs=0
 Choosing alt 0 in insn 16:  (0) =cs  (2) SSA  (3) SSA {cstoresi4}
  alt=0,overall=0,losers=0,rld_nregs=0
 Choosing alt 0 in insn 6:  (0) =SD  (1) SSA {*movsi_insn}
3 Scratch win: reject+=2
  alt=0,overall=2,losers=0,rld_nregs=0
 Choosing alt 0 in insn 17:  (2) ca  (3) =cs {cjump}
  Change to class SCC_CONDITIONAL_REG for r467
3 Scratch win: reject+=2
  alt=0,overall=2,losers=0,rld_nregs=0
 Choosing alt 0 in insn 22:  (0) =Sg  (1) SgB  (2) SgA  (3) =cs 
{ashlsi3}
  Change to class SCC_CONDITIONAL_REG for r468
3 Scratch win: reject+=2
  alt=0,overall=2,losers=0,rld_nregs=0
 Choosing alt 0 in insn 23:  (0) =Sg  (1) SgB  (2) SgA  (3) =cs 
{ashlsi3}
  Change to class SCC_CONDITIONAL_REG for r469
0 Small class reload: reject+=3
0 Non input pseudo reload: reject++
  alt=0,overall=10,losers=1,rld_nregs=1
0 Small class reload: reject+=3
0 Non input pseudo reload: reject++
  alt=1,overall=10,losers=1,rld_nregs=1
0 Small class reload: reject+=3
0 Non input pseudo reload: reject++
  alt=2,overall=10,losers=1,rld_nregs=1
0 Small class reload: reject+=3
0 Non input pseudo reload: reject++
  alt=3,overall=10,losers=1,rld_nregs=1
 Choosing alt 0 in insn 24:  (0) =cs  (2) SSA  (3) SSA {cstoresi4}
  Creating newreg=482 from oldreg=442, assigning class SCC_CONDITIONAL_REG 
to r482
   24: r482:BI=gtu(r438:SI,r428:SI)
Inserting insn reload after:
  111: r442:BI=r482:BI

0 Small class reload: reject+=3
0 Non input pseudo reload: reject++
  alt=0,overall=10,losers=1,rld_nregs=1
0 Small

Re: New jump threading issue

2018-12-10 Thread Andrew Stubbs

On 07/12/2018 22:41, Segher Boessenkool wrote:

On Fri, Dec 07, 2018 at 05:57:39PM +, Andrew Stubbs wrote:

Since the postreload_jump pass was added I'm having trouble with the AMD
GCN port.


[ snip a lot ]

It seems thread_jump does not notice your scc in its "nonequal" regset,
so it thinks every later jump is based on the same scc setting as the
first, which explains this behaviour.  Is this true, and if so, what
causes it?


It looks like thread_jump (or maybe mark_effect) is broken when a cjump 
also clobbers the condition register.


If I remove the clobber from the machine description then all is well 
(as long as there are none of the "far" branches that clobber scc).


There are a few issues here:

1. The clobber on the first cjump is not taken into account (AFAICT).

2. The clobber on the second cjump is irrelevant, but causes the bit to 
get cleared.


3. I'm not sure I understand the logic of why mark_effect clears the 
nonequal bit for clobbers at all; surely a clobber makes something 
"non-equal" just as effectively as a set?


Is it even possible for jump threading to work when the register is 
clobbered? (It's not obvious to me that reloading the same condition 
would be detected by this algorithm, but then I don't quite follow the 
"equals" logic, yet.)


Any suggestions what an acceptable fix might be?

Thanks

Andrew


RTL alternative selection question

2019-09-23 Thread Andrew Stubbs

Hi All,

I'm trying to figure out how to prevent LRA selecting alternatives that 
result in values being copied from A to B for one instruction, and then 
immediately back from B to A again, when there are apparently more 
sensible alternatives available.


I have an insn with the following pattern (simplified here):

  [(set (match_operand:DI 0 "register_operand"  "=Sg,v")
(ashift:DI
  (match_operand:DI 1 "gcn_alu_operand" " Sg,v")
  (match_operand:SI 2 "gcn_alu_operand" " Sg,v")))
   (clobber (match_scratch:BI 3 "=cs,X"))]

There are two lshl instructions; one for scalar registers and one for 
vector registers. The vector here has only a single element, so the two 
are equivalent, but we need to pick one.


This operation works for both register files, but there are other 
operations that exist only on one side or the other, so we want those to 
determine in which register file the values are allocated.


Unfortunately, the compiler (almost?) exclusively selects the second 
alternative, even when this means moving the values from one register 
file to the other, and then back again.


The problem is that the scalar instruction clobbers the CC register, 
which results in a "reject++" for that alternative in the LRA dump.


I can fix this by disparaging the second alternative in the pattern:

   (clobber (match_scratch:BI 3 "=cs,?X"))

This appears to do the right thing. I can now see both kinds of shift 
appearing in the assembly dumps.


But that does "reject+=6", which makes me worry that the balance has now 
shifted too far the other way.


Does this make sense?

   (clobber (match_scratch:BI 3 "=^cs,?X"))

Is there a better way to discourage the copies? Perhaps without editing 
all the patterns?


What I want is for the two alternatives to appear equal when the CC 
register is not live, and when CC is live for LRA to be able to choose 
between reloading CC or switching to the other alternative according to 
the situation, not on the pattern alone.


Thanks in advance.

Andrew


Re: RTL alternative selection question

2019-09-23 Thread Andrew Stubbs

On 23/09/2019 15:15, Segher Boessenkool wrote:

On Mon, Sep 23, 2019 at 11:56:27AM +0100, Andrew Stubbs wrote:

   [(set (match_operand:DI 0 "register_operand"  "=Sg,v")
 (ashift:DI
   (match_operand:DI 1 "gcn_alu_operand" " Sg,v")
   (match_operand:SI 2 "gcn_alu_operand" " Sg,v")))
(clobber (match_scratch:BI 3 "=cs,X"))]



Unfortunately, the compiler (almost?) exclusively selects the second
alternative, even when this means moving the values from one register
file to the other, and then back again.

The problem is that the scalar instruction clobbers the CC register,
which results in a "reject++" for that alternative in the LRA dump.


What kind of reject?  It prints a reason, too.


 0 Non input pseudo reload: reject++


Maybe we should make a macro/hook to never do that for your target, for
those flags registers anyway.


That wouldn't be horrible. I suppose I could look at doing that. Any 
suggestions how it should or should not work?


Andrew


Re: RTL alternative selection question

2019-09-23 Thread Andrew Stubbs

On 23/09/2019 16:21, Segher Boessenkool wrote:

Pass the register class or constraint or something like that to the hook,
then based on what the hook returns, either or not do the reject?  So your
hook would special-case SCC_CONDITIONAL_REG, maybe a few more similar ones
(those are confusing names btw, _REG but they are register classes).


It's a class of one register!

Thanks

Andrew


Re: RTL alternative selection question

2019-10-01 Thread Andrew Stubbs

On 23/09/2019 15:39, Andrew Stubbs wrote:

On 23/09/2019 15:15, Segher Boessenkool wrote:

On Mon, Sep 23, 2019 at 11:56:27AM +0100, Andrew Stubbs wrote:

   [(set (match_operand:DI 0 "register_operand"  "=Sg,v")
 (ashift:DI
   (match_operand:DI 1 "gcn_alu_operand" " Sg,v")
   (match_operand:SI 2 "gcn_alu_operand" " Sg,v")))
    (clobber (match_scratch:BI 3 "=cs,X"))]



Unfortunately, the compiler (almost?) exclusively selects the second
alternative, even when this means moving the values from one register
file to the other, and then back again.

The problem is that the scalar instruction clobbers the CC register,
which results in a "reject++" for that alternative in the LRA dump.


What kind of reject?  It prints a reason, too.


  0 Non input pseudo reload: reject++


Apparently I was confused by operand "0" versus alternative "0". That 
message did occur, but it wasn't the only one. Here's all of it:


 0 Non input pseudo reload: reject++
 3 Scratch win: reject+=2
   alt=0,overall=9,losers=1,rld_nregs=2
   alt=1,overall=6,losers=1,rld_nregs=2

I don't understand why the "reject++" occurs, but presumably it has to 
do with the "Sg" register availability somehow?


The "Scratch win" part comes from this code:

  /* We simulate the behavior of old reload here.
 Although scratches need hard registers and it
 might result in spilling other pseudos, no reload
 insns are generated for the scratches.  So it
 might cost something but probably less than old
 reload pass believes.  */
  if (scratch_p)
{
  if (lra_dump_file != NULL)
fprintf (lra_dump_file,
 "%d Scratch win: reject+=2\n",
 nop);
  reject += 2;
}
}

Would it make sense to skip this reject when CLASS_LIKELY_SPILLED_P, as 
Jeff suggested?


Unfortunately, removing the "Scratch win" penalty alone is not enough 
for LRA to select the first alternative -- at least, not in my testcase 
-- so I need to understand the "non input pseudo reload" issue as well. 
I can see why it fires for alt0, but not why it does not fire for alt1.


Andrew


Re: [RFC] Characters per line: from punch card (80) to line printer (132)

2019-12-05 Thread Andrew Stubbs

On 05/12/2019 16:17, Joseph Myers wrote:

Longer lines mean less space for multiple terminal / editor windows
side-by-side to look at different pieces of code.  I don't think that's an
improvement.


Here's a data-point:

My 1920 pixel-wide screen, in the default font, allows 239 columns; not 
enough for two 130-wide editors.  Especially not with line numbers and 
"gutter" columns.


On the other hand, 80 columns does tend to cause some formatting 
contortions, with long function names and deeper indentations.


I think a nice round 100 would be a good compromise.

Andrew


Re: [RFC] Characters per line: from punch card (80) to line printer (132)

2019-12-06 Thread Andrew Stubbs

On 05/12/2019 18:21, Robin Curtis wrote:

My IBM Selectric golfball electronic printer only does 90 characters on A4 in 
portrait mode………(at 10 cps)

(as for my all electric TELEX Teleprinter machine !)

Is this debate for real ?!  - or is this a Christmas spoof ?


I can't speak for the debate, but the pain is real.

Andrew


Register allocation trouble

2017-07-21 Thread Andrew Stubbs

Hi all,

I have an architecture that has two register files. Let's call them 
class A and class B. There are some differences between their 
capabilities, but for the purposes of this problem, they can be 
considered to be identical, both holding SImode values, and both able to 
receive values without a secondary reload.


In order to load data into A, the address register must also be in A. 
Similarly, in order to load data into B, the address register must also 
be in B.


So I set up my (simplified) pattern:

(set (match_operand:SI 0 "register_operand" "=a,b")
     (match_operand:SI 1 "memory_operand"   "Ra,Rb"))

where
  "a" is a register in A
  "b" is a register in B
  "Ra" is a mem with base address in A
  "Rb" is a mem with base address in B

(Obviously, there are stores and moves and whatnot too, but you get the 
idea.)


The problem is that the register allocator cannot see inside Ra and Rb 
to know that it must allocate the base register correctly. It only knows 
that, some of the time, the instruction "does not satisfy its constraints".


I believe the register allocator relies on base_reg_class to choose a 
class for the base register, but that just says "AB" (the union of class 
A and class B), because it has no context to say otherwise. Only the 
constraints define what hard registers are acceptable, and all they see 
is pseudoregs during reload.


Similarly for the other hooks I've investigated: they don't have enough 
context to know what's going on, and the rtx given always has pseudo 
registers anyway.


The only solutions I can think of are to either preallocate the loads to 
register classes via some means MODE_CODE_BASE_REG_CLASS can see 
(address spaces, perhaps), or to have an alternative that catches the 
mismatching cases and splits to a load and move, or set the base class 
to A and always load via secondary reload for B.
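
As a sketch of the address-space idea, the base class could then be 
chosen per access along these lines, assuming a hypothetical ADDR_SPACE_B 
and register classes A_REGS/B_REGS (names invented here):

#define MODE_CODE_BASE_REG_CLASS(MODE, AS, OUTER_CODE, INDEX_CODE) \
  ((AS) == ADDR_SPACE_B ? B_REGS : A_REGS)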


Any suggestions would be greatly appreciated.

Andrew


Re: Register allocation trouble

2017-07-24 Thread Andrew Stubbs

Thanks to all those who replied. :-)

Here's what I've done to fix the problem:

1. Set the base rclass to A only.

2. Configured secondary reloads to B via A.

3. Disabled the Rb constraint. [*]

That's enough to create correct code, but it's pretty horrible, so I 
also added new patterns of the form Nathan suggested so that the base 
register can be allocated directly, as an optimization. These occur 
before the main mov insn in the search order, and catch only valid MEMs 
that won't get meddled with, so I believe that the "one mov per mode" 
rule can be safely ignored. The main mov pattern can still do the 
loads/stores, via the secondary reloads, so I believe that to be safe too.
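
For reference, step 2 boils down to a TARGET_SECONDARY_RELOAD hook along 
these lines (a sketch with invented class names A_REGS/B_REGS, not the 
actual port code):

/* Reload memory accesses destined for the B file through an A-class
   intermediate register; the generic code then emits the A->B move.  */
static reg_class_t
example_secondary_reload (bool in_p, rtx x, reg_class_t rclass,
                          machine_mode mode ATTRIBUTE_UNUSED,
                          secondary_reload_info *sri ATTRIBUTE_UNUSED)
{
  if (in_p && rclass == B_REGS && MEM_P (x))
    return A_REGS;
  return NO_REGS;
}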


Thanks again

Andrew

[*] I've not removed it because actually it's still active for some 
address spaces, but that's a detail I glossed over previously.


Re: Register allocation trouble

2017-07-24 Thread Andrew Stubbs

On 24/07/17 14:58, Georg-Johann Lay wrote:

Dunno if that works in all situation.  For example, when the register
allocator is facing high register pressure and decides to spill the
target register, it uses the constraints of the matched insn.


That would be a memory to memory move, and therefore not valid in any 
mov insn on many architectures. Are you sure that's a real thing?


Confused now.

Andrew


AMD GCN port

2018-05-09 Thread Andrew Stubbs

Honza, Martin,

Further to our conversation on IRC ...

We have just completed work on a GCN3 & GCN5 port intended for running 
OpenMP and OpenACC offload kernels on AMD Fiji and Vega discrete GPUs. 
Unfortunately Carrizo is probably broken because we don't have one to 
test, and the APUs use shared memory and XNACK, which we've not paid any 
attention to.


There will be a binary release available soon(ish).

Apologies the development schedule has made it hard to push the work 
upstream, but now it is time.


I've posted the code to Github for reference:
 https://github.com/ams-cs/gcc
 https://github.com/ams-cs/newlib

We're using LLVM 6 for the assembler and linker; there's no binutils port.

It should be possible to build a "standalone" amdgcn-none-amdhsa 
compiler that can run code via the included "gcn-run" loader tool (and 
the HSA runtime). This can be used to run the testsuite, with a little 
dejagnu config trickery.


It should also be possible to build an x86_64-linux-gnu compiler with 
--enable-offload-target=gcn, and a matching amdgcn-none-amdhsa compiler 
with --enable-as-accelerator-for=x86_64-linux-gnu, and have them run 
code offloaded with OpenMP or OpenACC directives.


The code is based on Honza's original port, rebased to GCC 7.3.

I'd like to agree an upstreaming strategy that
a) gets basic GCN support into trunk soonish. We'll need to get a few 
middle/front end patches approved, and probably update a few back-end 
hooks, but this ought to be easy enough.
b) gets trunk OpenMP/OpenACC to work for GCN, eventually. I'm expecting 
some pain in libgomp here.
c) gives me a stable base from which to make binary releases (i.e. not 
trunk).
d) allows me to use openacc-gcc-8-branch without too much duplication of 
effort.


How about the following strategy?

1. Create "gcn-gcc-7-branch" to archive the current work. This would be 
a source for merges (or cherry-picking), but I'd not expect much future 
development. Initially it would have the same content as the Github 
repository above.


2. Create "gcn-gcc-8-branch" with a merger of "gcc-8-branch" and 
"gcn-gcc-7-branch". This would be broken w.r.t. libgomp, initially, but 
get fixed up in time. It would receive occasional merges from the 
release branch. I expect to do GCN back-end development work here.


3. Create "gcn-openacc-gcc-8-branch" from the new "gcn-gcc-8-branch", 
and merge in "openacc-gcc-8-branch". This will hold offloading patches 
not compatible with trunk, and receive updated GCN changes via merge. I 
intend to deliver my next binary release from this branch.


4. Replace/update the existing "gcn" branch with a merger of "trunk" and 
"gcn-gcc-8-branch" (not the OpenACC branch). This would be merged to 
trunk, and possibly retired, as soon as possible. I imagine bits will 
have to be submitted as patches, and then the back-end merged as a whole.


trunk
 |\
 | gcc-7-branch
 | |\
 | : gcn-gcc-7-branch
 |  \
 |\  '.
 | gcc-8-branch   |
 | | \ '. |
 | :  openacc-gcc-8-branch  gcn-gcc-8-branch
 |   \   / |
 |  gcn-openacc-8-branch   |
 |\  ,-'
 | gcn
 |/
gcc-9

It's slightly complex to describe, but hopefully logical and workable.

Comments? Better suggestions?

--
Andrew Stubbs
Mentor Graphics / CodeSourcery.


Re: AMD GCN port

2018-05-11 Thread Andrew Stubbs

On 11/05/18 10:26, Richard Biener wrote:

Sounds good but I'd not do 1. given the github repo can serve as archiving
point, too.  Having 2. doesn't sound too useful over 3. so in the end I'd
do only 3. and 4.  Of course 1 and 2 might help you in doing 3 and 4.


Indeed, I've been worried that I'm basically planning to expose internal 
steps.


The problem I'm trying to solve with 2 is that what I need is 3, but 
that means code dependencies on things I don't own, which makes it 
harder to get to 4.



The other thing that's occurred to me is that with og8 being new, maybe 
it's a good time to merge the GCN stuff into that, and work with the 
NVidia folks to share it. [Adding Cesar and Thomas to CC.] I'm aware of 
some incompatibilities with og7, but those are going to need fixing 
either way.


Here's another proposal.

trunk
 |\
 | gcc-7-branch
 | |\
 | : gcn-gcc-7-branch (1 - possibly notional)
 | \
 |\ |
 | gcc-8-branch |
 | |\  /
 | | gcn-gcc-8-branch (2. trunk compatible)
 | |   |   '.
 | |\  ||
 | : openacc-gcc-8-branch (3. share existing)   |
 |  |
 |\  ,--'
 | gcn (4. temporary)
 |/
gcc-9


Obviously, the description "trunk compatible" would become less true 
over time, but it will be less diverged than og8. I suppose this branch 
could also be notional, only named internally, though?


I guess it makes no difference to me -- I'm going to have to go through 
all the steps anyway -- but it depends how transparent others would like 
me to be.


Andrew


Re: AMD GCN port

2018-05-11 Thread Andrew Stubbs

On 11/05/18 12:18, Andrew Stubbs wrote:
The other thing that's occurred to me is that with og8 being new, maybe 
it's a good time to merge the GCN stuff into that, and work with the 
NVidia folks to share it. [Adding Cesar and Thomas to CC.] I'm aware of 
some incompatibilities with og7, but those are going to need fixing 
either way.


I've spoken with Thomas. He's happy to take GCN patches there, or GCN 
related patches anyway, so that's an option.


I can use it as my upstream, or push my changes directly.

I guess it makes no difference to me -- I'm going to have to go through 
all the steps anyway -- but it depends how transparent others would like 
me to be.


The more I think about this, the more I'm coming to the conclusion that 
nobody but me really cares about GCN for GCC 8, and nobody cares about 
the development history, so I'm over-complicating my upstreaming plans.


I should just do what I need to do in my local repo, as before, and 
submit a somewhat flattened patch series for inclusion in trunk in the 
traditional manner, as soon as I have it.


Andrew


Vector pointer modes

2018-05-16 Thread Andrew Stubbs

Hi all,

I'm in the process of trying to update our AMD GCN port from GCC 7 to 
GCC 8+, but I've hit a problem ...


It seems there's a new assumption that pointers and addresses will be 
scalar, but GCN load instructions require vectors of pointers. 
Basically, machine_mode has been replaced with scalar_int_mode in 
many places, and we were relying on vector modes being allowed.


The changes are all coming from Richard Sandiford's SVE patches.

Is there a new way of dealing with vectors of pointers?

Thanks

Andrew


Re: Vector pointer modes

2018-05-16 Thread Andrew Stubbs

On 16/05/18 17:24, Richard Biener wrote:

On May 16, 2018 6:03:35 PM GMT+02:00, Andrew Stubbs  
wrote:

Is there a new way of dealing with vectors of pointers?


Maybe you can masquerade it behind a large scalar integer mode?...


We're using V64DImode to represent a vector of 64 64-bit pointers. The 
architecture can hold this in a pair of V64SImode registers; it is not 
equivalent to 128 consecutive smaller registers, as it would be with NEON.


We could use plain DImode to get the same effect from print_operand, but 
that then chooses the wrong alternative, or whole wrong insn pattern and 
bad things would happen.


Or, do you mean something else?

Andrew


Re: Vector pointer modes

2018-05-17 Thread Andrew Stubbs

On 16/05/18 22:01, Richard Sandiford wrote:

Andrew Stubbs  writes:

Hi all,

I'm in the process of trying to update our AMD GCN port from GCC 7 to
GCC 8+, but I've hit a problem ...

It seems there's a new assumption that pointers and addresses will be
scalar, but GCN load instructions require vectors of pointers.
Basically, machine_mode has been replaced with scalar_int_mode in many
places, and we were relying on vector modes being allowed.

The changes are all coming from the Richard Sandiford's SVE patches.


FWIW, I think that assumption was always there.  The scalar_int_mode
patches just made it more explicit (as in, more code would fail to
build if it wasn't honoured, rather than just potentially ICEing).


It was fine if done late enough, but now it's just blocked in 
TARGET_ADDR_SPACE_POINTER_MODE et al.


However, having now finished a first rough forward-port (with the 
relevant bits of these hooks commented out and gcc_unreachable), I find 
that vector loads and stores are working perfectly, and there are no 
related ICEs in the testsuite (although, with vector widths less than 64 
still on the to-do list, a lot of the testsuite doesn't do much 
vectorizing).



Is this mostly about the RTL level concept of an address or pointer?
If so, in what situations do you need the address or pointer itself to
be a vector?  SVE and AVX use unspecs for gathers and scatters, and I
don't think in practice we lose anything by doing that.


As far as the ISA is concerned, *all* vector loads and stores are 
scatter/gather.


In our port we model a normal, contiguous vector load/store as a DImode 
base pointer until reload_completed, and then have a splitter expand 
that into a V64DImode with the appropriate set of lane addresses. 
Ideally this would happen earlier, so as to allow CSE to optimize the 
expansion, but we've not got there yet (and, as you say, would probably 
hit trouble).


Andrew


DSE and maskstore trouble

2018-07-03 Thread Andrew Stubbs

Hi All,

I'm trying to implement maskload/maskstore for AMD GCN, which has up to 
64-lane, 512-byte fully-masked vectors. All seems fine as far as the 
vector operations themselves go, but I've found a problem with the RTL 
Dead Store Elimination pass.


Testcase gcc.c-torture/execute/20050826-2.c uses a maskstore to write 
the 14 DImode pointers all in one go. The problem is that DSE doesn't 
know that the store is masked and judges the width at 512 bytes, not the 
true 56 bytes. This leads it to eliminate prior writes to nearby stack 
locations, and therefore bad code.


Has anyone encountered this problem with SVE or AVX maskstore at all?

I was thinking of solving the problem by adding a target hook to query 
the true length of vector types. Any thoughts on that?


Andrew


Re: DSE and maskstore trouble

2018-07-03 Thread Andrew Stubbs

On 03/07/18 11:15, Richard Biener wrote:

AVX ones are all UNSPECs I believe - how do your patterns look like?


AVX has both unspec and vec_merge variants (at least for define_expand, 
in GCC8), but in any case, AFAICT dse.c only cares about the destination 
MEM, and all the AVX and SVE patterns appear to use nothing special there.



I was thinking of solving the problem by adding a target hook to query
the true length of vector types. Any thoughts on that?


It isn't about the length but about the mask because there can be mask
values that do not affect the length?


The problem I have right now is that the vector write conflicts with 
writes to distinct variables, in which case the vector length is what's 
important, and it's probably(?) safe to assume that if the vector mask 
is not constant then space for the whole vector has been allocated on 
the stack.


But yes, in general it's true that subsequent writes to the same vector 
could well write distinct elements, in which case the value of the mask 
is significant to DSE analysis.


Andrew


Re: DSE and maskstore trouble

2018-07-03 Thread Andrew Stubbs

On 03/07/18 11:33, Andrew Stubbs wrote:

On 03/07/18 11:15, Richard Biener wrote:

AVX ones are all UNSPECs I believe - how do your patterns look like?


AVX has both unspec and vec_merge variants (at least for define_expand, 
in GCC8), but in any case, AFAICT dse.c only cares about the destination 
MEM, and all the AVX and SVE patterns appear to use nothing special there.


Sorry, my patterns look something like this:

(set (mem:V64DI (reg:DI))
     (vec_merge:V64DI (reg:V64DI) (unspec ...) (reg:DI)))

Where the unspec just means that the destination remains unchanged. We 
could also use (match_dup 0) there, but we don't, so probably there was 
an issue with that at some point.



I was thinking of solving the problem by adding a target hook to query
the true length of vector types. Any thoughts on that?


It isn't about the length but about the mask because there can be mask
values that do not affect the length?


The problem I have right now is that the vector write conflicts with 
writes to distinct variables, in which case the vector length is what's 
important, and it's probably(?) safe to assume that if the vector mask 
is not constant then space for the whole vector has been allocated on 
the stack.


But yes, in general it's true that subsequent writes to the same vector 
could well write distinct elements, in which case the value of the mask 
is significant to DSE analysis.


Andrew




Re: DSE and maskstore trouble

2018-07-03 Thread Andrew Stubbs

On 03/07/18 12:02, Richard Biener wrote:

I believe that the AVX variants like

(define_expand "maskstore"
   [(set (match_operand:V48_AVX512VL 0 "memory_operand")
 (vec_merge:V48_AVX512VL
   (match_operand:V48_AVX512VL 1 "register_operand")
   (match_dup 0)
   (match_operand: 2 "register_operand")))]
   "TARGET_AVX512F")

are OK since they represent a use of the memory due to the match_dup 0
while your UNSPEC one doesn't so as the store doesn't actually take
place to all of 0 your insn variant doesn't represent observable behavior.


Hmm, so they're safe, but may prevent the optimization of nearby variables?

What about the unspec AVX variant?

Andrew


Re: DSE and maskstore trouble

2018-07-03 Thread Andrew Stubbs

On 03/07/18 12:30, Richard Biener wrote:

Hmm, so they're safe, but may prevent the optimization of nearby variables?


Yes, they prevent earlier stores into lanes that are "really" written
to to be DSEd.


Right, but I have unrelated variables allocated to the stack within the 
"shadow" of the masked vector. I didn't ask it to do that, it just does, 
so I presume this is an expected feature of masked vectors with a known mask.


Surely this prevents valid optimizations on those variables?

Andrew


Re: DSE and maskstore trouble

2018-07-03 Thread Andrew Stubbs

On 03/07/18 12:45, Richard Biener wrote:

On Tue, Jul 3, 2018 at 1:38 PM Andrew Stubbs  wrote:


On 03/07/18 12:30, Richard Biener wrote:

Hmm, so they're safe, but may prevent the optimization of nearby variables?


Yes, they prevent earlier stores into lanes that are "really" written
to to be DSEd.


Right, but I have unrelated variables allocated to the stack within the
"shadow" of the masked vector. I didn't ask it to do that, it just does,
so I presume this is an expected feature of masked vectors with a known mask.


Huh, I don't think so.  I guess that's the real error and I wonder how
that happens.
Are those just spills or real allocations?


The code from the testcase looks like this:

struct rtattr rt[2];
struct rtattr *rta[14];
int i;

rt[0].rta_len = sizeof (struct rtattr) + 8;
rt[0].rta_type = 0;
rt[1] = rt[0];
for (i = 0; i < 14; i++)
  rta[i] = &rt[0];

The initialization of rt[0] and rt[1] are being deleted because the 
write to rta[0..13] would overwrite rt if it had actually been the 
maximum rta[0..63].


That, or I've been staring at dumps too long and gone crazy.

Andrew

P.S. I'm trying to test with (match_dup 0), but LRA exploded.


Re: DSE and maskstore trouble

2018-07-03 Thread Andrew Stubbs

On 03/07/18 13:21, Richard Biener wrote:

Ok, so if we vectorize the above with 64 element masked stores
then indeed the RTL representation is _not_ safe.  That is because
while the uses in the masked stores should prevent things from
going bad there is also TBAA to consider which means those
uses might not actually _be_ uses (TBAA-wise) of the earlier
stores.  In the above case rtattr * doesn't alias int (or whatever
types rta_type or rta_len have).  That means to DSE the earlier
stores are dead.


I managed to get it to generate maskstore without the unspec, and the 
code now runs correctly.


I don't follow your AA reasoning. You say the use stops it being bad, 
and then you say the stores are dead, which sounds bad, yet it's not 
deleting them now.


Confused. :-(

Andrew


Re: DSE and maskstore trouble

2018-07-03 Thread Andrew Stubbs

On 03/07/18 14:52, Richard Biener wrote:

If you look at RTL dumps (with -fstrict-aliasing, thus -O2+) you should
see MEM_ALIAS_SETs differing for the earlier stores and the masked
store uses.

Now I'm of course assuming DSE is perfect, maybe it isn't ... ;)


Ok, I see that the stores have MEMs with different alias sets, indeed. I 
can't quite work out if that means it's safe, or unsafe? Do I still need 
to zero the set?


For masked stores, clearly the current DSE implementation must be 
sub-optimal because it ignores the mask. Writing it as a 
load/modify/write means that stores are not erroneously removed, but 
also means that redundant writes (to masked vector destinations) are 
never removed. Anyway, at least I know how to make that part safe now.
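
For the record, the shape that makes DSE behave is the same as the 
earlier example in this thread, but with the destination MEM reused as 
the "else" value, i.e. the (match_dup 0) form rather than the unspec. 
A sketch only, not the literal port pattern:

  (set (mem:V64DI (reg:DI))
       (vec_merge:V64DI (reg:V64DI)
                        (mem:V64DI (reg:DI))   ; same MEM as the destination
                        (reg:DI)))             ; lane mask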


Thanks

Andrew


Re: Triplet for ARM Linux HardFP ABI, again

2011-03-01 Thread Andrew Stubbs

On 21/02/11 10:12, Guillem Jover wrote:

This was already discussed in this list some time ago [0]. But it came
up again when restarting the discussion for the proposed new armhf port
for Debian.

   [0]

My arguments for why a distinct triplet is needed can be found in [1],
it's a bit long though. Most of the points there revolve around the
fact that we rely on the toolchains as configured by_default_  to
produce the expected output targetting a concrete architecture, it
also has implications for the file system paths.

   [1]

It seems from reading the past discussion on this list that the main
objection was that the triplet should not be used to decide what
floating point ABI to use in gcc. No problem with that!


Up front, let me say I disagree with the previous finding that the 
triplet isn't the right place for this kind of thing. It's clear to me 
that there's plenty of prior art here, and it would work for us very 
nicely, thank you very much. That said, there are down-sides to 
target-triplets (not least that once you've chosen one, you find 
yourself stuck with it for backward compatibility, even after it makes 
no sense any more), and many people seem to believe it would be better 
if they had never been invented, so .


I fail to see how abusing the OS/ABI field is any better than abusing 
the vendor field?


The patch you posted is surely just the tip of the iceberg - there are 
thousands of packages in Debian, and any one of them might need 
adjustment to cope with this change.


When I proposed a new triplet before, in the thread you referenced 
above, I proposed having an 'official' name that everyone would agree 
on. That would have been disruptive. Your triplet would be unofficial, 
so I would say it would be hard to justify all that disruption. In the 
worst case, third parties would start to use your unofficial triplet 
unconditionally, and would need fixing up to work with anything that is 
not Debian.


In July's thread, it was decided (sort of) that the compiler should not 
choose its (micro-)configuration based on the triplet. I didn't really 
agree with that, but there it is. You've decided to stick with that, and 
have the triplet influence only your build-system. Surely that's exactly 
what the 'vendor' field is for? It seems like (it has to be) a 
vendor-specific configuration to me.


Adjusting the vendor field should not break any of those thousands of 
packages (although, no doubt there'll be the odd one or two). It will 
give you your differentiated pathnames. It will tell your build-system 
what to do. Why do it the hard way if there is no advantage?


Andrew


Re: running GCC without any input files, but with plugins???

2011-04-08 Thread Andrew Stubbs

On 08/04/11 07:04, Basile Starynkevitch wrote:

So I am dreaming of a way to run gcc with cc1 but without input files.
Perhaps something like
   gcc -fplugin=foo.so -fplugin-arg-foo-bar=bee -frun-cc1-without-input


gcc -fplugin=foo.so -fplugin-arg-foo-bar=bee -x c /dev/null

Andrew


Re: Syncing with Launchpad Bug Tracker

2011-05-03 Thread Andrew Stubbs

On 27/04/11 18:29, Deryck Hodge wrote:

I work at Canonical on Launchpad and am trying to setup syncing
between our bug tracker and the GCC bug tracker.  Specifically, we
want to enable comment syncing between linked bugs on our trackers and
back links from your Bugzilla to the Launchpad bug.  Currently we only
sync status and importance from your tracker to ours.  We need
credentials for an account on your tracker to setup this additional
syncing.  If we had such credentials, we would sync comments and back
links only if a Launchpad bug is linked to a GCC Bugzilla bug.  We
store these credentials in private configs on Launchpad.  We have this
setup for a number of other trackers.  The Mozilla Bugzilla comes to
mind as a tracker we do this with already.


Will it be possible to post comments to Launchpad, and not have them 
automatically posted to bugzilla?


I sometimes post Linaro-specific details to the Launchpad bug which 
would be meaningless noise in upstream bugzilla. The bug is linked 
because the problem is the same, but typically the source-base is 
different, and/or we want to add additional tracking details, or whatever.


e.g. I don't think bugzilla readers are interested in a comment like 
"Bug targeted at Linaro x.x release", or "Patch committed to 
lp:gcc-linaro/4.5 revision 99456".


Andrew


Re: [SH] ICE compiling pr34330 testcase for sh-linux-gnu

2009-07-30 Thread Andrew Stubbs

On 09/07/09 19:11, Ian Lance Taylor wrote:

Andrew Stubbs  writes:


The problem insn is created by gen_reload when it is given the
following rtl as input:

(plus:SI (plus:SI (reg/v/f:SI 4 r4 [orig:192 a ] [192])
 (const_int 2 [0x2]))
 (reg:SI 0 r0 [orig:188 ivtmp.24 ] [188]))


You need to backtrack before that point to see why find_reloads let that
go through.


OK, I've gone through it all trying to understand it, but this is fairly 
complex code and I don't claim to get it.


Here's my analysis of the problem.

The problem instruction in pr34330.c.181r.sched1 (the state before 
register allocation/reload) is:


(insn 97 103 111 5 .../pr34330.c:17
   (set (mem/s:HI (reg/f:SI 239 [ D.1960 ]) [6 D.1960_8->s1+0 S2 A16])
        (subreg:HI (reg:SI 257) 2))
   187 {movhi_i} (expr_list:REG_DEAD (reg:SI 257)
        (nil)))

The reloads for this instruction are:

Reloads for insn # 97
Reload 0: reload_in (SI) = (plus:SI (reg/v/f:SI 4 r4 [orig:250 a ] [250])
                                    (const_int 2 [0x2]))
        GENERAL_REGS, RELOAD_FOR_INPUT_ADDRESS (opnum = 1)
        reload_in_reg: (plus:SI (reg/v/f:SI 4 r4 [orig:250 a ] [250])
                                (const_int 2 [0x2]))
        reload_reg_rtx: (reg:SI 8 r8)
Reload 1: reload_in (SI) = (plus:SI (plus:SI (reg/v/f:SI 4 r4 [orig:250 a ] [250])
                                             (const_int 2 [0x2]))
                                    (reg:SI 0 r0 [orig:243 ivtmp.11 ] [243]))
        GENERAL_REGS, RELOAD_FOR_INPUT_ADDRESS (opnum = 1)
        reload_in_reg: (plus:SI (plus:SI (reg/v/f:SI 4 r4 [orig:250 a ] [250])
                                         (const_int 2 [0x2]))
                                (reg:SI 0 r0 [orig:243 ivtmp.11 ] [243]))
        reload_reg_rtx: (reg:SI 9 r9)
Reload 2: reload_out (HI) = (mem/s:HI (reg/f:SI 2 r2 [orig:239 D.1960 ] [239])
                                      [6 D.1960_8->s1+0 S2 A16])
        NO_REGS, RELOAD_FOR_OUTPUT (opnum = 0), optional
        reload_out_reg: (mem/s:HI (reg/f:SI 2 r2 [orig:239 D.1960 ] [239])
                                  [6 D.1960_8->s1+0 S2 A16])
Reload 3: reload_in (HI) = (mem:HI (plus:SI (plus:SI (reg/v/f:SI 4 r4 [orig:250 a ] [250])
                                                     (const_int 2 [0x2]))
                                            (reg:SI 0 r0 [orig:243 ivtmp.11 ] [243])) [3 S4 A32])
        GENERAL_REGS, RELOAD_FOR_INPUT (opnum = 1), can't combine
        reload_in_reg: (subreg:HI (reg:SI 257) 2)
        reload_reg_rtx: (reg:HI 8 r8)


Which results in this instruction sequence in pr34330.c.183r.ira:

(insn 169 103 170 4 .../pr34330.c:17 (set (reg:SI 8 r8)
        (const_int 2 [0x2])) 175 {movsi_ie} (nil))

(insn 170 169 171 4 .../pr34330.c:17 (set (reg:SI 8 r8)
        (plus:SI (reg:SI 8 r8)
            (reg/v/f:SI 4 r4 [orig:250 a ] [250]))) 35 {*addsi3_compact}
     (expr_list:REG_EQUIV (plus:SI (reg/v/f:SI 4 r4 [orig:250 a ] [250])
            (const_int 2 [0x2]))
        (nil)))

(insn 171 170 172 4 .../pr34330.c:17 (set (reg:SI 9 r9)
        (plus:SI (reg:SI 8 r8)
            (reg:SI 0 r0 [orig:243 ivtmp.11 ] [243]))) 35 {*addsi3_compact}
     (nil))

(insn 172 171 97 4 .../pr34330.c:17 (set (reg:HI 8 r8)
        (mem:HI (reg:SI 9 r9) [3 S4 A32])) 187 {movhi_i} (nil))

(insn 97 172 111 4 .../pr34330.c:17 (set (mem/s:HI (reg/f:SI 2 r2
            [orig:239 D.1960 ] [239]) [6 D.1960_8->s1+0 S2 A16])
        (reg:HI 8 r8)) 187 {movhi_i} (nil))


The problem is in insn 171 where r9 != r8 and therefore cannot match an 
SH 2-operand add instruction.


Looking back at how this happens, insn 171 is created by gen_reload from 
reload #1 and initially looks like this:


(insn 171 170 172 5 .../pr34330.c:17 (set (reg:SI 9 r9)
(plus:SI (plus:SI (reg/v/f:SI 4 r4 [orig:250 a ] [250])
(const_int 2 [0x2]))
(reg:SI 0 r0 [orig:243 ivtmp.11 ] [243]))) -1 (nil))

This is then transformed by subst_reloads to the final broken form:

(insn 171 170 172 5 .../pr34330.c:17 (set (reg:SI 9 r9)
(plus:SI (reg:SI 8 r8)
(reg:SI 0 r0 [orig:243 ivtmp.11 ] [243]))) -1 (nil))

This is logically correct as r9 genuinely does contain the result of the 
substituted expression, but it does not satisfy the constraints.


And here's where I get stuck. I don't know where in the code it's 
supposed to check that the code will be correct after the substitution.


The (plus (plus ...) ...) rtl is generated by 
find_reloads_subreg_address using make_memloc and plus_constant, and it 
seems correct in itself.


Any help would be appreciated.

Thanks

Andrew


Aliasing bug

2009-07-02 Thread Andrew Stubbs

Hi all,

I'm fairly sure I have found an aliasing bug in GCC, although I could be 
wrong. I've reproduced it in both 4.4 and mainline.


Consider this testcase, aliasing.c:

extern void *foo;

extern inline short **
f1 (void)
{
  union
{
  void **v;
  short **s;
} u;

  u.v = (&foo);
  if (*u.s == 0) *u.s = (short *)42;
  return u.s;
}

const short *a, *b;

int
f ()
{
  a = *f1();
  b = *f1();
}

The (not very useful) testcase initialises foo to 42, if necessary, and 
then sets both 'a' and 'b' to equal foo. There should be no way that 'a' 
and 'b' can ever be set to zero.


Compile the code as follows:

  sh-linux-gnu-gcc -c aliasing.c -O2 -fdump-tree-all

The dump file aliasing.c.133t.optimized (the last tree dump) then contains:

f ()
{
  void * foo.3;
  short int * D.1982;
  short int * * D.1973;

:
  foo.3_10 = foo;
  D.1982_26 = (short int *) foo.3_10;
  if (D.1982_26 == 0B)
goto ;
  else
goto ;

:
  D.1973_13 = (short int * *) &foo;
  *D.1973_13 = 42B;
  a = 0B;

:
  b = D.1982_26;
  return;

:
  a = D.1982_26;
  goto ;

}

This is the state of the code after the tree optimisations. Both 'a' and 
'b' are set to the initial value of foo, before it was initialised. Not 
only that, but 'a' is explicitly set to zero.


This problem goes away if -fno-strict-aliasing is used.

Is this a compiler bug? Or have I got something wrong in my code?

Thanks

Andrew


Re: Aliasing bug

2009-07-02 Thread Andrew Stubbs

On 02/07/09 14:26, Richard Guenther wrote:

You are writing to memory of type void * via an lvalue of type short *.


Yes, there is type punning there, but that should work, shouldn't it?

This code is distilled from some glibc code I'm having trouble with.

Andrew


Re: Aliasing bug

2009-07-02 Thread Andrew Stubbs

On 02/07/09 14:34, Richard Guenther wrote:

No, that's invalid.  You would have to do

extern union {
   void *foo;
   short *bar;
};

using the union for the double-indirect pointer doesn't help.  Or
simply use memcpy to store to foo.


Ah, I did not know that. I still don't understand how a reference to a 
memory location that happens to contain a pointer is different to one 
that contains other data?
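
Writing out the memcpy suggestion for my own benefit -- a sketch only, 
with a helper name I've just invented, and assuming short * and void * 
share a representation (true on the targets discussed here):

  #include <string.h>

  extern void *foo;

  static short *
  read_foo_as_short_ptr (void)
  {
    short *s;
    memcpy (&s, &foo, sizeof s);      /* read foo's bytes as a short *  */
    if (s == 0)
      {
        s = (short *) 42;
        memcpy (&foo, &s, sizeof s);  /* and store it back  */
      }
    return s;
  }

Unlike the union trick in my testcase, the stores here really are to an 
object of type void *, so there is nothing for TBAA to misinterpret.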


Anyway, I see that the glibc code has, in fact, already been fixed here: 
http://sourceware.org/ml/libc-alpha/2008-11/msg4.html


Thank you.

Andrew


[SH] ICE compiling pr34330 testcase for sh-linux-gnu

2009-07-09 Thread Andrew Stubbs

I'm having trouble with an ICE, and I'm hoping somebody can enlighten me.

Given the following command:

cc1 -fpreprocessed ../pr34330.i -quiet -dumpbase pr34330.c -da -mb 
-auxbase-strip pr34330.c -Os -version -ftree-parallelize-loops=4 
-ftree-vectorize -o pr34330.s -fschedule-insns


I get an internal compiler error:

GNU C (GCC) version 4.5.0 20090702 (experimental) (sh-linux-gnu)
compiled by GNU C version 4.3.2, GMP version 4.3.1, MPFR 
version 2.4.1-p5, MPC version 0.6

GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
GNU C (GCC) version 4.5.0 20090702 (experimental) (sh-linux-gnu)
compiled by GNU C version 4.3.2, GMP version 4.3.1, MPFR 
version 2.4.1-p5, MPC version 0.6

GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
Compiler executable checksum: c91a929a0209c0670a3ae8b8067b9f9a
/scratch/ams/4.4-sh-linux-gnu-lite/src/gcc-trunk-4.4/gcc/testsuite/gcc.dg/torture/pr34330.c: 
In function 'foo':
/scratch/ams/4.4-sh-linux-gnu-lite/src/gcc-trunk-4.4/gcc/testsuite/gcc.dg/torture/pr34330.c:22:1: 
error: insn does not satisfy its constraints:
(insn 171 170 172 4 
/scratch/ams/4.4-sh-linux-gnu-lite/src/gcc-trunk-4.4/gcc/testsuite/gcc.dg/torture/pr34330.c:17 
(set (reg:SI 9 r9)

(plus:SI (reg:SI 8 r8)
(reg:SI 0 r0 [orig:243 ivtmp.11 ] [243]))) 35 
{*addsi3_compact} (nil))
/scratch/ams/4.4-sh-linux-gnu-lite/src/gcc-trunk-4.4/gcc/testsuite/gcc.dg/torture/pr34330.c:22:1: 
internal compiler error: in reload_cse_simplify_operands, at 
postreload.c:396

Please submit a full bug report,
with preprocessed source if appropriate.
See  for instructions.

The problem is that r8 != r9, but SH requires that they be the same register.

The problem insn is created by gen_reload when it is given the following 
rtl as input:


(plus:SI (plus:SI (reg/v/f:SI 4 r4 [orig:192 a ] [192])
(const_int 2 [0x2]))
(reg:SI 0 r0 [orig:188 ivtmp.24 ] [188]))

The problem appears to be that the nested plus does not match any of the 
patterns it recognizes so it falls through to the final else clause:


  /* Otherwise, just write (set OUT IN) and hope for the best.  */
  else
emit_insn (gen_rtx_SET (VOIDmode, out, in));

... which doesn't even attempt to check the constraints.

Is this an unexpected corner case for reload? Or is the input RTL 
malformed somehow?


This case fails in both GCC 4.4 and SVN trunk (although the latter has 
disabled -fschedule-insns by default so it needs to be re-enabled 
explicitly).


Thanks for any help.

Andrew


Triplet for ARM Linux HardFP ABI

2010-07-12 Thread Andrew Stubbs

Hi All,

Both Linaro and Debian are considering supporting the ARM hard-float 
variant of the EABI, at least as an unofficial port. This ABI is not 
compatible with the gnueabi currently in use for most ARM Linux distros, 
but has a number of performance advantages.


This means that we need to choose a name for it. Obviously, it's better 
if it's an "official" name, so I want to discuss it here. I'm aware that 
there is some bikeshedding to do here, but it's better it gets done 
before anybody gets stuck with something else. There are, of course, 
some real practical reasons why one name might be better than another.


So here are my suggestions:

  arm-linux-gnueabihf
   or maybe
  arm-linux-gnueabi-hf

These will match any package that uses arm*-*-linux-gnueabi*.
Choosing which variant is mainly a matter of taste.

  arm-linux-gnuhfeabi

These will match any package that uses arm*-*-linux-*eabi (as I
see gcc itself does).

I'm not sure which is better. I suspect that, either way, a lot of 
things will need to be fixed up.


An alternative would be to use the vendor field. That would be less 
difficult, but it feels like something of a hack to me.


FAOD, the new triplet would only set the default ABI variant. This can 
already be achieved via configure options, so this adds no real new 
functionality. This is just about agreeing how to label it.
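
(To be concrete, the sort of configuration I mean is along the lines of

  .../configure --target=arm-none-linux-gnueabi \
      --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard

where it is --with-float=hard that selects the hard-float ABI default; 
the --with-arch/--with-fpu values are only examples.)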


Andrew Stubbs
CodeSourcery (currently working with Linaro)


Re: Triplet for ARM Linux HardFP ABI

2010-07-12 Thread Andrew Stubbs

On 12/07/10 15:51, Richard Earnshaw wrote:

if we could turn back the clock, I'd even prefer

arm-linux_gnu_hf_eabi to get back to a single '-'-parsed OS string, but
the linux-gnu stuff is now entrenched, so trying to change back would only
cause more disruption.


quadruplets, quintuplets and even sextuplets wouldn't be a problem if
all the preceding parts were compulsory.  The problem is that the vendor
field is optional, so now arm-linux-gnueabi is ambiguous.  Is that a
quadruplet that's missed out the vendor part, or a triplet?


Right, hence arm-none-linux-gnueabi. I forgot about that, but I'm only 
really talking about the OS part here anyway.


Andrew


Re: Triplet for ARM Linux HardFP ABI

2010-07-15 Thread Andrew Stubbs

On 12/07/10 14:25, Andrew Stubbs wrote:

This means that we need to choose a name for it. Obviously, it's better
if it's an "official" name, so I want to discuss it here. I'm aware that
there is some bikeshedding to do here, but it's better it gets done
before anybody gets stuck with something else. There are, of course,
some real practical reasons why one name might be better than another.


So, it seems this issue is not as simple as I thought. :(

Opinion seem to be somewhat divided, but in the absence of any sort of 
consensus, I suppose I'll have to propose that the various projects use 
the vendor field.


The alternative would be to add a configure test that checked the 
defaults in the existing host compiler, and duplicated the defaults 
somehow, but that sounds somewhat icky.


Andrew


Re: define_split

2010-11-10 Thread Andrew Stubbs

On 09/11/10 22:54, Michael Meissner wrote:

The split pass would then break this back into three insns:

(insn ... (set (reg:SF ACC_REGISTER)
   (mult:SF (reg:SF 124)
(reg:SF 125

(insn ... (set (reg:SF ACC_REGISTER)
   (plus:SF (reg:SF ACC_REGISTER)
(reg:SF 127

(insn ... (set (reg:SF 126)
   (reg:SF ACC_REGISTER)))

Now, if you just had the split and no define_insn, combine would try and form
the (plus (mult ...)) and not find an insn to match, so while it had the
temporary insn created, it would try to immediately split the insn, so at the
end of the combine pass, you would have:

(insn ... (set (reg:SF ACC_REGISTER)
   (mult:SF (reg:SF 124)
(reg:SF 125

(insn ... (set (reg:SF ACC_REGISTER)
   (plus:SF (reg:SF ACC_REGISTER)
(reg:SF 127

(insn ... (set (reg:SF 126)
   (reg:SF ACC_REGISTER)))


I'm trying to follow this example for my own education, but these two 
example results appear to be identical.


Presumably this isn't deliberate?

Andrew


Re: combine two load insns

2010-12-12 Thread Andrew Stubbs

On 08/12/10 14:39, Jeff Law wrote:
>> Sorry, I think I wasn't clear. I didn't mean constraints in term on
>> RTL template constraints, but 'constraints' coming from the new DI
>> destination of the load. More specifically: 2 SI loads can target
>> totally independent registers whereas a standard DI load must target a
>> contiguous SI register pair. If you don't do that before IRA, it will
>> most likely be impossible to do cleanly, won't it?
>
> I tend to look at it the other way -- prior to allocation & reload
> you're going to have two SImode pseudos and there's no way to guarantee
> they'll end up in consecutive hard registers.  You'd have to create a
> new DImode pseudo as the destination of the memory load, then copy from
> the DImode pseudo into the two SImode pseudos and rely on the register
> allocator to allocate the DImode pseudo to the same hard registers as
> the two SImode pseudos.  There's no guarantee that'll happen (it often
> will, but in the cases where it doesn't you end up with useless copies).

Fred's approach seems like the obviously 'better' way to me, although no 
doubt Jeff's way is easier to handle.


I've been musing about the best way to handle the ARM ldm/stm 
instructions that can coalesce an arbitrary number of register 
loads/stores into a single instruction. The only constraint is that the 
values must always appear in the same order in memory as in the register 
file. There is no requirement for contiguity.
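
To illustrate, a hand-written example (not GCC output):

  ldmia   r0, {r1, r3, r7}   @ r1 <- [r0], r3 <- [r0+4], r7 <- [r0+8]

The register list need not be contiguous, but the lowest-numbered 
register always takes the lowest address.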


Right now, the compiler does support ldm/stm, but only to a limited 
extent. I've yet to absorb it fully, but I think it's limited to 4 
registers at once, and relies on the registers being allocated helpfully.


It would be nice if the register allocator could be ldm/stm aware, 
somehow. It would also be nice if this could be done in such a way that 
ldm/stm could be used in asm inserts, given the right constraints.


It would be nice ...

Andrew


Re: Enable the vectorizer at -O2 for GCC 12

2021-09-01 Thread Andrew Stubbs

On 31/08/2021 05:13, Jeff Law wrote:



On 8/30/2021 9:30 PM, Hongtao Liu via Gcc wrote:
On Tue, Aug 31, 2021 at 11:11 AM Kewen.Lin via Gcc  
wrote:

on 2021/8/30 10:11 PM, Bill Schmidt wrote:

On 8/30/21 8:04 AM, Florian Weimer wrote:

There has been a discussion, both off-list and on the gcc-help mailing
list (“Why vectorization didn't turn on by -O2”, spread across several
months), about enabling the auto-vectorizer at -O2, similar to what
Clang does.

I think the review concluded that the very cheap cost model should be
used for that.

Are there any remaining blockers?

Hi Florian,

I don't think I'd characterize it as having blockers, but we are 
continuing to investigate small performance issues that arise with 
very-cheap, including some things that regressed in GCC 12.  Kewen 
Lin is leading that effort.  Kewen, do you feel we have any major 
remaining concerns with this plan?



Hi Florian & Bill,

There are some small performance issues like PR101944 and PR102054, and
still two degraded bmks (P9 520.omnetpp_r -2.41% and P8 526.blender_r
-1.31%) to be investigated/clarified, but since their performance numbers
with separated loop and slp vectorization options look neutral, they are
very likely noises.  IMHO I don't think they are/will be blockers.

So I think it's good to turn this on by default for Power.

The intel side is also willing to enable O2 vectorization after
measuring performance impact for SPEC2017 and eembc.
Meanwhile we are investigating PR101908/PR101909/PR101910/PR92740
which are reported O2 vectorization regresses extra benchmarks on
znver and kabylake.
We'd like to see it on for our processor as well.  Though I don't have 
numbers I can share at this time.


AMD GCN probably ought to have it on too, possibly set to maximum ... a 
GPU without vectors is pretty terrible.


Andrew


Complex multiply optimization working?

2022-04-11 Thread Andrew Stubbs

Hi all,

I've been looking at implementing the complex multiply patterns for the 
amdgcn port, but I'm not getting the code I was hoping for. When I try 
to use the patterns on x86_64 or AArch64 they don't seem to work there 
either, so is there something wrong with the middle-end? I've tried both 
current HEAD and GCC 11.


The example shown in the internals manual is a simple loop multiplying 
two arrays of complex numbers, and writing the results to a third. I had 
expected that it would use the largest vectorization factor available, 
with the real/imaginary numbers in even/odd lanes as described, but the 
vectorization factor is only 2 (so, a single complex number), and I have 
to set -fvect-cost-model=unlimited to get even that.


I tried another example with SLP and that too uses the cmul patterns 
only for a single real/imaginary pair.


Did proper vectorization of cmul ever really work? There is a case in 
the testsuite for the pattern match, but it isn't in a loop.


Thanks

Andrew

P.S. I attached my testcase, in case I'm doing something stupid.

P.P.S. The manual says the pattern is "cmulm4", etc., but it's actually 
"cmulm3" in the implementation.

typedef _Complex double complexT;
#define arraysize 256

void f(
complexT a[restrict arraysize],
complexT b[restrict arraysize],
complexT c[restrict arraysize]
   )
{
#if defined(LOOP)
  for (int i = 0; i < arraysize; i++)
c[i] = a[i] * b[i];
#else

c[0] = a[0] * b[0];
c[1] = a[1] * b[1];
c[2] = a[2] * b[2];
c[3] = a[3] * b[3];
c[4] = a[4] * b[4];
c[5] = a[5] * b[5];
c[6] = a[6] * b[6];
c[7] = a[7] * b[7];
c[8] = a[8] * b[8];
c[9] = a[9] * b[9];
c[10] = a[10] * b[10];
c[11] = a[11] * b[11];
c[12] = a[12] * b[12];
c[13] = a[13] * b[13];
c[14] = a[14] * b[14];
c[15] = a[15] * b[15];
c[16] = a[16] * b[16];
c[17] = a[17] * b[17];
c[18] = a[18] * b[18];
c[19] = a[19] * b[19];
c[20] = a[20] * b[20];
c[21] = a[21] * b[21];
c[22] = a[22] * b[22];
c[23] = a[23] * b[23];
c[24] = a[24] * b[24];
c[25] = a[25] * b[25];
c[26] = a[26] * b[26];
c[27] = a[27] * b[27];
c[28] = a[28] * b[28];
c[29] = a[29] * b[29];
c[30] = a[30] * b[30];
c[31] = a[31] * b[31];
c[32] = a[32] * b[32];
#endif
}


Re: Complex multiply optimization working?

2022-04-11 Thread Andrew Stubbs

On 11/04/2022 13:02, Richard Biener wrote:

You need to check the vectorizer dump whether a complex pattern
was recognized or not.  Did you properly use -ffast-math?


Aha! I needed to enable -ffast-math.

I missed that this is unsafe, and there's a fall-back to _muldc3 on NaN.

OK, presumably I need to implement a vector version of the fall-back 
libcall if I want this to work without ffast-math.
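
For my notes, the sort of command line I'm now using to check this (the 
flags are simply reconstructed from this thread, and "cmul-test.c" is 
just my name for the attached testcase):

  gcc -O3 -ffast-math -DLOOP -fdump-tree-vect-details -S cmul-test.c

The interesting part is whether the vectorizer dump reports the 
complex-multiply pattern being recognised.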


Thanks

Andrew


Re: Complex multiply optimization working?

2022-04-11 Thread Andrew Stubbs

On 11/04/2022 13:03, Tamar Christina wrote:

They work fine in both GCC 11 and HEAD https://godbolt.org/z/Mxxz6qWbP
Did you actually enable the instructions?


Yes, as I said it uses the instructions, just not fully vectorized. 
Anyway, the problem was I needed -ffast-math to skip the NaN checks.



There are both SLP and LOOP variants in the testsuite. All the patterns are 
inside of a loop
The mul tests are generated from 
https://github.com/gcc-mirror/gcc/blob/master/gcc/testsuite/gcc.dg/vect/complex/complex-mul-template.c

Where the tests that use of this template instructs the vectorizer to unroll 
some cases
and others they're kept as a loop. So both are tested in the testsuite.


Thanks. This is helpful. My grep skills clearly need work.

Andrew


Re: Clarification on newlib version for building AMDGCN offloading backend

2023-03-07 Thread Andrew Stubbs

On 06/03/2023 19:23, Wileam Yonatan Phan via Gcc wrote:

Hi,

I'm working on adding a build recipe for GCC with AMDGCN offloading backend in 
Spack. Can anyone clarify the following sentence listed on the wiki?


The Newlib version needs to be contemporaneous with GCC, at least until the ABI 
is finalized.



What are the correct contemporaneous versions for each version of GCC >= 10?


Just match the dates and you'll probably be fine. We've mostly 
synchronised the ABI changes across the GCC mainline and the development 
branch precisely because the Newlib dependency is shared.


Right now the required version of Newlib is 4.3.0.20230120. Prior to the 
ABI change a month or so ago you would have to use a Newlib snapshot.
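
(For a combined-tree build, the usual arrangement is to symlink the 
Newlib source into the top of the GCC source tree, something like

  ln -s /path/to/newlib-4.3.0.20230120/newlib /path/to/gcc-src/newlib

with the paths adjusted to your layout; see the wiki for the exact recipe.)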


I wouldn't recommend spending very much of your valuable time on 
enabling old versions of these toolchains.


Andrew


Libgcc divide vectorization question

2023-03-21 Thread Andrew Stubbs

Hi all,

I want to be able to vectorize divide operators (softfp and integer), 
but amdgcn only has hardware instructions suitable for -ffast-math.


We have recently implemented vector versions of all the libm functions, 
but the libgcc functions aren't builtins and therefore don't use those 
hooks.


What's the best way to achieve this? Add a new __builtin_div (and 
__builtin_mod) that tree-vectorize can find, perhaps? Or something else?


Thanks

Andrew


Re: Libgcc divide vectorization question

2023-03-22 Thread Andrew Stubbs

On 22/03/2023 10:09, Richard Biener wrote:

On Tue, Mar 21, 2023 at 6:00 PM Andrew Stubbs  wrote:


Hi all,

I want to be able to vectorize divide operators (softfp and integer),
but amdgcn only has hardware instructions suitable for -ffast-math.

We have recently implemented vector versions of all the libm functions,
but the libgcc functions aren't builtins and therefore don't use those
hooks.

What's the best way to achieve this? Add a new __builtin_div (and
__builtin_mod) that tree-vectorize can find, perhaps? Or something else?


What do you want to do?  Vectorize the out-of-line libgcc copy?  Or
emit inline vectorized code for int/softfp operations?  In the latter
case just emit the code from the pattern expanders?


I'd like to investigate having vectorized versions of the libgcc 
instruction functions, like we do for libm.


The inline code expansion is certainly an option, but I think there's 
quite a lot of code in those routines. I know how to do that option at 
least (except, maybe not the errno handling without making assumptions 
about the C runtime).


Basically, the -ffast-math instructions will always be the fastest way, 
but the goal is that the default optimization shouldn't just disable 
vectorization entirely for any loop that has a divide in it.


Andrew


Re: Libgcc divide vectorization question

2023-03-22 Thread Andrew Stubbs

On 22/03/2023 13:56, Richard Biener wrote:

Basically, the -ffast-math instructions will always be the fastest way,
but the goal is that the default optimization shouldn't just disable
vectorization entirely for any loop that has a divide in it.


We try to express division as multiplication, but yes, I think there's
currently no way to tell the vectorizer that vectorized division is
available as libcall (nor for any other arithmetic operator that is not
a call in the first place).


I have considered creating a new builtin code, similar to the libm 
functions, that would be enabled by a backend hook, or maybe just if 
TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION doesn't return NULL. The 
vectorizer would then use that, somehow. To treat it just like any other 
builtin it would have to be set before the vectorizer pass encounters 
it, which is probably not ideal for all the other passes that want to 
handle divide operators. Alternatively, the vectorizable_operation 
function could detect and introduce the builtin where appropriate.


Would this be acceptable, or am I wasting my time planning something 
that would get rejected?


Thanks

Andrew


Re: Clarification on newlib version for building AMDGCN offloading backend

2023-03-30 Thread Andrew Stubbs

On 29/03/2023 19:18, Wileam Yonatan Phan wrote:

Hi Andrew,

I just built GCC 12.2.0 with AMDGCN offloading successfully with Spack!
However, when I tried to test it with an OpenACC test code that I have, I 
encountered the following error message:

wyp@basecamp:~/work/testcodes/f90-acc-ddot$ gfortran -fopenacc 
-foffload=amdgcn-unknown-amdhsa="-march=gfx900" ddot.f90
as: unrecognized option '-triple=amdgcn--amdhsa'
mkoffload: fatal error: x86_64-pc-linux-gnu-accel-amdgcn-unknown-amdhsa-gcc 
returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: 
/home/wyp/work/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-12.2.0/gcc-12.2.0-w7lclfarefmge3uegn2a5vw37bnwhwto/libexec/gcc/x86_64-pc-linux-gnu/12.2.0//accel/amdgcn-unknown-amdhsa/mkoffload
 returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status


My guess is that it's trying to use the wrong assembler. Usually this 
means there is a problem with your installation procedure and/or your 
PATH. I think you should be able to investigate further using -v and/or 
strace. The correct one should be named 
$DESTDIR/usr/local/amdgcn-amdhsa/bin/as, but this will be different if 
you configured GCC with a custom --prefix location. If you have 
relocated the toolchain since installation then the toolchain will 
attempt to locate libraries and tools relative to the gcc binary. If it 
does not find them there then it looks in the "usual places", and those 
usually contain an "as" suitable only for the host system.


If you find an error on the Wiki instructions please let me know and I 
will correct them.


Andrew


Re: Test with an lto-build of libgfortran.

2023-09-29 Thread Andrew Stubbs

On 28/09/2023 20:59, Toon Moene wrote:

On 9/28/23 21:26, Jakub Jelinek wrote:


It is worse than that, usually the LTO format changes e.g. any time any
option or parameter is added on a release branch (several times a year)
and at other times as well.
Though, admittedly GCC is the single package that actually could get away
with LTO in lib*.a libraries, at least in some packagings (if the static
libraries are in gcc specific subdirectories rather than say /usr/lib{,64}
or similar and if the packaging of gcc updates both the compiler and
corresponding static libraries in a lock-step.  Because in that case LTO
in there will be always used only by the same snapshot from the release
branch and so should be compatible with the LTO in it.
This might be an argument to make it a configure option, e.g.
--enable-lto-runtime.


This sort of thing should definitely Just Work for cross compilers and 
embedded platforms where the libraries are bundled with the compiler.


Andrew


Register allocation cost question

2023-10-10 Thread Andrew Stubbs

Hi all,

I'm trying to add a new register set to the GCN port, but I've hit a 
problem I don't understand.


There are 256 new registers (each a 2048-bit vector register) but the 
register file has to be divided between all the running hardware 
threads; if you can use fewer registers you can get more parallelism, 
which means that it's important that they're allocated in order.


The problem is that they're not allocated in order. Somehow the IRA pass 
is calculating different costs for the registers within the class. It 
seems to prefer registers a32, a96, a160, and a224.


The internal regnos are 448, 512, 576, 640. These are not random numbers! 
They all have zero for the 6 LSB.


What could cause this? Did I overrun some magic limit? What target hook 
might I have miscoded?


I'm also seeing wrong-code bugs when I allow more than 32 new registers, 
but that might be an unrelated problem. Or the allocation is broken? I'm 
still analyzing this.


If it matters, ... the new registers can't be used for general purposes, 
so I'm trying to set them up as a temporary spill destination. This 
means they're typically not busy. It feels like it shouldn't be this 
hard... :(


Thanks in advance.

Andrew


Re: Register allocation cost question

2023-10-11 Thread Andrew Stubbs




On 10/10/2023 20:09, Segher Boessenkool wrote:

Hi Andrew,

On Tue, Oct 10, 2023 at 04:11:18PM +0100, Andrew Stubbs wrote:

I'm also seeing wrong-code bugs when I allow more than 32 new registers,
but that might be an unrelated problem. Or the allocation is broken? I'm
still analyzing this.


It could be connected.  both things should not happen.


If it matters, ... the new registers can't be used for general purposes,


What does this mean?  I think you mean they *can* be used for anything,
you just don't want to (maybe it is slow)?  If you make it allocatable
registers, they *will* be allocated for anything the compiler deems a
good idea.


Nope, the "Accelerator VGPR" registers are exclusively for the use of 
the new matrix multiply instructions that we don't support (yet).


The compiler is free to use them for storing data, but there are no real 
instructions to operate on the data there.



so I'm trying to set them up as a temporary spill destination. This
means they're typically not busy. It feels like it shouldn't be this
hard... :(


So what did you do, put them later in the allocation order?  Make their
register_move_cost higher than for normal registers (but still below
memory_move_cost)?  Or what?  TARGET_SPILL_CLASS maybe?


We put them in a new register class, with a new constraint, and 
implemented the move instructions (only) with new alternatives for the 
new class. Then implemented TARGET_SPILL_CLASS in the obvious way.
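
For reference, "the obvious way" is essentially the following -- a 
sketch with placeholder class names, not the literal port code:

  /* Offer the new accelerator-VGPR class as a place to spill pseudos
     that would otherwise go to memory.  */
  static reg_class_t
  gcn_spill_class (reg_class_t c, machine_mode mode ATTRIBUTE_UNUSED)
  {
    if (c == VGPR_REGS || c == SGPR_REGS)
      return AVGPR_REGS;        /* placeholder name for the new class  */
    return NO_REGS;
  }

  #undef  TARGET_SPILL_CLASS
  #define TARGET_SPILL_CLASS gcn_spill_class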


All this is working just fine as long as there are only 32 new registers 
unfixed (a0-a31); the code even runs correctly and I can see the 
spilling happening correctly.


If I enable register a32 then it prefers that, and I get wrong code. 
Using that register ought to be logically correct, albeit suboptimal, so 
I don't understand that either.


Andrew


Re: Register allocation cost question

2023-10-11 Thread Andrew Stubbs

On 11/10/2023 07:54, Chung-Lin Tang wrote:



On 2023/10/10 11:11 PM, Andrew Stubbs wrote:

Hi all,

I'm trying to add a new register set to the GCN port, but I've hit a
problem I don't understand.

There are 256 new registers (each 2048 bit vector register) but the
register file has to be divided between all the running hardware
threads; if you can use fewer registers you can get more parallelism,
which means that it's important that they're allocated in order.

The problem is that they're not allocated in order. Somehow the IRA pass
is calculating different costs for the registers within the class. It
seems to prefer registers a32, a96, a160, and a224.

The internal regno are 448, 512, 576, 640. These are not random numbers!
They all have zero for the 6 LSB.

What could cause this? Did I overrun some magic limit? What target hook
might I have miscoded?

I'm also seeing wrong-code bugs when I allow more than 32 new registers,
but that might be an unrelated problem. Or the allocation is broken? I'm
still analyzing this.

If it matters, ... the new registers can't be used for general purposes,
so I'm trying to set them up as a temporary spill destination. This
means they're typically not busy. It feels like it shouldn't be this
hard... :(


Have you tried experimenting with REG_ALLOC_ORDER? I see that the GCN port 
currently isn't using this target macro.


The default definition is 0,1,2,3,4 and is already the desired 
behaviour.


Andrew


Re: Register allocation cost question

2023-10-11 Thread Andrew Stubbs

On 10/10/2023 20:09, Segher Boessenkool wrote:

Hi Andrew,

On Tue, Oct 10, 2023 at 04:11:18PM +0100, Andrew Stubbs wrote:

I'm also seeing wrong-code bugs when I allow more than 32 new registers,
but that might be an unrelated problem. Or the allocation is broken? I'm
still analyzing this.


It could be connected.  both things should not happen.


This is now confirmed to be unrelated: the instruction moving values 
from the new registers to the old must be followed by a no-op in certain 
instruction combinations due to GCN having only partial hardware 
dependency detection.


The register allocation is therefore valid (at least in the testcases 
I've been looking at).


The question of why it prefers registers with round numbers remains open 
(and important for optimization reasons).


Andrew


Register allocation problem

2023-12-12 Thread Andrew Stubbs

Hi all,

I'm trying to solve an infinite loop in the "reload" pass (LRA). I need 
early-clobber on my load instructions and it goes wrong when register 
pressure is high.


Is there a proper way to fix this? Or do I need to do something "hacky" 
like fixing a register for use with reloads?


Here's the background.

AMD GCN has a thing called XNACK mode in which load instructions can be 
interrupted (by a page miss, for example) and therefore need to be 
written such that they are "restartable". This basically means that the 
output must not overwrite the input registers (it can happen that a load 
is partially successful, especially for vectors, but I believe 
overwriting the address and offsets is never safe, even for scalars). Up 
to now we've not needed this mode, but it will be needed for Unified 
Shared Memory (and theoretically for APU devices).


So I have added new alternatives into my machine description that set 
the early-clobber flag:


  [v   ,RF  ;flat ,*   ,12,*,off] flat_load%o1\t%0, %A1%O1%g1
  [&v  ,RF  ;flat ,*   ,12,*,on ] ^

(The "on" and "off" represent the XNACK mode.)

LRA then generates a register "Assignment" section in the dump, but it's 
not happy for some reason and generates another, and another, each with 
more and more pseudo registers and insns, and it goes on forever until 
the dump file is gigabytes and I kill it.


This is a vague description, sorry, because I don't really understand 
what's going on here and the dump files are huge with tens of thousands 
of pseudo registers to wade through. I'm hoping somebody recognises the 
issue without me spending days on it.


I have a workaround because there's no known failure on devices that 
have the AVGPR register file (they use it as spill space and therefore 
don't need the memory loads) and I actually don't need XNACK on the 
older devices at this time, but probably this is just pushing the 
problem further down the road so if there's a better solution then I'd 
like to find it.


Thanks in advance

Andrew


Re: [RFC] MAINTAINERS: require a BZ account field

2024-06-25 Thread Andrew Stubbs

On 24/06/2024 23:34, Arsen Arsenović via Gcc wrote:

I was also proposing (and would like to re-air that here) enforcing that
the committer field of each commit is a (valid) @gcc.gnu.org email.
This can be configured repo-locally via:

   $ git config committer.email @gcc.gnu.org

Git has supported this since 39ab4d0951ba64edcfae7809740715991b44fa6d
(v2.22.0).

This makes a permanent association of each commit to its authors
Sourceware account.

This should not inhibit pushes, as the committer should be a reflection
of who /applied/ a patch, and anyone applying a patch that can also push
has a Sourceware account.  It also should not inhibit any workflow, as
it should be automatic.


This will make it hard to a) find emails from a maintainer/committer in 
the mailing list archives -- since it's not generally possible to send 
"From:" a @gcc.gnu.org address -- and b) make it hard to compile 
statistics about contributions from corporate domains (which people do do).


I do not object to having the bugzilla account also listed in 
MAINTAINERS, although I've never had trouble finding people by name.


Andrew



Re: [RFC] MAINTAINERS: require a BZ account field

2024-06-25 Thread Andrew Stubbs

On 25/06/2024 10:05, Arsen Arsenović wrote:

Hi,

Andrew Stubbs  writes:


On 24/06/2024 23:34, Arsen Arsenović via Gcc wrote:

I was also proposing (and would like to re-air that here) enforcing that
the committer field of each commit is a (valid) @gcc.gnu.org email.
This can be configured repo-locally via:
$ git config committer.email @gcc.gnu.org
Git has supported this since 39ab4d0951ba64edcfae7809740715991b44fa6d
(v2.22.0).
This makes a permanent association of each commit to its authors
Sourceware account.
This should not inhibit pushes, as the committer should be a reflection
of who /applied/ a patch, and anyone applying a patch that can also push
has a Sourceware account.  It also should not inhibit any workflow, as
it should be automatic.


This will make it hard to a) find emails from a maintainer/committer in the
mailing list archives -- since it's not generally possible to send "From:" a
@gcc.gnu.org address -- and b) make it hard to compile statistics about
contributions from corporate domains (which people do do).


I'm not sure that is the case - the committer field is separate to the
author field, so all statistics one could do with the author field
remain unaltered.  For instance (as I've been doing that for a while),
here's what a commit of mine looks like under that scheme:

   commit 36cb7be477885a2464fe9a70467278c7debd5e79
   Author: Arsen Arsenović 
   AuthorDate: Thu Nov 16 23:50:30 2023 +0100
   Commit: Arsen Arsenović 
   CommitDate: Wed Dec 13 13:17:35 2023 +0100

   gettext: disable install, docs targets, libasprintf, threads

More often than not, the committer field is redundant with the author
field.

The email I use for correspondence is still present (and, in fact, is
the only one visible with the default git log and show formats).

It is possible that someone could be doing statistics based on the
committer field, if they also want to, say, count patches applied by
members of some company, but I'm not sure how wide-spread that is.


OK; fair point.

Andrew


Accessing the subversion repository

2005-02-15 Thread Andrew STUBBS
Hi,

Joern and I are having difficulty accessing the subversion test repository.

"svn co svn://svn.toolchain.org/svn/gcc/trunk" does not work due to the ST
corporate firewall (don't ask - the wheels turn slowly), so I have been
looking for an alternative approach.

Is there any alternative to straight svn protocol set up? I know subversion
supports them, but I haven't seen anything on the list or wiki announcing
it.

I have tried "svn co svn+shh://svn.toolchain.org/svn/gcc/trunk", but that
does not work either. I appear to have contacted the remote ssh server, but
the devtest username and password given do not seem to work. Are they are
for svn only?. Or am I missing some settings somewhere?

I have also tried "svn co http://svn.toolchain.org/svn/gcc/trunk";, but this
is just a guess and receives the response "302 Moved Temporarily" (once I
figured out .subversion/servers proxy settings). Is this just the wrong URL?
It might be that our proxy does not support DAV?

I can access the repository via web browser at
http://www.toolchain.org/websvn/listing.php?repname=gcc&path=%2F&sc=0 but
that isn't exactly what I had in mind.

The subversion book at red-bean seems to cover server setup, but is rather
light on client setup (not that I have read it in detail).

If no other method currently exists, is there any possibility of getting one
set up? Also, once (if) the subversion repository goes 'live' on
gcc.gnu.org, are there any plans to support the alternative protocols?

Thanks

--
Andrew Stubbs
[EMAIL PROTECTED]



RE: Accessing the subversion repository

2005-02-15 Thread Andrew STUBBS
> The ssh username is actually gcc, password foo2bar
> 
> so svn+ssh://[EMAIL PROTECTED]/gcc/trunk
> 
> would work (note for ssh, it's /gcc/trunk, not /svn/gcc/trunk. This is
> because it's running svnserve with a different root.  Just an 
> oversight,
> AFAIK :P)

Excellent. I now have a successful checkout. I have added this info to the
wiki as I suspect it will be important to more than just myself.

> I should note that svn treats it's remote connections as 
> disposable, so
> svn+ssh will probably connect more than once for things like remote
> diffs.  So if it takes a while to authenticate, this may not be your
> best bet if you are looking for blazing speed (as some seem to be :P).

Isn't there some way of setting up an svnserve daemon or something? I'm sure
I read that somewhere, or maybe I just misunderstood something somewhere.
Anyway, I can live with it for the moment.

Thanks a lot.

--
Andrew Stubbs
[EMAIL PROTECTED]



RE: Accessing the subversion repository

2005-02-15 Thread Andrew STUBBS
> > > I should note that svn treats it's remote connections as disposable,
> > > so svn+ssh will probably connect more than once for things like
> > > remote diffs.  So if it takes a while to authenticate, this may not
> > > be your best bet if you are looking for blazing speed (as some seem
> > > to be :P).
> > 
> > Isn't there some way of setting up an svnserve daemon or something?
> > I'm sure I read that somewhere, or maybe I just misunderstood
> > something somewhere.  Anyway, I can live with it for the moment.
> 
> This is the svnserve daemon (that's what svn:// and svn+ssh:// urls
> access). :)
> svnserve is the proprietary protocol like pserver.
> http uses DAV.

When accessing a server via ssh svn spawns an svnserve with the -t option,
does it not? I got the impression from somewhere that this could be made to
persist.

However, since svnserve clearly does persist when run as a local server (in
daemon mode, not inetd) it is clear where I could have gotten the wires
crossed.

Is it possible to run a local server as a proxy for a remote server, in
order to limit the number of password requests?



RE: Accessing the subversion repository

2005-02-17 Thread Andrew STUBBS
> Recent versions of openssh support multiple connections through one
> single authentication token (`master' connection)

That might work, but you need version OpenSSH 3.9 I think. svn.toolchain.org
is running 3.8.1p1 and gcc.gnu.org is running 3.6.1p2.

I assume both ends need to support it but I can't easily test that
assumption because I don't have 3.9 locally either.

Thanks anyway

-- 
Andrew Stubbs



Why does lower-subreg mark copied pseudos as "decomposable"?

2012-04-17 Thread Andrew Stubbs

Hi all,

I can see why copying from one pseudo-register to another would not be a 
reason *not* to decompose a register, but I don't understand why this is 
a reason to say it *should* be decomposed.


This is causing me trouble, and I can't tell how to fix it without 
figuring out why it is this way in the first place.


My testcase is from pr43137. This was an ARM missed-optimization bug 
that was fixed some time ago, but has recurred (in my tree) because I'm 
trying to implement SI->DImode extend into 64-bit NEON registers.


Here are the problem insns:

(insn 7 6 8 2 (set (reg:DI 137)
(sign_extend:DI (reg/v:SI 134 [ resultD.4946 ]))) pr43137.c:8
158 {extendsidi2}
 (nil))

(insn 8 7 12 2 (set (reg:DI 136 [  ])
(reg:DI 137)) pr43137.c:8 641 {*movdi_vfp}
 (nil))

(insn 12 8 15 2 (set (reg/i:DI 0 r0)
(reg:DI 136 [  ])) pr43137.c:9 641 {*movdi_vfp}
 (nil))

Lower-subreg thinks it should decompose pseudo 136 because there is a 
pseudo-to-pseudo copy (137->136), even though there is no use of subregs 
here.


The decomposition ends up preventing register allocation from allocating 
r0 to pseudo-137, and we get an unnecessary move emitted and a 
regression of pr43137.



So, why do we have this code?

[lower-subreg.c, find_decomposable_subregs]

case SIMPLE_PSEUDO_REG_MOVE:
  if (MODES_TIEABLE_P (GET_MODE (x), word_mode))
bitmap_set_bit (decomposable_context, regno);
  break;

If I remove these lines my problems go away.

Any clues would be appreciated.

Thanks

Andrew


Re: Why does lower-subreg mark copied pseudos as "decomposable"?

2012-04-17 Thread Andrew Stubbs

On 17/04/12 18:20, Richard Sandiford wrote:

Andrew Stubbs  writes:

Hi all,

I can see why copying from one pseudo-register to another would not be a
reason *not* to decompose a register, but I don't understand why this is
a reason to say it *should* be decomposed.


The idea is that, if a backend implements an N-word pseudo move using
N word-mode moves, it is better to expose those moves before register
allocation.  It's easier for RA to find N separate word-mode registers
than a single contiguous N-word one.


Ok, I think I understand that, but it seems slightly wrong to me.

It makes sense to lower *real* moves, but before the fwprop pass there 
are quite a lot of pseudos that only exist as artefacts of the expand 
process. Moving the subreg1 pass after fwprop1 would probably do the 
trick, but that would probably also defeat the object of lowering early.


I've done a couple of experiments:

First, I tried adding an extra fwprop pass before subreg1. I needed to 
move up the dfinit pass also to make that work, but then it did work: it 
successfully compiled my testcase without a regression.


I'm not sure that adding an extra pass isn't overkill, so second I tried 
adjusting lower-subreg to avoid this problem; I modified 
find_pseudo_copy so that it rejected copies that didn't change the mode, 
on the principle that fwprop would probably have eliminated the move 
anyway. This was successful also, and a much less expensive change.


Does that make sense? The pseudos involved in the move will still get 
lowered if the other conditions hold.



The problem is the "if a backend implements ..." bit: the current code
doesn't check.  This patch:

 http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00094.html

should help.  It's still waiting for me to find a case where the two
possible ways of handling hot-cold partitioning behave differently.


I've not studied that patch in detail, but I'm not sure it'll help. In 
most cases, including my testcase, lowering is the correct thing to do 
if NEON (or IWMMXT, perhaps) is not enabled. When NEON is enabled, 
however, it may still be the right thing to do: NEON does not provide a 
full set of DImode operations. The test for subreg-only uses ought to be 
enough to differentiate, once the extraneous pseudos such as the one in 
my testcase have been dealt with.


Anyway, please let me know what you think of my solutions above, and 
I'll cook up a patch if they're ok.


Andrew


Re: Why does lower-subreg mark copied pseudos as "decomposable"?

2012-04-18 Thread Andrew Stubbs

On 18/04/12 11:55, Richard Sandiford wrote:

The problem is that not all register moves are always going to be
eliminated, even when no mode changes are involved.  It might make
sense to restrict that code you quoted:

case SIMPLE_PSEUDO_REG_MOVE:
  if (MODES_TIEABLE_P (GET_MODE (x), word_mode))
bitmap_set_bit (decomposable_context, regno);
  break;

to the second pass though.


Yes, I thought of that, but I dismissed it because the second pass is 
really very late. It would be just in time to take advantage of the 
relaxed register allocation, but would miss out on all the various 
optimizations that forward-propagation, combining, and such can offer.


This is why I've tried to find a way to do something about it in the 
first pass. I thought it made sense to do something for non-no-op 
moves (when is there ever such a thing, btw, without it being an extend, 
truncate, or subreg?), but the no-op moves are trickier.


Perhaps a combination of the two ideas? Decompose mode-changing moves in 
the first pass, and all moves in the second?


BTW, the lower-subreg pass has a forward propagation concept of its own. 
If I read it right, even with the above changes, it will still decompose 
the move if the register it copies from has been decomposed, and the 
register it copies to is not marked 'non-decomposable'.


Hmm, I'm going to try to come up with some testcases that demonstrate 
the different cases and see if that helps me think about it. Do you 
happen to have any to hand?



I've not studied that patch in detail, but I'm not sure it'll help. In
most cases, including my testcase, lowering is the correct thing to do
if NEON (or IWMMXT, perhaps) is not enabled.


Right.  I think I misunderstood, sorry.  I thought this regression was
for NEON only, but do you mean that adding these NEON patterns introduces
the regression for non-NEON targets as well?


No, you were right, the regression only occurs when NEON is enabled. 
Otherwise the machine description behaves exactly as it used to.



When NEON is enabled, however, it may still be the right thing to do:
NEON does not provide a full set of DImode operations. The test for
subreg-only uses ought to be enough to differentiate, once the
extraneous pseudos such as the one in my testcase have been dealt
with.


OK.  If/when that patches goes in, the ARM backend is going to have
to pick an rtx cost for DImode SETs.  It sounds like the cost will need
to be twice an SImode move regardless of whether or not NEON is enabled.


That sounds reasonable. Of course, how much a register move costs is a 
tricky subject for NEON anyway. :(


Andrew



Re: Why does lower-subreg mark copied pseudos as "decomposable"?

2012-04-18 Thread Andrew Stubbs

On 18/04/12 16:53, Richard Sandiford wrote:

Andrew Stubbs  writes:

On 18/04/12 11:55, Richard Sandiford wrote:

The problem is that not all register moves are always going to be
eliminated, even when no mode changes are involved.  It might make
sense to restrict that code you quoted:

case SIMPLE_PSEUDO_REG_MOVE:
  if (MODES_TIEABLE_P (GET_MODE (x), word_mode))
bitmap_set_bit (decomposable_context, regno);
  break;

to the second pass though.


Yes, I thought of that, but I dismissed it because the second pass is
really very late. It would be just in time to take advantage of the
relaxed register allocation, but would miss out on all the various
optimizations that forward-propagation, combining, and such can offer.

This is why I've tried to find a way to do something about it in the
first pass. I thought it makes sense to do something for none-no-op
moves (when is there such a thing, btw, without it being and extend,
truncate, or subreg?),


AFAIK there isn't, which is why I'm a bit unsure what you're suggesting.


And why I don't understand what the current code is trying to achieve.


Different modes like DI and DF can both be stored in NEON registers,
so if you have a situation where one is punned into the other,
I think that's an even stronger reason to want to keep them together.


Does the compiler use pseudo-reg copies for that? I thought it mostly 
just referred to the same register with a different mode and everything 
just DTRT.


OK, let's go back to the start: at first sight, the lower-subregs pass 
decomposes every pseudo-register that is larger than a core register and 
is only defined or used via subreg or a simple copy, or that is a copy of 
a decomposed register and has no non-decomposable features itself 
(forward propagation). It does not deliberately decompose 
pseudo-registers that are only copies from or to a hard-register, even 
though there's nothing intrinsically non-decomposable about that 
(besides that there's no benefit), but it can happen if forward 
propagation occurs. It explicitly does not decompose any pseudo that is 
used in a non-move DImode operation.


All this makes sense to me: if the backend is written such that DImode 
operations are expanded in terms of SImode subregs, then it's better to 
think of the subregs independently. (On ARM, this *is* the case when 
NEON is disabled.)


But then there's this extra "feature" that a pseudo-to-pseudo copy 
triggers both pseudo registers to be considered decomposable (unless 
there's some other use that prohibits it), and I don't know why?


Yes, I understand that a move from NEON to core might benefit from this, 
but those don't exist before reload. I also theorized that moves that 
convert to some other kind of mode might be interesting (the existing 
code checks for "tieable" modes, presumably with reason), but I can't 
come up with a valid example (mode changes usually require a non-move 
operation of some kind).


In fact, the only examples of a pseudo-pseudo copy that won't be 
eliminated by fwprop et al would be to do with loops and conditionals, 
and I don't understand why they should be special.


The result of this extra feature is that if I copy the output of a 
DImode insn *directly* to a DImode hard reg (say a return value) then 
there's no decomposition, but if the expand pass happens to have put an 
intermediate pseudo register (as it does do) then this extra rule 
decomposes it most unhelpfully (ok, there's only actually a problem if 
the compiler can reason that one subreg or the other is unchanged, as is 
the case with sign_extend).


So, after having thought all this through again, unless somebody can 
show why not, I propose that we remove this mis-feature entirely, or at 
least disable it in the first pass.


Andrew


Re: Why does lower-subreg mark copied pseudos as "decomposable"?

2012-04-19 Thread Andrew Stubbs

On 18/04/12 21:47, Richard Sandiford wrote:

I don't think the idea is that these cases are special in themselves.
What we're looking for are pseudos that _may_ be decomposed into
separate registers.  If one of the pseudos in the move is only used in
decomposable contexts (including nonvolatile loads and stores, as well
as copies to and from hard registers, etc.), then we may be able to
completely replace the original pseudo with two smaller ones.  E.g.:

 (set (reg:DI X) (mem:DI ...))
 ...
 (set (reg:DI Y) (reg:DI X))

In this case, X can be completely replaced by two SImode registers.

What isn't clear to me is why we don't seem to do the same for:

 (set (reg:DI X) (mem:DI ...))
 (set (mem:DI ...) (reg:DI X))


My reading would be: if the backend wanted to lower these then it would 
have expanded the load differently in the first place. The fact that the 
machine description says "reg:DI", rather than "subreg:SI (reg:DI" shows 
that the 64-bit load is the more-optimal method (at least in the view of 
the author).


(I have the same complaint about the lower-subreg zero-extend 
special-handling: if the backend wanted it lowered that way, why didn't 
it code it that way?)


Look at pr43137; the old code was sub-optimal and was replaced with a 
pre-lowered version. (Admittedly, it was the pseudo-to-pseudo bug that 
highlighted the problem.) Now, I want to indicate the opposite when 
NEON is available: that the preferred form for an extend is the 
non-lowered form, but this mysterious copy-rule has defeated me.



Perhaps we do and I'm just misreading the code.  Or perhaps it's just
too hard to get the costs right.  Splitting that would be moving even
further from what you want though :-)


No, I don't think you misunderstood it: in the example, X is used as 
"reg:DI" and that should prohibit lowering (even if there's a subreg use 
somewhere else).


Here's another example that worries me:

(set (reg:DI X) (mem:DI ...))
...
(set (reg:DI Y) (reg:DI X))
...
(set (...) (...:SI (subreg:SI (reg:DI Y) 0)))
(set (...) (...:SI (subreg:SI (reg:DI Y) 4)))

Without the copy, there would be no lowering in this example. With the 
copy (and let's remember that they're not added deliberately; they're 
just artefacts of the expand process) the Y register will be lowered, 
but X will not be.


It worries me that these "optimizations" are relying on 
"undocumented"/"undefined"/unsomethinged behaviour (at best). In this 
example it probably won't make any difference to the register 
allocation, but in pr43137 it was this pattern, combined with hard-regs 
and sign_extend being known to modify only one of the subregs, that 
caused the trouble.



The result of this extra feature is that if I copy the output of a
DImode insn *directly* to a DImode hard reg (say a return value) then
there's no decomposition, but if the expand pass happens to have put an
intermediate pseudo register (as it does do) then this extra rule
decomposes it most unhelpfully (ok, there's only actually a problem if
the compiler can reason that one subreg or the other is unchanged, as is
the case with sign_extend).


But remember that this pass is not designed for targets like NEON that
have lots of native double-word operations.  It's designed for targets
that don't.  I think you said earlier that your testcase was handled
correctly for non-NEON, which is the point: decomposing in that case
_may_ be a benefit (if we end up being able to replace all uses of a
doubleword pseudo with uses of 2 word pseudos) and should be no worse
otherwise.


True, but there ought to be a way to achieve the goal that doesn't 
penalize targets that have a few doubleword operations?


I'm coming to the conclusion that this (and the zero_extend/shift 
special-handling) is some attempt to paper over the shortcomings of 
some backends (ok, probably all, somewhere) that don't expand to say 
what they mean, but instead rely on split, or have insns that output 
multiple instructions.


Lowering a register that is only accessed via (subreg:SI (reg:DI ..)) 
is, of course, a useful thing to do (it's what the machine description 
is *asking* for), and we should do that after both expand and split, but 
these extra "smarts" are deeply suspicious.



So, after having thought all this through again, unless somebody can
show why not, I propose that we remove this mis-feature entirely, or at
least disable it in the first pass.


I still prefer the idea of disabling in the first pass.  It'll need to
be tested on something like non-NEON ARM to see whether it makes things
worse or better there.  (I think size testing would be fine.)


I'll have a go, and see what happens.

Andrew


Re: Why does lower-subreg mark copied pseudos as "decomposable"?

2012-05-04 Thread Andrew Stubbs

On 19/04/12 17:36, Andrew Stubbs wrote:

On 18/04/12 21:47, Richard Sandiford wrote:

I still prefer the idea of disabling in the first pass. It'll need to
be tested on something like non-NEON ARM to see whether it makes things
worse or better there. (I think size testing would be fine.)


I'll have a go, and see what happens.


So far I've found that many examples give smaller code with this change, 
and a few examples that give larger code. However, on average it appears 
to give better code, size wise. This is on ARM when NEON is not enabled; 
when NEON is enabled the results are far better, as expected.


I did have a small example that showed much worse register allocation, 
but I can't reproduce that with the latest trunk.


Most of the size reductions can be explained by use of 64-bit loads and 
stores, rather than pairs of 32-bit accesses.


In Thumb mode, one cause of size increases appears to be that, while the 
instruction count is no higher, 32-bit opcodes have been used rather 
than 16-bit ones; this is unfortunate.


Otherwise, it's very difficult to identify where the tiny size increases 
come from.


As an example, I compiled (a slightly old copy of) gcc/expmed.c which 
contains a lot of 64-bit operations, and compared the output sizes at 
-O2. Of 43 functions, 37 showed no change whatsoever, 5 showed a reduction 
(21 bytes on average), and 1 function showed a 20-byte increase.


The end result is that I'm going to try to produce a proper patch to post.

Andrew


Re: gcc-get enabling-only subscription?

2008-05-13 Thread Andrew STUBBS

Joern Rennecke wrote:

  You could sub up to the digest mode, which might at least be less of a
burden,



It would reduce the number of messages, but the volume would still be
very high.


Hi Joern,

You could just sign up to one of the online mailing list services. Here's 
the Nabble link for this thread:


http://www.nabble.com/gcc-get-enabling-only-subscription--to17209319.html

You need to register to reply, of course.

HTH

Andrew


Re: Libmudflap for sh-elf toolchain cannot access environment variable MUDFLAP_OPTIONS

2007-06-15 Thread Andrew STUBBS

Deepen Mantri wrote:
How to make x86/linux shell's environment variable 
(MUDFLAP_OPTIONS) accessible to test.out while executing
it through the sh-elf simulator? 


I don't know about other targets, but the SH newlib/crt/simulator 
doesn't do anything with the environment.


You could spend ages modifying the simulator and C runtime library to 
have the environment copied to the target.


It's probably easier to place a putenv("MUDFLAP_OPTIONS=blah") in your 
code, or inject it from the debugger.
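
For example, something like this near the start of the program (the 
option string is purely illustrative, and this assumes the environment 
is read after main starts on your target; otherwise a constructor would 
be needed):

#include <stdlib.h>

int
main (void)
{
  /* Illustrative option string only; putenv keeps the pointer, so use
     a static buffer or a literal that stays live.  */
  static char opts[] = "MUDFLAP_OPTIONS=-print-leaks";
  putenv (opts);

  /* ... the rest of the instrumented program ... */
  return 0;
}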


Andrew


Re: Libmudflap for sh-elf toolchain cannot access environment variable MUDFLAP_OPTIONS

2007-06-15 Thread Andrew STUBBS

Deepen Mantri wrote:

We cannot place putenv("MUDFLAP_OPTIONS=<..>") in
libmudflap's __mf_init() function existing in mf-runtime.c.
Placing putenv(..) will limit the instrumented code's 
runtime behaviour only to option being set in the code by me.


Well no, that would be silly - you might as well cut out the whole 
environment read entirely. Put it somewhere else - main perhaps.


If you don't want to do that, you still have the option of injecting the 
call via the debugger:


(gdb) break main
(gdb) continue

(gdb) call putenv("MUDFLAP_OPTIONS=<..>")
(gdb) continue

Of course, you have to have putenv linked into the program, but that can 
be arranged easily enough.


Andrew


Re: [RFC] Enabling SVE with offloading to nvptx

2024-11-04 Thread Andrew Stubbs

[...skip literally unreadable deeply nested conversation...]

A couple of years ago I posted a patch to this same code solving a 
performance problem with x86_64/amdgcn offloading:

https://patchwork.sourceware.org/project/gcc/patch/0e1a740e-46d5-ebfa-36f4-9a069ddf8...@codesourcery.com/

At that time, the patch was rejected, and I didn't have time to make the 
requested edits due to higher priorities.


Co-incidentally, I just started working on this again a week or so ago. 
I now have a patch series nearly ready to go (I was just about to work 
on adding some testcases), but on reading the list this morning I find 
that it conflicts with your patch already posted.


Some of the patch is nearly identical (such as a new IFN with the 
obvious not-an-expander), but not quite. Instead of adding a whole new 
pass I have simply enabled ompdevlow for this case. Also, I didn't need 
to do anything about SIMT for my usecase.


BabelStream "dot" benchmark (gfx90a):

Baseline:   364541 MBytes/sec
Your patch: 354892 MBytes/sec -- same, within noise
My patches: 574802 MBytes/sec -- 1.6x speedup[*]

So, your patch doesn't fix my problem, and I imagine my patch doesn't 
fix your problem (because max_vf remains "1" when offloading to SIMT 
devices).


Only patch 1/3 is actually needed to fix my benchmark. The other two are 
increasingly thorough handling of the other cases.


To do this thing perfectly I think we need to delay the SIMT cases as 
well, so as not to hurt AArch64 hosts, but I still need to figure out 
why your solution is not working for me.


Andrew


[*] My original post claimed a 10x speedup, but that was when amdgcn 
only had V64 vector modes, so setting "max_vf = 16" resulted in total 
vectorizer failure. Now that amdgcn has V16 modes the baseline result is 
much better, but max_vf really does need to be 64.

From 69db90d5639c4ce082136eee032ab63a61a32035 Mon Sep 17 00:00:00 2001
From: Andrew Stubbs 
Date: Mon, 21 Oct 2024 12:29:54 +
Subject: [PATCH 1/3] openmp: Tune omp_max_vf for offload targets

If requested, return the vectorization factor appropriate for the offload
device, if any.

This change gives a significant speedup in the BabelStream "dot" benchmark on
amdgcn.

The omp_adjust_chunk_size usecase is set "false", for now, but I intend to
change that in a follow-up patch.

Note that NVPTX SIMT offload does not use this code-path.

gcc/ChangeLog:

	* gimple-loop-versioning.cc (loop_versioning::loop_versioning): Set
	omp_max_vf to offload == false.
	* omp-expand.cc (omp_adjust_chunk_size): Likewise.
	* omp-general.cc (omp_max_vf): Add "offload" parameter, and detect
	amdgcn offload devices.
	* omp-general.h (omp_max_vf): Likewise.
	* omp-low.cc (lower_rec_simd_input_clauses): Pass offload state to
	omp_max_vf.
---
 gcc/gimple-loop-versioning.cc |  2 +-
 gcc/omp-expand.cc |  2 +-
 gcc/omp-general.cc| 17 +++--
 gcc/omp-general.h |  2 +-
 gcc/omp-low.cc|  3 ++-
 5 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/gcc/gimple-loop-versioning.cc b/gcc/gimple-loop-versioning.cc
index 107b0020024..2968c929d04 100644
--- a/gcc/gimple-loop-versioning.cc
+++ b/gcc/gimple-loop-versioning.cc
@@ -554,7 +554,7 @@ loop_versioning::loop_versioning (function *fn)
  handled efficiently by scalar code.  omp_max_vf calculates the
  maximum number of bytes in a vector, when such a value is relevant
  to loop optimization.  */
-  m_maximum_scale = estimated_poly_value (omp_max_vf ());
+  m_maximum_scale = estimated_poly_value (omp_max_vf (false));
   m_maximum_scale = MAX (m_maximum_scale, MAX_FIXED_MODE_SIZE);
 }
 
diff --git a/gcc/omp-expand.cc b/gcc/omp-expand.cc
index b0b4ddf5dbc..907fd46a5b2 100644
--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -212,7 +212,7 @@ omp_adjust_chunk_size (tree chunk_size, bool simd_schedule)
   if (!simd_schedule || integer_zerop (chunk_size))
 return chunk_size;
 
-  poly_uint64 vf = omp_max_vf ();
+  poly_uint64 vf = omp_max_vf (false);
   if (known_eq (vf, 1U))
 return chunk_size;
 
diff --git a/gcc/omp-general.cc b/gcc/omp-general.cc
index f74b9bf5e96..223f6037270 100644
--- a/gcc/omp-general.cc
+++ b/gcc/omp-general.cc
@@ -987,10 +987,11 @@ find_combined_omp_for (tree *tp, int *walk_subtrees, void *data)
   return NULL_TREE;
 }
 
-/* Return maximum possible vectorization factor for the target.  */
+/* Return maximum possible vectorization factor for the target, or for
+   the OpenMP offload target if one exists.  */
 
 poly_uint64
-omp_max_vf (void)
+omp_max_vf (bool offload)
 {
   if (!optimize
   || optimize_debug
@@ -999,6 +1000,18 @@ omp_max_vf (void)
 	  && OPTION_SET_P (flag_tree_loop_vectorize)))
 return 1;
 
+  if (ENABLE_OFFLOADING && offload)
+{
+  for (const char *c = getenv ("OFFLOAD_TARGET_NAMES"); c;)
+	{
+	  if (startswith (c, "amdgcn"

Re: [RFC] Enabling SVE with offloading to nvptx

2024-11-12 Thread Andrew Stubbs

On 12/11/2024 06:01, Prathamesh Kulkarni via Gcc wrote:




-Original Message-
From: Jakub Jelinek 
Sent: 04 November 2024 21:44
To: Prathamesh Kulkarni 
Cc: Richard Biener ; Richard Biener
; gcc@gcc.gnu.org; Thomas Schwinge

Subject: Re: [RFC] Enabling SVE with offloading to nvptx


On Sat, Nov 02, 2024 at 03:53:34PM +, Prathamesh Kulkarni wrote:

The attached patch adds a new bitfield needs_max_vf_lowering to loop, 
and sets that in expand_omp_simd for loops that need delayed lowering 
of safelen and omp simd arrays.  The patch defines a new macro 
OMP_COMMON_MAX_VF (arbitrarily set to 16), as a placeholder value for 
max_vf (instead of INT_MAX), and is later replaced by appropriate 
max_vf during omp_adjust_max_vf pass.  Does that look OK ?


No.
The thing is, if user doesn't specify safelen, it defaults to infinity
(which we represent as INT_MAX), if user specifies it, then that is
the maximum for it (currently in OpenMP specification it is just an
integral value, so can't be a poly int).
And then the lowering uses the max_vf as another limit, what the hw
can do at most and sizes the magic arrays with it.  So, one needs to
use minimum of what user specified and what the hw can handle.
So using 16 as some magic value is just wrong, safelen(16) can be
specified in the source as well, or safelen(8), or safelen(32) or
safelen(123).

Thus, the fact that the hw minimum hasn't been determined yet needs to
be represented in some other flag, not in loop->safelen value, and
before that is determined, loop->safelen should then represent what
the user wrote (or was implied) and the later pass should use minimum
from loop->safelen and the picked hw maximum.  Of course if the picked
hw maximum is POLY_INT-ish, the big question is how to compare that
against the user supplied integer value, either one can just handle
the INT_MAX (aka
infinity) special case, or say query the backend on what is the
maximum value of the POLY_INT at runtime and only use the POLY_INT if
it is always known to be smaller or equal to the user supplied
safelen.

Another thing (already mentioned in the thread Andrew referenced) is
that max_vf is used in two separate places.  One is just to size of
the magic arrays and one of the operands of the minimum (the other is
user specified safelen).  In this case, it is generally just fine to
pick later value than strictly necessary (as long as it is never
larger than user supplied safelen).
The other case is simd modifier on schedule clause.  That value should
better be the right one or slightly larger, but not too much.
I think currently we just use the INTEGER_CST we pick as the maximum,
if this sizing is deferred, maybe it needs to be another internal
function that asks the value (though, it can refer to a loop vf in
another function, which complicates stuff).

Regarding Richi's question, I'm afraid the OpenMP simd loop lowering
can't be delayed until some later pass.

Hi Jakub,
Thanks for the suggestions! The attached patch makes the following changes:
(1) Delays setting of safelen for offloading by introducing a new bitfield 
needs_max_vf_lowering in loop, which is true with offloading enabled,
and safelen is then set to min(safelen, max_vf) for the target later in 
omp_device_lower pass.
Comparing user-specified safelen with poly_int max_vf may not be always 
possible at compile-time (say 32 and 16+16x),
and even if we determine runtime VL based on -mcpu flags, I guess relying on 
that won't be portable ?
The patch works around this by taking constant_lower_bound (max_vf), and 
comparing it with safelen instead, with the downside
that constant_lower_bound(max_vf) will not be the optimal max_vf for SVE target if 
it implements SIMD width > 128 bits.

(2) Since max_vf is used as length of omp simd array, it gets streamed out to 
device, and device compilation fails during streaming-in if max_vf
is poly_int (16+16x), and device's NUM_POLY_INT_COEFFS < 2 (which motivated my 
patch). The patch tries to address this by simply setting length to a
placeholder value (INT_MAX?) in lower_rec_simd_input_clauses if offloading is 
enabled, and will be later set to appropriate value in omp_device_lower pass.

(3) Andrew's patches seems to already fix the case for adjusting chunk_size for 
schedule clause with simd modifier by introducing a new internal
function .GOMP_MAX_VF, which is then replaced by target's max_vf. To keep it 
consistent with safelen, the patch here uses constant_lower_bound (max_vf) too.

Patch passes libgomp testing for AArch64/nvptx offloading (with and without 
GPU).
Does it look OK ?


I've not reviewed the patch in detail, but I can confirm that this does 
not break my usecase or cause any test regressions, for me.


However, are you sure that ompdevlow is always running? I think you need 
to add this, somewhere:


  cfun->curr_properties &= ~PROP_gimple_lomp_dev;

My patch added this into omp_adjust_chunk_size, wh

Re: Help for git send-email setting

2025-01-14 Thread Andrew Stubbs

On 13/01/2025 01:27, Hao Liu via Gcc wrote:

Hi,

I'm new to GCC community, and try to contribute some patches.
I am having trouble setting git send-email with Outlook on Linux. Does anyone 
have any successful experiences to share?


I assume from your email address that you're referring to the 
outlook.com webmail (i.e. what was hotmail), rather than a corporate 
Exchange server with the Outlook desktop client.


The webmail settings page (gear icon) has a "Forwarding and IMAP" tab. You 
need to enable either POP or IMAP -- it doesn't matter which you choose 
unless you actually want to read email in an external client -- and then 
click "View POP, IMAP and SMTP settings".


The SMTP settings are what git send-email needs to work correctly.
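
Concretely, the values from that page end up in ~/.gitconfig as 
something like this (the server name and port here are from memory, so 
double-check them against what the settings page actually shows):

[sendemail]
	smtpServer = smtp-mail.outlook.com
	smtpServerPort = 587
	smtpEncryption = tls
	smtpUser = your-address@outlook.com

git send-email will then prompt for the password each time, unless you 
configure a credential helper.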

I do not know if your system can support the authentication methods used 
by outlook.com. Git uses a Perl library for this, so the documentation 
is a little vague.


Andrew


Re: [RFC] Enabling SVE with offloading to nvptx

2025-01-02 Thread Andrew Stubbs

On 27/12/2024 12:29, Prathamesh Kulkarni wrote:




-Original Message-
From: Jakub Jelinek 
Sent: 17 December 2024 19:09
To: Prathamesh Kulkarni 
Cc: Andrew Stubbs ; Richard Biener
; Richard Biener ;
gcc@gcc.gnu.org; Thomas Schwinge 
Subject: Re: [RFC] Enabling SVE with offloading to nvptx


On Mon, Dec 02, 2024 at 11:17:08AM +, Prathamesh Kulkarni wrote:

--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -233,6 +233,12 @@ public:
   flag_finite_loops or similar pragmas state.  */
unsigned finite_p : 1;

+  /* True if SIMD loop needs delayed lowering of artefacts like
+ safelen and length of omp simd arrays that depend on target's
+ max_vf.  This is true for offloading, when max_vf is computed

after

+ streaming out to device.  */
+  unsigned needs_max_vf_lowering: 1;


Consistency, finite_p above uses space before :, the above line
doesn't.


--- a/gcc/omp-expand.cc
+++ b/gcc/omp-expand.cc
@@ -7170,6 +7170,10 @@ expand_omp_simd (struct omp_region *region,

struct omp_for_data *fd)

loop->latch = cont_bb;
add_loop (loop, l1_bb->loop_father);
loop->safelen = safelen_int;
+  loop->needs_max_vf_lowering = is_in_offload_region (region);
+  if (loop->needs_max_vf_lowering)
+ cfun->curr_properties &= ~PROP_gimple_lomp_dev;


Do you really need this for non-SVE arches?
I mean, could you not set loop->needs_max_vf_lowering if maximum
number of poly_int coeffs is 1?  Or if omp_max_vf returns constant or
something similar?

Well, I guess the issue is not really about VLA vectors but when host and 
device have
different max_vf, and selecting optimal max_vf is not really possible during 
omp-low/omp-expand,
since we don't have device's target info available at this point. Andrew's 
recent patch works around this
limitation by searching for "amdgcn" in OFFLOAD_TARGET_NAMES in omp_max_vf, but 
I guess a more general solution
would be to delay lowering max_vf after streaming-out to device irrespective of 
VLA/VLS vectors ?
For AArch64/nvptx offloading with SVE, where host is VLA and device is VLS, the 
issue is more pronounced (failing to compile),
compared to offloading from VLS host to VLS device (selecting sub-optimal 
max_vf).


That patch fixed a couple of cases. The name matching was only used for 
the case where an oversized VF was harmless. The other case where making 
the VF too large would reserve excess memory was deferred to the device 
compiler.


In general, deferring decisions is probably a good idea, but it's not 
always possible, or optimal, and in the above case it certainly wasn't 
the easy option. There's already precedent for doing the name match in 
the SIMT VF code (for NVPTX), so it was easier and sufficient to do the 
same.


Andrew


Re: GSoC 2025: In-Memory Filesystem for GPU Offloading Tests

2025-03-11 Thread Andrew Stubbs

On 10/03/2025 22:56, Arijit Kumar Das wrote:

Hello Andrew,

Thank you for the detailed response! This gives me a much clearer 
picture of how things work.


Regarding the two possible approaches:

  * I personally find *Option A (self-contained in-memory FS)* more
interesting, and I'd like to work on it first.

  * However, if *Option B (RPC-based host FS access)* is the preferred
approach for GSoC, I’d be happy to work on that as well.


I'll defer to Thomas who proposed the project and volunteered to act as 
GSoC mentor. :)


Have fun!

Andrew


Re: GSoC 2025: In-Memory Filesystem for GPU Offloading Tests

2025-03-11 Thread Andrew Stubbs

On 10/03/2025 15:37, Arijit Kumar Das via Gcc wrote:

Hello GCC Community!

I am Arijit Kumar Das, a second-year engineering undergraduate from NIAMT
Ranchi, India. While my major isn’t Computer Science, my passion for system
programming, embedded systems, and operating systems has driven me toward
low-level development. Programming has always fascinated me—it’s like
painting with logic, where each block of code works in perfect
synchronization.

The project mentioned in the subject immediately caught my attention, as I
have been exploring the idea of a simple hobby OS for my Raspberry Pi Zero.
Implementing an in-memory filesystem would be an exciting learning
opportunity, closely aligning with my interests.

I have carefully read the project description and understand that the goal
is to modify *newlib* and the *run tools* to redirect system calls for file
I/O operations to a virtual, volatile filesystem in host memory, as the GPU
lacks its own filesystem. Please correct me if I’ve misunderstood any
aspect.


That was the first of two options suggested.  The other option is to 
implement a pass-through RPC mechanism so that the runtime actually can 
access the real host file-system.


Option A is more self-contained, but requires inventing a filesystem and 
ultimately will not help all the tests pass.


Option B has more communication code, but doesn't require storing 
anything manually, and eventually should give full test coverage.


A simple RPC mechanism already exists for the use of printf (actually 
"write") on GCN, but was not necessary on NVPTX (a "printf" text output 
API is provided by the driver).  The idea is to use a shared memory ring 
buffer that the host "run" tool polls while the GPU kernel is running.
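
Roughly speaking, the shape of it is something like this (a simplified 
illustration, not the real GCN console code; the names, sizes and 
synchronisation details are all glossed over here):

/* Shared between device and host; the host "run" tool polls it.  */
#define RB_SIZE 4096

struct ring_buffer
{
  volatile unsigned int write_pos;   /* advanced by the GPU */
  volatile unsigned int read_pos;    /* advanced by the host */
  char data[RB_SIZE];
};

/* Host side: drain anything the GPU has written since the last poll.
   A real implementation needs proper atomics and memory fences.  */
static void
poll_ring (struct ring_buffer *rb, void (*emit) (char))
{
  while (rb->read_pos != rb->write_pos)
    {
      emit (rb->data[rb->read_pos % RB_SIZE]);
      rb->read_pos++;
    }
}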



I have set up the GCC source tree and am currently browsing relevant files
in the *gcc/testsuite* directory. However, I am unsure *where the run tools
source files are located and how they interact with newlib system calls.*
Any guidance on this would be greatly appreciated so I can get started as
soon as possible!


You'll want to install the toolchain following the instructions at 
https://gcc.gnu.org/wiki/Offloading and try running some simple OpenMP 
target kernels first.  Newlib isn't part of the GCC repo, so if you 
can't find the files then that's probably why!
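
By "simple OpenMP target kernels" I mean something like this (an 
illustrative smoke test, nothing more):

/* Build with "gcc -fopenmp test.c" using the offloading-enabled
   toolchain.  */
#include <stdio.h>

int
main (void)
{
  int x = 0;

#pragma omp target map(tofrom: x)
  x = 42;

  printf ("%d\n", x);  /* 42 whether offloaded or run as host fallback */
  return 0;
}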


The "run" tools are installed as part of the offload toolchain, albeit 
hidden under the "libexec" directory because they're really only used 
for testing. You can find the sources with the config/nvptx or 
config/gcn backend files.


User code is usually written using OpenMP or OpenACC, in which case the 
libgomp target plugins serve the same function as the "run" tools. These 
too could use the file-system access, but it's not clear that there's a 
common use-case for that.  The case should at least fail gracefully 
though (as they do now).


Currently, system calls such as "open" simply fail with EACCES 
("permission denied") so the stub implementations are fairly easy to 
understand (e.g. newlib/libc/sys/amdgcn/open.c).  The task would be to 
insert new code there that actually does something.  You do not need to 
modify the compiler itself.
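
From memory, the existing stub is roughly this shape (simplified; check 
the real newlib/libc/sys/amdgcn/open.c for the exact prototype). The 
project would essentially replace the body with something that either 
consults an in-memory filesystem or forwards the request to the host:

#include <errno.h>

/* Current behaviour: every open() simply fails.  */
int
open (const char *pathname, int flags, ...)
{
  errno = EACCES;
  return -1;
}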


Hope that helps

Andrew



Best regards,
Arijit Kumar Das.

*GitHub:* https://github.com/ArijitKD
*LinkedIn:* https://linkedin.com/in/arijitkd