Re: dom1 prevents vectorization via partial loop peeling?

2015-04-28 Thread Richard Biener
On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law  wrote:
> On 04/27/2015 10:12 AM, Alan Lawrence wrote:
>>
>>
>> After copyrename3, immediately prior to dom1, the loop body looks like:
>>
>>    <bb 2>:
>>
>>    <bb 3>:
>>    # i_11 = PHI <0(2), i_9(7)>
>>    _5 = a[i_11];
>>    _6 = i_11 & _5;
>>    if (_6 != 0)
>>      goto <bb 4>;
>>    else
>>      goto <bb 5>;
>>
>>    <bb 4>:
>>
>>    <bb 5>:
>>    # m_2 = PHI <5(4), 4(3)>
>>    _7 = m_2 * _5;
>>    b[i_11] = _7;
>>    i_9 = i_11 + 1;
>>    if (i_9 != 32)
>>      goto <bb 7>;
>>    else
>>      goto <bb 6>;
>>
>>    <bb 7>:
>>    goto <bb 3>;
>>
>>    <bb 6>:
>>    return;
>>
>> dom1 then peeled part of the first loop iteration, producing:
>
> Yup.  The jump threading code realized that if we traverse the edge 2->3,
> then we'll always traverse 3->5.  The net effect is like peeling the first
> iteration because we copy bb3.  The original will be reached via 7->3 (i.e.,
> the loop iterating), the copy via 2->3', and 3' will have its conditional
> removed and will unconditionally transfer control to bb5.
>
>
>
> This is a known problem, but we really don't have any good heuristics for
> when to do this vs when not to do it.
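For concreteness, here is a hand-written C-level sketch of what that threading
amounts to (the first testcase isn't quoted above, so its shape is inferred
from the dump's PHI <5(4), 4(3)>; the function name is made up and this is an
illustration, not compiler output):

extern int a[32], b[32];

void
foo_after_dom1 (void)
{
  int i = 0;
  int t = a[0];                 /* copy 3' of the header: the load survives, */
  int m = 4;                    /* but (a[0] & 0) == 0 folds the condition,  */
  for (;;)                      /* so control enters directly at bb5         */
    {
      /* bb5: the join block */
      b[i] = t * m;
      i = i + 1;
      if (i == 32)
        break;
      /* original bb3: its test now sits at the bottom, so the loop is
         rotated and the latch is no longer empty */
      t = a[i];
      m = (t & i) ? 5 : 4;
    }
}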
>
>
>>
>> In contrast, a slightly-different testcase:
>>
>> #define N 32
>>
>> int a[N];
>> int b[N];
>>
>> int foo ()
>> {
>>for (int i = 0; i < N ; i++)
>>  {
>>int cond = (a[i] & i) ? -1 : 0; // extra variable here
>>int m = (cond) ? 5 : 4;
>>b[i] = a[i] * m;
>>  }
>> }
>>
>> after copyrename3, just before dom1, is only slightly different:
>>
>>    <bb 2>:
>>
>>    <bb 3>:
>>    # i_15 = PHI <0(2), i_10(7)>
>>    _6 = a[i_15];
>>    _7 = i_15 & _6;
>>    if (_7 != 0)
>>      goto <bb 4>;
>>    else
>>      goto <bb 6>;
>>
>>    <bb 4>:
>>    # m_3 = PHI <4(6), 5(3)>
>>    _8 = m_3 * _6;
>>    b[i_15] = _8;
>>    i_10 = i_15 + 1;
>>    if (i_10 != 32)
>>      goto <bb 7>;
>>    else
>>      goto <bb 5>;
>>
>>    <bb 7>:
>>    goto <bb 3>;
>>
>>    <bb 5>:
>>    return;
>>
>>    <bb 6>:
>>    goto <bb 4>;
>>
>> with bb6 being out-of-line at the end of the function, rather than bb4
>> falling through from just above bb5. However, this prevents dom1 from
>> doing the partial peeling, and dom1 only changes the "goto bb7" into a
>> "goto bb3":
>
> I would have still expected it to thread 2->3, 3->6->4
>
>
>>
>> (1) dom1 should really, in the second case, perform the same partial
>> peeling that it does in the first testcase, if that is what it thinks is
>> desirable. (Of course, we might want to fix that only later, as ATM
>> that'd take us backwards).
>
> Please file a BZ.  It could be something simple, or we might be hitting
> one of Zdenek's heuristics around keeping overall loop structure.
>
>>
>> Alternatively, maybe we don't want dom1 doing that sort of thing (?),
>> but I'm inclined to think that if it's doing such optimizations, it's
>> for a good reason ;) I guess there'll be other times where we *cannot*
>> do partial peeling of later iterations...
>
> It's an open question -- we never reached any kind of conclusion when it was
> last discussed with Zdenek.  I think the fundamental issue is we can't
> really predict when threading the loop is going to interfere with later
> optimizations or not.  The heuristics we have are marginal at best.
>
> The one thought we've never explored was re-rolling that first iteration
> back into the loop in the vectorizer.

Well.  In this case we hit

  /* If one of the loop header's edge is an exit edge then do not
 apply if-conversion.  */
  FOR_EACH_EDGE (e, ei, loop->header->succs)
if (loop_exit_edge_p (loop, e))
  return false;

which is simply because even after if-conversion we'll at least end up
with a non-empty latch block which is what the vectorizer doesn't support.
DOM rotated the loop into this non-canonical form.  Running loop header
copying again would probably undo this.
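For comparison, a hand-written sketch of the canonical shape that loop header
copying would roughly restore (an illustration of the do-while form with an
empty latch that the vectorizer expects; the function name is made up, and
this is not a GCC dump):

extern int a[32], b[32];

void
foo_canonical (void)
{
  int i = 0;
  do
    {
      int t = a[i];
      int m = (t & i) ? 5 : 4;  /* if-convertible to a COND_EXPR */
      b[i] = t * m;
      i = i + 1;
    }
  while (i != 32);              /* exit test at the bottom, empty latch */
}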

Richard.

>
> Jeff


Re: missing explanation of Stage 4 in GCC Development Plan document

2015-04-28 Thread Richard Biener
On Tue, Apr 28, 2015 at 7:01 AM, Thomas Preud'homme
 wrote:
>> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On
>> Behalf Of James Greenhalgh
>
> Hi James,
>
>>
>> The stages, timings, and exact rules for which patches are acceptable
>> and when, seem to have drifted quite substantially from that page.
>> Stage 2 has been missing for 7 years now, Stages 3 and 4 seem to blur
>> together, the "regression only" rule is more like "non-invasive fixes
>> only" (likewise for the support branches).
>
> Don't stage3 and stage4 differ in that substantial changes are still allowed
> for backends in stage3?

stage3 is for _general_ bugfixing while stage4 is for _regression_ bugfixing.

Richard.

>
>>
>> So, why not try to reflect practice and form a two stage model (and
>> name the stages in a descriptive fashion)?
>>
>>   Development:
>>
>>   Expected to last for around 70% of a release cycle. During this
>>   period, changes of any nature may be made to the compiler. In
>> particular,
>>   major changes may be merged from branches. In order to avoid chaos,
>>   the Release Managers will ask for a list of major projects proposed for
>>   the coming release cycle before the start of this stage. They will
>>   attempt to sequence the projects in such a way as to cause minimal
>>   disruption. The Release Managers will not reject projects that will be
>>   ready for inclusion before the end of the development phase. Similarly,
>>   the Release Managers have no special power to accept a particular
>>   patch or branch beyond what their status as maintainers affords.
>>   The role of the Release Managers is merely to attempt to order the
>>   inclusion of major features in an organized manner.
>>
>>   Stabilization:
>>
>>   Expected to last for around 30% of a release cycle. New functionality
>>   may not be introduced during this period. Changes during this phase
>>   of the release cycle should focus on preparing the trunk for a high
>>   quality release, free of major regression and code generation issues.
>>   As we near the end of a release cycle, changes will only be accepted
>>   where they fix a regression, or are sufficiently non-intrusive as to
>>   not introduce a risk of affecting the quality of the release.
>
> If we keep referring to stages in our communication, it would be nice to
> document them. I'm not saying this rewording is wrong; I just think we
> should add 1-2 sentences to explain the stages (I know it confused me
> at first because stage4 was not listed). Alternatively, we could just
> refer to these two names only (development and stabilization).
>
> Best regards,
>
> Thomas
>
>


RE: dom1 prevents vectorization via partial loop peeling?

2015-04-28 Thread Ajit Kumar Agarwal


-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard 
Biener
Sent: Tuesday, April 28, 2015 4:12 PM
To: Jeff Law
Cc: Alan Lawrence; gcc@gcc.gnu.org
Subject: Re: dom1 prevents vectorization via partial loop peeling?

On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law  wrote:
> On 04/27/2015 10:12 AM, Alan Lawrence wrote:
>>
>>
>> After copyrename3, immediately prior to dom1, the loop body looks like:
>>
>>    <bb 2>:
>>
>>    <bb 3>:
>>    # i_11 = PHI <0(2), i_9(7)>
>>    _5 = a[i_11];
>>    _6 = i_11 & _5;
>>    if (_6 != 0)
>>      goto <bb 4>;
>>    else
>>      goto <bb 5>;
>>
>>    <bb 4>:
>>
>>    <bb 5>:
>>    # m_2 = PHI <5(4), 4(3)>
>>    _7 = m_2 * _5;
>>    b[i_11] = _7;
>>    i_9 = i_11 + 1;
>>    if (i_9 != 32)
>>      goto <bb 7>;
>>    else
>>      goto <bb 6>;
>>
>>    <bb 7>:
>>    goto <bb 3>;
>>
>>    <bb 6>:
>>    return;
>>
>> dom1 then peeled part of the first loop iteration, producing:
>
> Yup.  The jump threading code realized that if we traverse the edge
> 2->3, then we'll always traverse 3->5.  The net effect is like peeling
> the first iteration because we copy bb3.  The original will be reached
> via 7->3 (i.e., the loop iterating), the copy via 2->3', and 3' will have
> its conditional removed and will unconditionally transfer control to bb5.
>
>
>
> This is a known problem, but we really don't have any good heuristics 
> for when to do this vs when not to do it.
>
>
>>
>> In contrast, a slightly-different testcase:
>>
>> #define N 32
>>
>> int a[N];
>> int b[N];
>>
>> int foo ()
>> {
>>for (int i = 0; i < N ; i++)
>>  {
>>int cond = (a[i] & i) ? -1 : 0; // extra variable here
>>int m = (cond) ? 5 : 4;
>>b[i] = a[i] * m;
>>  }
>> }
>>
>> after copyrename3, just before dom1, is only slightly different:
>>
>>    <bb 2>:
>>
>>    <bb 3>:
>>    # i_15 = PHI <0(2), i_10(7)>
>>    _6 = a[i_15];
>>    _7 = i_15 & _6;
>>    if (_7 != 0)
>>      goto <bb 4>;
>>    else
>>      goto <bb 6>;
>>
>>    <bb 4>:
>>    # m_3 = PHI <4(6), 5(3)>
>>    _8 = m_3 * _6;
>>    b[i_15] = _8;
>>    i_10 = i_15 + 1;
>>    if (i_10 != 32)
>>      goto <bb 7>;
>>    else
>>      goto <bb 5>;
>>
>>    <bb 7>:
>>    goto <bb 3>;
>>
>>    <bb 5>:
>>    return;
>>
>>    <bb 6>:
>>    goto <bb 4>;
>>
>> with bb6 being out-of-line at the end of the function, rather than 
>> bb4 falling through from just above bb5. However, this prevents dom1 
>> from doing the partial peeling, and dom1 only changes the "goto bb7" 
>> into a "goto bb3":
>
> I would have still expected it to thread 2->3, 3->6->4
>
>
>>
>> (1) dom1 should really, in the second case, perform the same partial 
>> peeling that it does in the first testcase, if that is what it thinks 
>> is desirable. (Of course, we might want to fix that only later, as 
>> ATM that'd take us backwards).
>
> Please file a BZ.  It could be something simple, or we might be
> hitting one of Zdenek's heuristics around keeping overall loop structure.
>
>>
>> Alternatively, maybe we don't want dom1 doing that sort of thing (?), 
>> but I'm inclined to think that if it's doing such optimizations, it's 
>> for a good reason ;) I guess there'll be other times where we 
>> *cannot* do partial peeling of later iterations...
>
> It's an open question -- we never reached any kind of conclusion when 
> it was last discussed with Zdenek.  I think the fundamental issue is 
> we can't really predict when threading the loop is going to interfere 
> with later optimizations or not.  The heuristics we have are marginal at best.
>
> The one thought we've never explored was re-rolling that first 
> iteration back into the loop in the vectorizer.

>> Well.  In this case we hit
>>
>>   /* If one of the loop header's edge is an exit edge then do not
>>      apply if-conversion.  */
>>   FOR_EACH_EDGE (e, ei, loop->header->succs)
>>     if (loop_exit_edge_p (loop, e))
>>       return false;
>>
>> which is simply because even after if-conversion we'll at least end up with a
>> non-empty latch block which is what the vectorizer doesn't support.
>> DOM rotated the loop into this non-canonical form.  Running loop header
>> copying again would probably undo this.

With the path-splitting approach I have proposed, the back-edge node can be
copied into its predecessors, leaving an empty latch. This would enable
vectorization in the above scenario.
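For illustration, a hand-written C-level sketch of that path splitting applied
to the second testcase (the function name is made up and the shape is assumed;
this is not compiler output): the join block's work is duplicated into both
predecessors, so the latch that remains is empty.

extern int a[32], b[32];

void
foo_path_split (void)
{
  for (int i = 0; i < 32; )
    {
      int t = a[i];
      if (t & i)
        {
          b[i] = t * 5;         /* copy of the join block, m == 5 */
          i = i + 1;
        }
      else
        {
          b[i] = t * 4;         /* copy of the join block, m == 4 */
          i = i + 1;
        }
      /* latch: now empty */
    }
}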

Thanks & Regards
Ajit

>> Richard.

>
> Jeff


Re: dom1 prevents vectorization via partial loop peeling?

2015-04-28 Thread Alan Lawrence

Ajit Kumar Agarwal wrote:


-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard 
Biener
Sent: Tuesday, April 28, 2015 4:12 PM
To: Jeff Law
Cc: Alan Lawrence; gcc@gcc.gnu.org
Subject: Re: dom1 prevents vectorization via partial loop peeling?

On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law  wrote:

On 04/27/2015 10:12 AM, Alan Lawrence wrote:


After copyrename3, immediately prior to dom1, the loop body looks like:

   <bb 2>:

   <bb 3>:
   # i_11 = PHI <0(2), i_9(7)>
   _5 = a[i_11];
   _6 = i_11 & _5;
   if (_6 != 0)
     goto <bb 4>;
   else
     goto <bb 5>;

   <bb 4>:

   <bb 5>:
   # m_2 = PHI <5(4), 4(3)>
   _7 = m_2 * _5;
   b[i_11] = _7;
   i_9 = i_11 + 1;
   if (i_9 != 32)
     goto <bb 7>;
   else
     goto <bb 6>;

   <bb 7>:
   goto <bb 3>;

   <bb 6>:
   return;

dom1 then peeled part of the first loop iteration, producing:
Yup.  The jump threading code realized that if we traverse the edge
2->3, then we'll always traverse 3->5.  The net effect is like peeling
the first iteration because we copy bb3.  The original will be reached
via 7->3 (i.e., the loop iterating), the copy via 2->3', and 3' will have
its conditional removed and will unconditionally transfer control to bb5.




This is a known problem, but we really don't have any good heuristics 
for when to do this vs when not to do it.


Ah, yes, I'd not realized this was connected to the jump-threading issue, but I
see that now. As you say, the best heuristics are unclear, and I'm not keen on
trying *too hard* to predict what later phases will/won't do or do/don't
want... maybe if there are simple heuristics that work, but I would aim more at
making later phases work with whatever they might get.


One (horrible) possibility that I will just throw out (and then duck) is to do
something akin to tree-if-conversion's "gimple_build_call_internal
(IFN_LOOP_VECTORIZED, " ...
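For readers unfamiliar with that mechanism, here is a rough C-level caricature
of the loop versioning it performs. The guard in the real thing is an internal
function call built in GIMPLE and later folded by the vectorizer to a constant;
the stub name and function name below are purely illustrative.

extern int a[32], b[32];

/* Stands in for IFN_LOOP_VECTORIZED; the vectorizer folds the real
   call to 1 or 0 once it decides whether vectorization succeeded.  */
extern int loop_vectorized_p (int, int);

void
foo_versioned (void)
{
  if (loop_vectorized_p (1, 2))
    {
      /* copy of the loop kept in a vectorizer-friendly form */
      for (int i = 0; i < 32; i++)
        b[i] = a[i] * ((a[i] & i) ? 5 : 4);
    }
  else
    {
      /* scalar fallback copy */
      for (int i = 0; i < 32; i++)
        b[i] = a[i] * ((a[i] & i) ? 5 : 4);
    }
}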



In contrast, a slightly-different testcase:

>>> [snip]

I would have still expected it to thread 2->3, 3->6->4


Ok, I'll look into that.

(1) dom1 should really, in the second case, perform the same partial 
peeling that it does in the first testcase, if that is what it thinks 
is desirable. (Of course, we might want to fix that only later, as 
ATM that'd take us backwards).
Please file a BZ.  It could be something simple, or we might be
hitting one of Zdenek's heuristics around keeping overall loop structure.


Alternatively, maybe we don't want dom1 doing that sort of thing (?), 
but I'm inclined to think that if it's doing such optimizations, it's 
for a good reason ;) I guess there'll be other times where we 
*cannot* do partial peeling of later iterations...
It's an open question -- we never reached any kind of conclusion when 
it was last discussed with Zdenek.  I think the fundamental issue is 
we can't really predict when threading the loop is going to interfere 
with later optimizations or not.  The heuristics we have are marginal at best.


The one thought we've never explored was re-rolling that first 
iteration back into the loop in the vectorizer.


Yeah, there is that ;).

So besides trying to partially-peel the next N iterations, the other approach -
that strikes me as sanest - is to finish (fully-)peeling off the first
iteration, and then to vectorize from then on. In fact the ideal (I confess I
have no knowledge of the GCC representation/infrastructure here) would probably
be for the vectorizer (in vect_analyze_scalar_cycles) to look for a point in the
loop, or rather a 'cut' across the loop, that avoids breaking any non-cyclic
use-def chains, and to use that as the loop header. That analysis could be quite
complex, though ;) ...and I can see that having peeled the first 1/2 iteration,
we may then end up having to peel the next (vectorization factor - 1/2)
iterations too to restore alignment!


Whereas with rerolling ;) ...is there perhaps some reasonable way to keep
markers around to make the rerolling approach more feasible?



Well.  In this case we hit



  /* If one of the loop header's edge is an exit edge then do not
     apply if-conversion.  */
  FOR_EACH_EDGE (e, ei, loop->header->succs)
    if (loop_exit_edge_p (loop, e))
      return false;


which is simply because even after if-conversion we'll at least end up with a 
non-empty latch block which is what the vectorizer doesn't support.
DOM rotated the loop into this non-canonical form.  Running loop header copying 
again would probably undo this.


So I've just posted https://gcc.gnu.org/ml/gcc-patches/2015-04/msg01745.html 
which fixes this limitation of if-conversion. As I first wrote though, the 
vectorizer still fails, because the PHI nodes incoming to the loop header are 
neither reductions nor inductions.


I'll see if I can run loop header copying again, as you suggest...


5.1.0/4.9.2 native mingw64 lto-wrapper.exe issues (PR 65559 and 65582)

2015-04-28 Thread Matt Breedlove
I was told I should repost this on this ML rather than the gcc-help
list I originally posted this under.  Here was my original thread:

https://gcc.gnu.org/ml/gcc-help/2015-04/msg00167.html

I came across PR 65559 and 65582 while investigating why I was getting
the "lto1.exe: internal compiler error: in read_cgraph_and_symbols, at
lto/lto.c:2947" error during a native MINGW64 LTO build.  This also
seems to be present when enabling bootstrap-lto within 5.1.0,
presenting an error message akin to what is listed in PR 65582.

1.

Under:
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/lto-wrapper.c;h=404cb68e0d1f800628ff69b7672385b88450a3d5;hb=HEAD#l927

lto-wrapper processes command-line params for filenames matching (in my
case) "./.libs/libspeexdsp.a@0x44e26" and separates the filename from
the offset into separate variables.  Since the following check to see
if that file exists by opening it doesn't use the parsed filename
variable and instead continues to use the argv parameter, the attempt
to open it always fails and that file is not specifically parsed for
LTO options.
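A minimal sketch of the buggy pattern as I understand it (the helper name and
parsing details are illustrative, not the verbatim lto-wrapper.c code around
the line linked above):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
open_lto_argument (const char *arg)
{
  char *filename = strdup (arg);          /* "./.libs/libspeexdsp.a@0x44e26" */
  char *at = strrchr (filename, '@');
  long long offset = 0;

  if (at != NULL)
    {
      offset = strtoll (at + 1, NULL, 0); /* parse the "@0x44e26" part */
      *at = '\0';                         /* filename is now split off */
    }

  /* The reported bug: the existence check opens the original argv
     string rather than the parsed filename, so for "file@offset"
     arguments it always fails and the file is never scanned for LTO
     options.  */
  int fd = open (arg, O_RDONLY);          /* should be open (filename, ...) */

  if (fd < 0)
    fprintf (stderr, "skipping %s (offset %lld)\n", filename, offset);
  free (filename);
  return fd;
}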


2.

One other issue I've noticed in my build happens as a result of the
open call when trying to parse the options using libiberty.  Under
native mingw64, the open call opens the object file in text mode and
then passes the fd eventually to libiberty's
simple_object_internal_read within simple-object.c.  The issue springs
up when a read hits a CTRL+Z (0x1A) within the object: the next read
returns 0 bytes, which breaks the loop and triggers a subsequent "file
too short" error message that gets silently ignored.  In my testing,
changing the 0x1A within the object file to something else returns the
full read (or more data, until another CTRL+Z is hit).

Ref: https://msdn.microsoft.com/en-us/library/wyssk1bs.aspx
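A sketch of the usual remedy for this class of problem (hedged: whether the
right place to fix it is the caller or libiberty is a separate question, and
the function name here is illustrative) is to open object files in binary mode
so the CRT does not treat 0x1A as end-of-file:

#include <fcntl.h>

#ifndef O_BINARY
#define O_BINARY 0              /* no-op on POSIX hosts */
#endif

static int
open_object_file (const char *file)
{
  /* Text mode is the mingw default; O_BINARY stops the CRT from
     treating CTRL+Z (0x1A) as EOF mid-read.  */
  return open (file, O_RDONLY | O_BINARY);
}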

This still happens with 4.9.2 and 4.9 trunk; however, in 4.9 the
object file being checked for LTO sections is still passed along in
the command line, whereas in 5.1.0 it gets skipped but is still listed
within the res file, most likely leading to the ICE in PR 65559.  This
would also explain Kai's comment on why this issue only occurs on
native builds.  The ICE in 5.1.0 can also be avoided by using an
lto-wrapper from 4.9 or earlier, allowing the link to complete, though
no LTO options will get processed due to #1.


This is my first report, so I wouldn't mind some guidance.  I'm
familiar enough with debugging to gather whatever other details
are requested.  Most of this was found using gdb.

--
Matt Breedlove


Re: PR65416, alloca on xtensa

2015-04-28 Thread Florian Weimer
On 03/13/2015 06:04 PM, Marc Gauthier wrote:

> Other than the required 16-byte stack alignment, there's nothing in
> the ABI that requires these extra 16 bytes.  Perhaps there was a bad
> implementation of the alloca exception handler at some point a long
> time ago that prompted the extra 16 bytes?

What's the alignment of max_align_t on this architecture?

Although it should be possible to get a 16-byte aligned 1-byte object in
any (however misaligned) 16-byte window on the stack …
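Concretely, a small sketch of that observation (the helper name is
illustrative): any 16-byte window contains a 16-byte-aligned address, found by
rounding up.

#include <stdint.h>

static void *
align16_in_window (void *window_start)
{
  uintptr_t p = (uintptr_t) window_start;
  return (void *) ((p + 15) & ~(uintptr_t) 15);   /* round up to 16 */
}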

-- 
Florian Weimer / Red Hat Product Security


Running GCC testsuite with --param option (requires space in argument)

2015-04-28 Thread Steve Ellcey
Has anyone run the GCC testsuite using a --param option?  I am trying
to do something like:

export RUNTESTFLAGS='--target_board=multi-sim/--param foo=1'
make check

But the space in the '--param foo=1' option is causing dejagnu to fail.
Perhaps there is a way to specify a param value without a space in the
option?  If there is, I could not find it.

I tried:

export RUNTESTFLAGS='--target_board=multi-sim/--param\ foo=1'
export RUNTESTFLAGS='--target_board=multi-sim/--param/foo=1'

But neither of those worked either.

Steve Ellcey
sell...@imgtec.com


Re: Running GCC testsuite with --param option (requires space in argument)

2015-04-28 Thread Jakub Jelinek
On Tue, Apr 28, 2015 at 01:55:42PM -0700, Steve Ellcey  wrote:
> Has anyone run the GCC testsuite using a --param option?  I am trying
> to do something like:
> 
> export RUNTESTFLAGS='--target_board=multi-sim/--param foo=1'
> make check
> 
> But the space in the '--param foo=1' option is causing dejagnu to fail.
> Perhaps there is a way to specify a param value without a space in the
> option?  If there is I could not find it.
> 
> I tried:
> 
> export RUNTESTFLAGS='--target_board=multi-sim/--param\ foo=1'
> export RUNTESTFLAGS='--target_board=multi-sim/--param/foo=1'

Have you tried
export RUNTESTFLAGS='--target_board=multi-sim/--param=foo=1'
?

Jakub


Re: Running GCC testsuite with --param option (requires space in argument)

2015-04-28 Thread Steve Ellcey
On Tue, 2015-04-28 at 22:58 +0200, Jakub Jelinek wrote:

> > I tried:
> > 
> > export RUNTESTFLAGS='--target_board=multi-sim/--param\ foo=1'
> > export RUNTESTFLAGS='--target_board=multi-sim/--param/foo=1'
> 
> Have you tried
> export RUNTESTFLAGS='--target_board=multi-sim/--param=foo=1'
> ?
> 
>   Jakub

Nope, but it seems to work.  That syntax is not documented in
invoke.texi.  I will see about submitting a patch (or at least a
documentation bug report).

Steve Ellcey



gcc-5-20150428 is now available

2015-04-28 Thread gccadmin
Snapshot gcc-5-20150428 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/5-20150428/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 5 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-5-branch 
revision 222550

You'll find:

 gcc-5-20150428.tar.bz2   Complete GCC

  MD5=6068bb8e23caa1172a127026e05ed311
  SHA1=7a274cf30fdf3aa1bc347be68787212e1c90ac7d

Diffs from 5-20150421 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-5
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


avr-gcc generating really dumb code

2015-04-28 Thread Ralph Doncaster
I wrote a small function to convert u8 to hex:
// converts 4-bit nibble to ascii hex
uint8_t nibbletohex(uint8_t value)
{
if ( value > 9 ) value += 'A' - '0';
return value + '0';
}

// returns value as 2 ascii characters in a 16-bit int
uint16_t u8tohex(uint8_t value)
{
uint16_t hexdigits;

uint8_t hidigit = (value >> 4);
hexdigits = (nibbletohex(hidigit) << 8);

uint8_t lodigit = (value & 0x0F);
hexdigits |= nibbletohex(lodigit);

return hexdigits;
}

I compiled it with avr-gcc -Os using 4.8 and 5.1 and got the same code:
0000007a <u8tohex>:
  7a:   28 2f   mov  r18, r24
  7c:   22 95   swap r18
  7e:   2f 70   andi r18, 0x0F  ; 15
  80:   2a 30   cpi  r18, 0x0A  ; 10
  82:   08 f0   brcs .+2        ; 0x86
  84:   2f 5e   subi r18, 0xEF  ; 239
  86:   20 5d   subi r18, 0xD0  ; 208
  88:   30 e0   ldi  r19, 0x00  ; 0
  8a:   32 2f   mov  r19, r18
  8c:   22 27   eor  r18, r18
  8e:   8f 70   andi r24, 0x0F  ; 15
  90:   8a 30   cpi  r24, 0x0A  ; 10
  92:   08 f0   brcs .+2        ; 0x96
  94:   8f 5e   subi r24, 0xEF  ; 239
  96:   80 5d   subi r24, 0xD0  ; 208
  98:   a9 01   movw r20, r18
  9a:   48 2b   or   r20, r24
  9c:   ca 01   movw r24, r20
  9e:   08 95   ret

There's some completely pointless code there, like loading 0 to r19
and immediately overwriting it with the contents of r18 (88, 8a).
Other register use is convoluted.  The compiler should at least be
able to generate the following code (5 fewer instructions):
28 2f   mov  r18, r24
22 95   swap r18
2f 70   andi r18, 0x0F  ; 15
2a 30   cpi  r18, 0x0A  ; 10
08 f0   brcs .+2        ; 0x86
2f 5e   subi r18, 0xEF  ; 239
20 5d   subi r18, 0xD0  ; 208
32 2f   mov  r25, r18
8f 70   andi r24, 0x0F  ; 15
8a 30   cpi  r24, 0x0A  ; 10
08 f0   brcs .+2        ; 0x96
8f 5e   subi r24, 0xEF  ; 239
80 5d   subi r24, 0xD0  ; 208
08 95   ret

Hand-optimized for size, I was able to write it with 3 fewer instructions:
.macro addi Rd, K
subi \Rd, -(\K)
.endm

.global u8tohex
u8tohex:
mov r0, r24
swap r24
rcall nibbletohex   ; convert hi digit
mov r25, r24
mov r24, r0
; fall into nibbletohex to convert lo digit

; convert lower nibble of byte to ascii hex char
nibbletohex:
andi r24, 0x0F
cpi r24, 10
brlo under10
addi r24, 'A'-'0';
under10:
addi r24, '0'
ret

I have no intention of learning GIMPLE and the internals of the GCC
back end, but maybe someone from Atmel can fix this?