[Bug target/69460] ARM Cortex M0 produces suboptimal code vs Cortex M3

2017-04-30 Thread strntydog at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69460

strntydog at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|5.2.0                       |6.3.1

--- Comment #4 from strntydog at gmail dot com ---
OK, so I just tested to see if this problem with Cortex M0/M0+ code generation
persists in GCC 6.3.1, which is the latest GCC binary distributed by the Arm
Embedded folks.  And it does.

To put the optimisation failure into perspective, this is the difference
between the M0 and M3 output for each of the 6 tests in the test case:

Test 1 - Code Size is 40% Bigger for M0, and the Function is 114% bigger.
Test 2 - Code Size is 20% bigger for M0, and the Function is 44% bigger.
Test 3 - Code Size is same between M0 and M3, but the Function is 43% bigger.
Test 4 - Code Size is 40% Bigger for M0, and the Function is 86% bigger.
Test 5 - Code Size is same between M0 and M3, but the Function is 14% bigger.
Test 6 - Code Size is 38% Bigger for M0, and the Function is 100% bigger.

These are HUGE.  

This means that on average these functions will run about 22% slower than they
should and consume 67% more flash space than they should. In the worst case
from my tests, a function can be over twice as large as it needs to be and
need 40% more instructions to achieve the same thing.

This problem is easily shown to occur when accessing memory locations at known
addresses, something which microcontroller programs do all the time. It
affects every single M0 application compiled with GCC, wasting flash and
running slower.

Note: Code Size refers to the number of instructions in the function, and the
function size is the code size plus its Literal data.  Code size is a measure
of performance on the M0, because more instructions means more cycles to
execute. And Function size is a measure of flash wastage.

[Bug target/69460] ARM Cortex M0 produces suboptimal code vs Cortex M3

2017-05-01 Thread strntydog at gmail dot com

--- Comment #5 from strntydog at gmail dot com ---
I also just calculated the number of cycles each function takes:

Test 1 - 50% More CPU Cycles
Test 2 - 25% More CPU Cycles
Test 3 - 5% More CPU Cycles
Test 4 - 39% More CPU Cycles
Test 5 - 6% More CPU Cycles
Test 6 - 46% More CPU Cycles

This assumes Zero Wait state access to memory, any wait states will make these
differences worse, as the excess cycles are the result of extra flash accesses.

So, even Tests 3 and 5, which have the same code size, will run ~5% slower than
they should, which is significant.  But the worst cases will be dramatically
slower.

This bug not only leads to slower execution; that in turn has a direct impact
on power efficiency and battery life in battery powered devices (which are a
target market for M0/M0+ processors).

These are extremely common and simple memory access patterns, and every single
M0/M0+ program will be negatively affected by it.

[Bug target/69460] ARM Cortex M0 produces suboptimal code vs Cortex M3

2017-05-04 Thread strntydog at gmail dot com

--- Comment #6 from strntydog at gmail dot com ---
I have built GCC 7.1.0 and have tested this optimization bug against that.  It
persists.  Further, the new target cortex-m23 is affected by the bug, exactly
the same as Cortex M0/M0+ and M1.

The new cortex-m33 target behaves the same as the cortex-m3, in that it
produces code that is legal for the cortex-m23/m0/m0+/m1 but is much better
optimised.

[Bug rtl-optimization/69460] New: ARM Cortex M0 produces suboptimal code vs Cortex M3

2016-01-24 Thread strntydog at gmail dot com

            Bug ID: 69460
           Summary: ARM Cortex M0 produces suboptimal code vs Cortex M3
           Product: gcc
           Version: 5.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: strntydog at gmail dot com
  Target Milestone: ---

Created attachment 37451
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37451&action=edit
Test Source

Tested on GCC 4.9 and 5.2, under Linux (Ubuntu 15.10/64 bit)
I am using the pre-built toolchains available at
https://launchpad.net/gcc-arm-embedded

When compiling for a Cortex M0 target I noticed some poor code generation with
regard to literal tables.  I compared that with the code generated for a
Cortex M3 target, which is much better.  It became apparent that the code
generated for the M3 was actually legal M0 code and so could execute
unmodified on an M0 core.  Accordingly, the Cortex M0 code generator is
needlessly producing suboptimal code vs the same code compiled for Cortex M3.

There are six tests in the test case, all accessing memory via different
patterns.  ALL generate suboptimal code for Cortex M0 vs the Cortex M3 code
generator, yet all code produced is legal Cortex M0 code. Example of the
sub-optimal code generation:

Test 6:

/* Write 8 bit values to known register locations - using an array */
void test6(void)
{
    volatile uint8_t* const r = (uint8_t*)(0x40002800U); // Register Array

    r[0] = 0xFF;
    r[1] = 0xFE;
    r[2] = 0xFD;
    r[3] = 0xFC;
    r[4] = 0xEE;
    r[8] = 0xDD;
    r[12] = 0xCC;
}

Which, at -Os for -mcpu=cortex-m0, results in:
00ec <test6>:
  ec: 22ff movs r2, #255 ; 0xff
  ee: 4b0a ldr r3, [pc, #40] ; (118 )
  f0: 701a strb r2, [r3, #0]
  f2: 4b0a ldr r3, [pc, #40] ; (11c )
  f4: 3a01 subs r2, #1
  f6: 701a strb r2, [r3, #0]
  f8: 4b09 ldr r3, [pc, #36] ; (120 )
  fa: 3a01 subs r2, #1
  fc: 701a strb r2, [r3, #0]
  fe: 4b09 ldr r3, [pc, #36] ; (124 )
 100: 3a01 subs r2, #1
 102: 701a strb r2, [r3, #0]
 104: 4b08 ldr r3, [pc, #32] ; (128 )
 106: 3a0e subs r2, #14
 108: 701a strb r2, [r3, #0]
 10a: 4b08 ldr r3, [pc, #32] ; (12c )
 10c: 3a11 subs r2, #17
 10e: 701a strb r2, [r3, #0]
 110: 4b07 ldr r3, [pc, #28] ; (130 )
 112: 3a11 subs r2, #17
 114: 701a strb r2, [r3, #0]
 116: 4770 bx lr
 118: 40002800 .word 0x40002800
 11c: 40002801 .word 0x40002801
 120: 40002802 .word 0x40002802
 124: 40002803 .word 0x40002803
 128: 40002804 .word 0x40002804
 12c: 40002808 .word 0x40002808
 130: 4000280c .word 0x4000280c

Each element accessed in the array of bytes has resulted in the address of that
element appearing in the literal table. 

By comparison the M3 build generates :

0094 <test6>:
  94: 4b07 ldr r3, [pc, #28] ; (b4 )
  96: 22ff movs r2, #255 ; 0xff
  98: 701a strb r2, [r3, #0]
  9a: 22fe movs r2, #254 ; 0xfe
  9c: 705a strb r2, [r3, #1]
  9e: 22fd movs r2, #253 ; 0xfd
  a0: 709a strb r2, [r3, #2]
  a2: 22fc movs r2, #252 ; 0xfc
  a4: 70da strb r2, [r3, #3]
  a6: 22ee movs r2, #238 ; 0xee
  a8: 711a strb r2, [r3, #4]
  aa: 22dd movs r2, #221 ; 0xdd
  ac: 721a strb r2, [r3, #8]
  ae: 22cc movs r2, #204 ; 0xcc
  b0: 731a strb r2, [r3, #12]
  b2: 4770 bx lr
  b4: 40002800 .word 0x40002800

ALL of which is LEGAL M0 Code.

The Cortex M0 compile is:
72 Bytes Long, 22 Instructions and 7 Literal Table Entries and 7 reads from
Code Space.

The Cortex M3 compile (which generates legal M0 code) is:
36 Bytes Long, 16 Instructions and 1 Literal Table Entry and 1 read from Code
Space.

Significantly more efficient in every respect.

Given that Cortex M0 cores usually have fewer resources than Cortex M3 cores, I
would expect the code generation to be the same between them, except where the
M3 can use an instruction which does not exist on the M0.  This inefficient
code generation will make Cortex M0 cores seem much less efficient and much
slower than they are in reality.

Attached is the test case and a script to build it.

The script builds the code for M0 and M3, it then dumps the M3 assembler,
patches it so that it can be assembled as M0 assembler and assembles the
result.
The reason for that is to confirm that the M3 generated code is LEGAL M0 Code,
which it is.

[Bug rtl-optimization/69460] ARM Cortex M0 produces suboptimal code vs Cortex M3

2016-01-24 Thread strntydog at gmail dot com

--- Comment #1 from strntydog at gmail dot com ---
Created attachment 37452
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37452&action=edit
Script to build the test and generate output files

This script builds the test.c file for both Cortex M0 and M3.  It then checks
if the M3 code generated is legal M0 code by trying to assemble the M3 code
output as M0 assembler.

[Bug rtl-optimization/69460] ARM Cortex M0 produces suboptimal code vs Cortex M3

2016-01-24 Thread strntydog at gmail dot com

--- Comment #2 from strntydog at gmail dot com ---
This code generation problem was also reported at:
https://bugs.launchpad.net/gcc-arm-embedded/+bug/1502611