[RFC] NEON vs. ARM register selection
Hi All,

As you know, the compiler currently has difficulty deciding whether or not to do an operation in NEON. As I see it there are three problems:

1. Simply, is it profitable?

NEON can do many DImode operations in one or two instructions where 2 to 10 normal ARM/Thumb instructions would be required (not to mention the added register pressure), but there is a cost associated with moving the inputs to NEON, and the results back. If the data can stay in NEON for more than one operation, then that's even better. If the data must be loaded from memory, and the result stored back to memory, then it's only a question of whether the register space is available, or not. Currently these decisions are made in the IRA/reload passes.

2. Values that originate in hard registers stay there.

This applies to function parameters, mostly, but also in general where the result of an operation is allocated first. If there is no instruction that can use the value there, then the value is 'reloaded' to a more suitable register. If there is any alternative that avoids the move, then the register allocator will use it, regardless of the relative costs of the other alternatives. This problem is reduced where an operation and move can happen in one instruction, but NEON instructions do not do this much. We can write insns that appear to do it, but these output multiple instructions (see my recent core-SI=>NEON-DI extend patch).

3. It all happens too late.

The decision whether to use NEON or not is not made until register allocation time. Naturally this means that most of the optimization passes are already completed. Part of the problem is that the operation almost certainly needs splitting (into whatever form was chosen) and this might not be straightforward post-reload. (However, the split1 pass is already quite late, so perhaps this isn't such a big deal.)
Another part of the problem is that passes such as the two lower-subreg passes make assumptions about the register width which are not accurate if the operation is to end up in NEON.

There are other, lesser problems: it is hard to adjust the costs for different cores (A8 in particular), and the cost of generating an immediate constant can't be known until it's known what instructions will be used to generate it.

These problems are not specific to NEON, of course. I believe IWMMXT suffers from the same issues. Likewise the C6X port, and also the i386 MMX to some degree. Anything that has instructions that only operate on a subset of registers, basically.

So, Bernd has suggested an outline of a solution. I've quizzed him on this, added a few of my own ideas, and probably a good selection of misunderstandings, bad assumptions, and general cock ups, and come up with something I can write here for comment. I can post something upstream later if it doesn't get totally shot down now.

The basic idea is that we add a new RTL optimization pass (or two) that assesses the usage of pseudo registers, and makes recommendations about what register class each should end up in, if there's a choice. These recommendations would then be used by later passes to make better use of NEON. I might call this the "prealloc" pass, or something.

Firstly, for each pseudo-register in a function, the pass would look at the insn constraints for each "def" and "use", and see how the registers relate to one another. This might determine things like "if rN is in class A, then rM must also be in class A".

E.g. if you have two registers with constraints like this:

    "r,w" "r,w" ..

(and 'r' and 'w' do not overlap) then you know that there is a choice between one mode or another, whereas this:

    "r,w,r,w" "r,w,w,r" ..

would impose no restrictions and we can carry on as normal.
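To make the constraint example above concrete, here is a minimal sketch in Python (not GCC code) of the "are these two registers tied to the same class?" test. It assumes a heavily simplified constraint model: each operand's constraint string is a comma-separated list with exactly one class letter per alternative, and the classes do not overlap. Real constraint strings also carry modifiers, matching constraints and multi-letter alternatives, none of which are handled. The function names `allowed_combinations` and `classes_are_tied` are hypothetical, made up for this illustration.

```python
from itertools import product


def allowed_combinations(operand_constraints):
    """Return the set of class tuples permitted by an insn's alternatives.

    Simplifying assumption: each constraint string holds exactly one
    register-class letter per alternative, e.g. "r,w".
    """
    per_operand = [c.split(",") for c in operand_constraints]
    n_alts = len(per_operand[0])
    if any(len(alts) != n_alts for alts in per_operand):
        raise ValueError("every operand must list the same number of alternatives")
    # Alternative i pairs up the i-th entry of every operand.
    return {tuple(alts[i] for alts in per_operand) for i in range(n_alts)}


def classes_are_tied(operand_constraints):
    """True if choosing a class for one operand restricts the others."""
    combos = allowed_combinations(operand_constraints)
    # Classes each operand could take when looked at in isolation.
    per_operand_classes = [set(column) for column in zip(*combos)]
    # If every mix of those classes is allowed, there is no restriction.
    return combos != set(product(*per_operand_classes))


print(classes_are_tied(["r,w", "r,w"]))          # True: both in 'r' or both in 'w'
print(classes_are_tied(["r,w,r,w", "r,w,w,r"]))  # False: any mix is allowed
```

In this model the first insn only permits (r, r) or (w, w), so the two pseudos would land in the same set and must be decided together; the second permits all four pairings and so links nothing.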
Having done that, we'd end up with sets of pseudo-registers that must make a decision one way or the other, and we'd know where the operations are that would force a move from one class to the other.

There's a fair amount of handwavium in there at present, because I've not worked out what to do with overlapping register classes (think VFP_LO_REGS) and all the other complications.

Secondly, the pass would consider the costs of each alternative, and store a recommended register class for each pseudo-register in a table somewhere. It would also create new pseudos and insert extra move instructions at the register file boundaries where an existing register would have had split recommendations (this would solve problem 2 above).

Again, there's handwavium in "consider the costs". This isn't too hard for size-optimization (assuming the "length" attributes on the insns are correct), but more difficult for speed optimization. Factors to include would be the move costs (here the A8
Re: Please pull upstream rev. 184603 into gcc-linaro 4.7
On Wed 29 Feb 2012 18:05:46 GMT, Andrew Stubbs wrote:
> On 29/02/12 17:23, Bernhard Rosenkränzer wrote:
>> Hi,
>> 184603 fixes an ICE we're running into with Android test builds.
>> Please pull it in ASAP so I don't have to mess with the CFLAGS as a
>> workaround.
>
> There's a merge from r184662 begun testing today. That should cover your
> revision. I'll commit it if it's not borked when the tests come back.
> That'll be tomorrow or Friday, I expect.

Now committed.

Bero, give it a go and see if it does what you need, please.

Andrew
___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
lp:gcc/4.7
Hi Matthias,

GCC 4.7.0 has branched upstream. SVN trunk is now 4.8.

Could you please create lp:gcc/4.7 from the release branch?

Thanks

Andrew
[ACTIVITY] report week 09
Current Milestones:
|| || Planned || Estimate || Actual ||
||cp15-rework || 2012-01-06 || 2012-??-?? || ||

(new blueprints & reestimate for this one pending)

Historical Milestones:
||a15-usermode-support || 2011-11-10 || 2011-11-10 || 2011-10-27 ||
||upstream-omap3-cleanup || 2011-11-10 || 2011-12-15 || 2011-12-12 ||
||initial-a15-system-model || 2012-01-27 || 2012-01-27 || 2012-01-17 ||
||qemu-kvm-getting-started || 2012-03-04? || 2012-03-04? || 2012-02-01 ||

== cp15-rework ==
 * ploughing through conversion of cp15 registers to new design: patchset now 20 patches long, still TODO crn={0,1,6,7,9}

== other ==
 * reviewed more Xilinx Zynq model patches
 * looking at BE8 support: Paul Brook has posted some patches to support this in user mode
 * LP:944645: fixed bug where we weren't clearing the IT bits when entering an M profile exception handler
 * sent out an arm-devs.next pullreq
 * trying to track down why linux-user is failing brk() and thus causing bash segfaults
Re: [RFC] NEON vs. ARM register selection
On 2 March 2012 12:29, Andrew Stubbs wrote:
> Hi All,
>
> As you know, the compiler currently has difficulties choosing between
> whether to do an operation in NEON or not.
>

I have put this on the agenda for Tuesday's call. There is a bit of detail here that I haven't fully digested, which is why I didn't respond in any detail earlier.

Ramana
Agenda for next performance call
is now here:

https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings/2012-03-06

Please add any topics that you might consider interesting for next time.

Ramana
[Activity ] Week 9
=== Progress ===
 * Finished off PGO patch - sent upstream.
 * Finished off the ABI tests - sent upstream.
 * Investigated fixes for LP 942307 - a problem with kernel builds for Android. Backported a fix from Uli last year.
 * Upstream patch review.
 * Small configury done for SPEC2k as far as HC partitioning goes.
 * Some Android benchmark investigations.
 * Recovered from a broken upgrade on my laptop from natty to oneiric, and then went all the way to Precise. It works reasonably well!

=== Plans ===
 * Commit all approved and tested patches.
 * Check on HC partitioning results from SPEC2k and make sure there is an improvement and the feature works!
 * Investigate https://bugs.launchpad.net/gcc-linaro/+bug/924726 in a little more detail.
 * Get back to partial-partial PRE.

=== Absences ===
 * 1 week holiday sometime before that - to be booked.
 * Linaro Connect Q2.12 - May 28 - June 1 - travel booked - hotel to be booked.
[ACTIVITY] Feb 27 - Mar 2
== GCC ==
 * Fixed mainline regression causing ICE in certain outer-loop vectorization cases.
 * Merged fwprop-subreg patch into Linaro GCC 4.7.
 * Completed patch to generate usat/ssat instructions where appropriate; checked into GCC mainline. Merge requests to Linaro GCC 4.6 and 4.7 pending.
 * Ongoing work on improving end-of-loop value computation.

Best Regards

Ulrich Weigand

--
Dr. Ulrich Weigand | Phone: +49-7031/16-3727
STSM, GNU compiler and toolchain for Linux on System z and Cell/B.E.
IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Martin Jetter | Management: Dirk Wittkopp
Registered office: Böblingen | Registration court: Amtsgericht Stuttgart, HRB 243294
[ACTIVITY] weekly status
Hi,

OpenEmbedded:
 * Worked on the meta-linaro layer and added libgcc and crosssdk recipes to satisfy some bitbake dependencies
 * I had to apply a few patches to build the Linaro toolchain the OE way (mostly gcc configury)
 * Successfully built the sato and Qt images
 * Moved on to test the February release of the Linaro binary toolchain and (probably) hit an issue with unaligned SD card images to be used with QEMU
   * the guest kernel fails with: attempt to access beyond end of device
   * /proc/partitions shows different block sizes (host vs. guest)
   * the image size gets calculated on the fly by OE
   * posted a patch that allows specifying a rootfs size alignment
   * not seen on trunk as they use IDE
 * Started to rebase the meta-linaro layer against current OE-core
 * Created https://wiki.linaro.org/KenWerner/Sandbox/OEMetaLinaroCard based on the existing card of David R.

Regards,
Ken