A9 Neon confusion
Hi, I've been looking at some basic libc routine optimisation and have a curious problem with memset; I wondered if anyone can offer some insights. Some graphs and links to code are at https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemset

I've written a simple memset in both with-Neon and without-Neon varieties and tested them on a Beagle (C4) and a Panda board. I'm finding that the Neon version is a bit faster than the non-Neon version on the Beagle but a LOT slower on the Panda, and I'd like to understand why - I'm guessing it's some form of cache interaction.

The graphs on that page are all generated by timing a loop that repeatedly memsets the same area of memory; the X axis is the size of the memset. Prior to the test loop the area is read into cache (I came to the conclusion the A8 doesn't write-allocate?). There are two variants of the graphs: an absolute set, in MB/s on the Y axis, and a relative set (below the absolute) that are relative to the performance of the libc routines. (The ones below those pairs are just older versions.)

If you look at the top-left graph on that page you can see that on the Beagle (left) my Neon routine beats my Thumb routine a bit (both beating libc). If you look at the top right you see the Panda performance, with my Thumb code being the fastest and generally following libc, but the Neon code (red line) topping out at about 2.5GB/s, which is substantially below the peak of the libc and ARM code.

The core loop of the Neon code (see the bzr link for the full thing) is:

    4:  subs    r4, r4, #32
        vst2.8  {d0,d1,d2,d3}, [r3:256]!
        bne     4b

while the core of the non-Neon version is:

    4:  subs    r4, r4, #16
        stmia   r3!, {r1,r5,r6,r7}
        bne     4b

I've also tried vst1 and vstm in the Neon loop and it still won't match the non-Neon version. All suggestions welcome; I'd also appreciate it if anyone can suggest which particular limit it's hitting - does anyone have figures for the theoretical bus, L1 and L2 write bandwidths for a Panda (and Beagle)?

Thanks in advance,

Dave
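For illustration, a minimal sketch of the kind of timing loop described above (this is an assumed reconstruction, not the actual harness - that lives in the bzr branch linked from the wiki page):

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define BUF_SIZE (1024 * 1024)
    #define ITERATIONS 10000

    static unsigned char buffer[BUF_SIZE];

    int main(void)
    {
        struct timespec start, end;
        volatile unsigned char sink;
        size_t size = 16384;   /* one point on the X axis */

        /* Read the area into cache before the timed loop. */
        for (size_t i = 0; i < size; i++)
            sink = buffer[i];

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERATIONS; i++)
            memset(buffer, 0, size);
        clock_gettime(CLOCK_MONOTONIC, &end);

        double secs = (end.tv_sec - start.tv_sec)
                    + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("%zu bytes: %.1f MB/s\n",
               size, (double)size * ITERATIONS / secs / 1e6);
        return 0;
    }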
[ACTIVITY] 2010-11-19
Short week. Finally got an external hard drive for my Beagle - makes it sanely possible to natively build things. Got eglibc cross built (thanks to Wookey for pointing me in the right direction with the magic incantation of dpkg-buildpackage -aarmel --target=binary) and it rebuilds easily. I have a version with the Neon variant of my memset built into it - it doesn't seem to make a noticeable difference to my ghostscript benchmark though. Pandas aren't likely to turn up until mid December; arranging to borrow an A9 is turning out to be difficult, but it looks like we should be able to get access to the one in the London datacentre - although it has a disc problem at the moment. I did manage to get a colleague to try my tests on his own Toshiba AC-10 (Tegra 2 - no Neon); the graphs had approximately the same shape as my previous Panda tests. Memchr looked pretty good on there. Also trying to look at the sign-off I need for various libc access.

Dave
Of instruction timings
Hi Richard, As per the discussion at this morning's call, I've reread the TRM and I agree with you about the LSLS being the same speed as the TST (1 cycle). However, as we agreed, the UXTB does look like 2 cycles versus the AND's 1 cycle. On the space vs. perf theme, one thing that would be interesting to know is whether there are any icache/issue-stage limitations; i.e. if I have a stream of 32-bit Thumb-2 instructions that are all listed as 1 cycle and are all in icache, can they be fetched and issued fast enough, or is there a performance advantage to short instructions?

Dave
[ACTIVITY] 2010-11-26
Hand crafted a simple strchr and compared it with libc's: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr Interestingly, it's significantly faster than libc's on A9s, but on A8s it's slower for large sizes. I've not really looked into why yet; my implementation is just the absolute simplest Thumb-2 version. Did some ltrace profiling to see what typical strchr and strlen sizes were, and got a bit surprised at some of the typical behaviours (lots of cases where strchr is being used in loops to see if another string contains any one of a set of characters, a few cases of strchr being called with NULL strings, and the corner case in the spec that allows you to call strchr with \0 as the character to search for). Trying some other benchmarks (pybench spends very little time in libc; package builds of simple packages seem to have a more interesting mix of libc use). Sorting out some of the red tape for contributing.

Dave
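Two of those behaviours, illustrated (hedged examples, not taken from the traces): the spec says strchr(s, '\0') must return a pointer to the terminating NUL, and the "set of characters" loop is essentially an open-coded strpbrk:

    #include <assert.h>
    #include <string.h>

    int main(void)
    {
        const char *s = "hello";

        /* Corner case: searching for '\0' finds the terminator. */
        assert(strchr(s, '\0') == s + strlen(s));

        /* The pattern seen in the traces: strchr in a loop to test
         * whether s contains any one of a set of characters - which
         * is exactly what strpbrk() does directly. */
        const char *set = "aeiou";
        int found = 0;
        for (const char *p = set; *p; p++) {
            if (strchr(s, *p)) {
                found = 1;
                break;
            }
        }
        assert(found == (strpbrk(s, set) != NULL));
        return 0;
    }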
Re: GCC Optimization Brain Storming Session
On 29 November 2010 00:18, Michael Hope wrote:
> To add to the mix, some ideas that are logged as blueprints:
> Using ARMv5 saturated instructions
> (https://blueprints.launchpad.net/gcc-linaro/+spec/armv5-saturated-ops)
> Using ARMv6 SIMD instructions
> (https://blueprints.launchpad.net/gcc-linaro/+spec/armv6-simd)

Those are quite nice instructions; certainly they seem useful for string ops of various types if misused creatively.

> Using ARMv7 unaligned accesses
> (https://blueprints.launchpad.net/gcc-linaro/+spec/unaligned-accesses)
> Changing the built-in memcpy to use unaligned
> (https://blueprints.launchpad.net/gcc-linaro/+spec/unaligned-memcpy)

The interesting challenge here is figuring out how expensive unaligned accesses are and whether the cost trade-offs are the same on different chips.

> The following areas have been suggested. I don't know if they're still valid:
>
> Register allocator: The register allocator is designed around the
> needs of architectures with a low register count and restrictive
> register classes. The ARM architecture has many general purpose
> registers. Different assumptions may give better code.
>
> Conditional instructions: The ARM and, to a lesser extent, Thumb-2
> ISAs allow conditional execution of instructions. This can be used in
> many situations to eliminate an expensive branch. The middle end
> expands and transforms branches. The ARM backend tries to recombine
> the RTL back into conditional instructions, but often can't due to the
> middle end transforms.

GCC is quite creative in avoiding branches by doing lots of masking and logic; it'll be interesting to see how much there is to gain here.

Dave
[ACTIVITY] 2010-12-03
* Benchmarking of simple package builds with various string routine versions; not finding enough difference above the noise to draw any large conclusions
* Looking at the string routine behaviour with perf to see where the time is going
- getting hit by the Linaro kernels on silverbell missing perf enablement in the config
- a useful amount of time does seem to be spent outside the main 'fast aligned' chunks of code
- pushing/popping registers does seem to be pretty expensive
* Started looking at libffi and hard float
- Started writing a spec: https://wiki.linaro.org/WorkingGroups/ToolChain/Specs/LibFFI-variadic
- It's going to need an API change to libffi, although the change shouldn't break any existing code on the platforms where it currently works.
* Helping with the image testing

Dave
Hard float chroot
Hi, As mentioned on the standup, I just got an armhf chroot going; thanks to markos for pointing me at multistrap. I put the following in armhfmultistrap.conf and ran:

    multistrap -f armhfmultistrap.conf

Once that's done, chroot in and then do:

    dpkg --configure -a

It's pretty sparse in there, but it's enough to get going.

Dave

==
[General]
arch=armhf
directory=/discs/more/armhf
cleanup=true
noauth=true
unpack=true
explicitsuite=false
aptsources=unstable unreleased
bootstrap=unstable unreleased

[unstable]
packages=
source=http://ftp.de.debian.org/debian-ports/
keyring=debian-archive-keyring
suite=unstable
omitdebsrc=true

[unreleased]
packages=
source=http://ftp.de.debian.org/debian-ports/
keyring=debian-archive-keyring
suite=unreleased
omitdebsrc=true
Silverbell
Hi, Those of you who use silverbell may be glad to know it's back up. Be a little careful: if you shovel large amounts of stuff over its network, the network tends to disappear. (Not sure if this is hardware or driver.)

Dave
[ACTIVITY] 2010-12-09
Mostly more work on libffi; swapping some ideas back and forth with Marcus Shawcroft, and it looks like we have a good way forward. Got an armhf chroot going and libffi built; got a testcase failing as expected. Trying to look at other processors' ABIs to understand why varargs works for everyone else. Cut through one layer of red tape; can now do the next level of comparison in the string routine work. Started looking at SPEC; hit problems with network stability on VExpress (turns out to be bug 673820). long long weekend; short weeks = 2; Back in on Tuesday.

Dave
Profile guided and string routines?
Does anyone have any experience of what can be profiled by the profile guided optimisations? One of the problems with some of the string routines is that you can write pretty neat, fast routines that work well for long strings - but most calls actually pass short strings, and the overhead of the fast routine means that in most cases you're slower than you would have been with a simple routine. If profile guiding could spot that a particular callsite to, say, strlen() was often associated with strings of at least 'n' characters, we could call a different implementation.

Dave
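As a hypothetical sketch of the idea (the routine names and the callsite are made up; the word-at-a-time loop uses the classic has-zero-byte test):

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Low-overhead byte loop: wins on short strings. */
    static size_t strlen_simple(const char *s)
    {
        const char *p = s;
        while (*p)
            p++;
        return (size_t)(p - s);
    }

    /* Word-at-a-time loop with more startup cost: wins on long strings. */
    static size_t strlen_fast(const char *s)
    {
        const char *p = s;
        while ((uintptr_t)p & 3) {      /* align to 4 bytes first */
            if (!*p)
                return (size_t)(p - s);
            p++;
        }
        const uint32_t *w = (const uint32_t *)p;
        while (!((*w - 0x01010101u) & ~*w & 0x80808080u))
            w++;                        /* no zero byte in this word */
        p = (const char *)w;
        while (*p)
            p++;
        return (size_t)(p - s);
    }

    int main(void)
    {
        /* The PGO idea: if profiling showed this callsite usually sees
         * strings of at least 'n' characters, direct it to the fast
         * version; short-string callsites keep the simple one. */
        const char *line = "a line long enough for the word loop to pay off";
        assert(strlen_simple(line) == strlen_fast(line));
        printf("%zu\n", strlen_fast(line));
        return 0;
    }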
[ACTIVITY] 2010-12-17
Got SPEC2006 building on Silverbell (VExpress) and Canis1 (Orion). There are still some issues; the builds are still going (6 hours so far on a 1GHz A9 for a build and 'test' case), and the Silverbell one has hit an ICE on one of the tests that looks like bug 635409, and also looks like it needs some help getting Perl to work. The build on Canis has only just started, but hasn't got Fortran installed. (The SPEC2006 tools build also failed in the Perl testsuite on sprintf.t and sprintf2.t, which seem to test integer overflow cases in sprintf % fields.) Added a few of the kernel string/memory routines and bionic routines into my string/memory graphs, and also ran the tests on the Orion board (similar to other A9 performance - no surprise). Wrote up a draft of an email to libffi-dev describing the varargs state; as I was doing it I realised that one of the approaches didn't quite work and was messier than I'd thought. Using rdepends to find all packages using ffi; need to figure out if any actually care about varargs.

Dave
[ACTIVITY] 2010-12-23
Continued looking at SPEC 2006. The two ICEs I mentioned last week are gone in the Natty version of the compiler; however the 4 programs that run and give wrong results still fail with the Natty version and the latest version from bzr. The 4 failures are:

h264ref - still fails on bzr 99447 with -O2 or -O0
sphinx3 - still fails on bzr 99447 with -O2 or -O0
gromacs - still fails on bzr 99447 with -O2 but works with -O1; I've followed this through and detailed it in bug 693502; it looks to me like a post-increment gone wrong (it's split, so it's not actually a post-increment, and the original rather than the post-incremented value gets used)
zeusmp - this fails to load the binary; it's got a >1GB bss section. Interestingly it gets further on my Beagle, which has less memory but a bit of swap, even though I think it's not really using all of the BSS in the config I'm using.

I'm hoping to leave a 'ref' run going over the new year. The canis1 Orion board I was also running SPEC on last weekend died during the run and hasn't come back.

perf: We now have silverberry using the -proposed kernel, which has the fixed PERF_EVENT config, and perf seems to work fine.

libffi: I've started building the page https://wiki.linaro.org/WorkingGroups/ToolChain/FFIusers listing things that use FFI (generated by a bit of apt wrangling). There are basically 3 sets: a) apps that just use ffi for something specific; b) languages that then give the users of those languages varying degrees of freedom; c) Haskell - while some of the packages are probably genuine ffi users, I think a lot of these are false dependencies; almost every Haskell package seems to gain a dependency on libffi directly.

I'm back on the 4th January.

Dave
[ACTIVITY] 2011-01-07
Got h264ref working in SPEC; this was another signed-char issue (a very difficult one to find - it didn't crash, it just came out with a subtly wrong result); compiling that and sphinx3 with -fsigned-char makes them happy. With Richard's fix for gromacs, that leaves just the zeusmp binary that's too large to run on silverberry; it seems to start up on canis1 (which has more RAM). So that should be a full set; I just need to get the Fortran stuff going on canis1. Kicked off a discussion on libffi-discuss about variadic calls; people seem OK with the idea of adding it (although unsure exactly how many things really use it - even though there are examples in the Python documentation). Also kicked off some of the internal paperwork to contribute code for it.

Dave
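For background, an illustrative example (not the actual SPEC code): plain char is unsigned by default on ARM, so code written assuming x86's signed char can quietly compute different values - exactly the kind of thing -fsigned-char papers over:

    #include <stdio.h>

    int main(void)
    {
        char c = 0xFF;   /* -1 if char is signed, 255 if unsigned */

        if (c < 0)
            printf("char is signed: %d\n", c);
        else
            printf("char is unsigned: %d\n", c);

        /* The nasty version: a char used as a signed offset or table
         * index computes 255 instead of -1 on ARM, so you get subtly
         * wrong output rather than a crash. */
        int offset = c;
        printf("offset = %d\n", offset);
        return 0;
    }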
[ACTIVITY] 2011-01-14
Got a complete run of SPEC train on the Orion board - all working. Kicked off a SPEC ref run - canis1 died. Gathered a full set of 'perf record's for all of SPEC on silverbell and had a quick look through them; there aren't too many surprises, though there are a few things that might be worth a look. (Not as much time in libc functions as I hoped.) There are some odd bits - chunks of samples landing apparently outside libraries, where it isn't obvious what's going on. Sent a tentative patch for the Thumb perf annotate issue (bug 677547) to lkml for comments. Started on the libffi variadic fixing. Caught the qemu pbx-a9 testing from PM; got qemu built and am getting a handful of lines of output from both a kernel from arm.com's site and a linaro-2.6.37 that I built for it.

Dave
[ACTIVITY] 2011-01-21
= A9 QEmu =
I've spent most of the week looking at QEmu emulation of an SMP A9. The model (of a realview-pbx-a9) doesn't have any working block IO; I spent some time trying to get SD working and got part way, but fell back to using an NFS root. It seems to work OK for basic CPU emulation, and SMP 'works' in the sense that the guest sees multiple CPUs; however QEmu is restricted to using only one host CPU core for multiple guest CPUs, so it's of limited help in debugging SMP code. Video doesn't seem to work either. To get to that point it does need a bunch of patches to QEmu, most of which Peter Maydell already knows of; I've put some notes here: https://wiki.linaro.org/Internal/People/DaveGilbert/QEMUA9SMP Note that the realview-pbx doesn't currently have a Linaro hardware pack; I used a kernel from ARM's website and a 2.6.37 I built myself.

= SPEC =
SPEC ref got quite a long way through on Canis (with half-duplex ether); however 'lbm' failed when running in ref mode (while having worked in test and train), giving different output; it takes quite a long time to fail.

= Perf =
I sent an updated version of my patch for perf's Thumb annotation upstream. And apparently we have a Panda on the way.

Dave
Re: Neon Support in kernel
On 26 January 2011 12:12, Dave Martin wrote:
> Hi Vijay,
>
> On Sat, Jan 22, 2011 at 9:59 AM, Vijay Kilari wrote:
>> Hello Dave,
>>
>> Thanks for this info.
>>
>> I have few more queries after looking at the results of memset on A9 & A8.
>> I agree that externel bus speed matters in comparision across platforms.
>>
>> 1) Why memset is performance is good on A8 than A9?. any justification?
>
> I've CC'd the linaro-toolchain list who have been working on this
> topic and may be able to provide you with more information.

Unfortunately we don't know why Neon was a bad idea for memset etc. on the A9; the tests just show it being worse, and the advice we get says to avoid it - we simply don't have an explanation. The test code is trivially simple for the cases I tried.

Dave
[ACTIVITY] 2011-01-28
SPEC: Tried to track down what was going on with lbm; it doesn't seem to be repeatable on canis1. I'd previously seen it fail at -O1 and work at -O0, and tried to chop down the flags between the two; but after adding all the flags back in on top of -O0 it still worked, and then I tried -O1 again and it worked. Going to try another machine, but it might be uninitialised data somewhere.

Panda: Our Panda arrived; it's now happily nestling near our Beagles and running the 0126 headless snapshot (with the 0127 hwpack). It seems fine except for rather slow USB and SD IO. Tip: Pandas do absolutely nothing (no LEDs, no serial console activity) unless you put in an SD card with the firmware on it.

Libffi: Wrote the changes for armhf. Tested on arm, armhf, i386, ppc and s390x - all happy. (Not too surprisingly, variadic calls just work on everything other than armhf without the API change.) Mailed the Python ctypes list asking how much of a pain the API change will be and for any hints on what might be affected. Awaiting sign-off for submission of the code.

Optimised library routines: Looked at benchmarking 'git'; I'd seen previous discussions where it had been pointed out that it spends a lot of time in library routines, and indeed it does spend useful amounts in memchr, memcpy and friends; a simple git diff v2.6.36 v2.6.37 > /dev/null of the current kernel tree produces a useful ~25 second run. One interesting observation concerns the variation in the times reported by 'time' - i.e. user, system and real: the variation in user+system is much less than in either user or system individually, and is quite stable (within ~0.7% over 10 runs). I've just tried preloading my memchr routine in, and it does get a consistent 1-1.2% improvement, which does look above the noise. Also asked on the libc-help list for suggestions of other benchmarks people actually trust to reflect useful performance increases in core routines, as opposed to totally artificial ones.

Dave
IT block semantic question
Hi, What do people understand to be the expected semantics of IT blocks in the cases below, about which there has been some confusion in relation to a recent Qt issue? The code in question had a sequence something like:

    comparison
    IT... EQ
    blahEQ
    TEQ
    BEQ

The important bits here are that we have an IT EQ block and two special cases:

1) There is a TEQ in the IT block - are all comparisons in the block allowed, and do their effects take effect immediately? As far as I can tell this is allowed and any flag changes are used straight away;

2) There is a BEQ at the end of the IT block; as far as I can tell, as long as the destination of the BEQ is close, it shouldn't make any difference whether the BEQ is included in the IT block or not.

Does that match everyone else's understanding?

Dave
Re: IT block semantic question
On 2 February 2011 10:47, Dave Martin wrote:
> On Tue, Feb 1, 2011 at 12:33 PM, David Gilbert wrote:
>> Hi,
>> What do people understand to be the expected semantics of IT blocks
>> in the cases below, of which there has been some confusion
>> in relation to a recent Qt issue.
>>
>> The code in question had a sequence something like:
>>
>> comparison
>> IT... EQ
>> blahEQ
>> TEQ
>> BEQ
>>
>> The important bits here are that we have an IT EQ block and two special
>> cases:
>>
>> 1) There is a TEQ in the IT block - are all comparisons in the block
>> allowed and do their effects immediately take effect? As far as I can
>> tell this is allowed and any flag changes are used straight away;
>
> Yes; yes; and: you're right. This was a specific intention, since
> there was always a common idiom on ARM of sequences like this:
>
> CMP r0, #1
> CMPEQ r1, #2
> CMPEQ r2, #3
> BEQ ...
>
> with the effect of "if(r0==1 && r1==2 && r2==3) ..."
>
>> 2) There is a BEQ at the end of the IT block, as far as I can tell,
>> as long as the destination of the BEQ is close it shouldn't
>> make any difference if the BEQ is included in the IT block or not.
>
> Again, I believe you're right. The assembler will generate different
> code, because the explicit conditional branch encodings are not
> allowed in IT blocks. But the assembler takes care of this for you:
>
> :
> 0: d001 beq.n 6
>
> versus
>
> 2: bf08 it eq
> 4: e7ff beq.n 6
>
> 0006 :
>
> Both snippets are equivalent, though as you say, with IT you can
> insert more code between the branch and its destination before the
> assembler will barf with a fixup overflow, because the unconditional
> branch encoding (e000..e7ff) has more bits to express the branch
> offset.

Thanks for the confirmation Dave.

Dave
Re: do we consider extravagant memory usage a gcc bug?
On 2 February 2011 12:28, Peter Maydell wrote:
> On 2 February 2011 11:54, Peter Maydell wrote:
>> ie gcc wants (at least) 100M of RAM trying to compile a 190K sourcefile.
>> (and probably more overall since the board has 500MB RAM total and
>> it hit the out-of-memory condition).
>
> On a rerun which hit the kernel OOM-killer the kernel said:
> Killed process 5362 (cc1) vsz:480472kB, anon-rss:469676kB, file-rss:88kB
>
> so gcc's claim that it only wanted 100MB is underreading rather.

480MB does appear excessive; to be a little fair to gcc, that file does look like it's trying to build itself as a vast inlined set of switch statements, so it will be stressing the compiler. Is this 480MB much more than on x86, or than with older versions of the compiler?

Dave (resend, remembering to hit reply-all)
[ACTIVITY] 2011-02-04
== String routines ==
* After some discussions about IT semantics, managed to shave a couple of instructions out of a couple of routines
* Got around to trying a suggestion made some months ago, that LDM is faster than LDRD on A9s; and indeed it does seem to be in some cases, though those cases seem pretty hard to define - LDM is no slower than LDRD, so it seems best to avoid LDRD.
* Digging around eglibc's build/configure system to see how to add assembler routines that only get used under certain build conditions (i.e. v7 and up)

== SPEC ==
* Compiled lbm at -O2 and ran it on our local Panda and on Michael's ursa1 - it seems happy (with a drop of swap); so I'd say that confirms the issues I previously had were local to something on canis. That's a bit of a pain, since canis is the only machine with enough RAM to run the rest of the suite.

== Other ==
* Tested a headless Alpha-2 install on our Beagle C4 - mostly worked
* Tested the qemu-linaro release on the realview-pbx kernel/NFS setup I had
* A simple smoke test for pldw on qemu
* Tripped over ltrace not working while trying to profile git's use of memcpy and memcmp; git does some _very_ odd things; its predominant memcpy size seems to be 1 byte.
[ACTIVITY] 2011-02-11
== String routines ==
* Copied an improvement I'd previously made to memchr (removing a branch using a big IT block) across to strlen
* Modified the benchmark setup to build everything as a library, so everything fairly gets a PLT overhead.
* Pushed the optimised memchr and strlen and the simple strchr into the cortex-strings bzr repo
* Patched eglibc to use the memchr and strchr code - although currently fighting to get an appropriate .changes file

== ffi ==
* Kicked off the TSC request for license permissions

== bugs ==
* Built and recreated the qt4-x11 bug, produced all the dumps and boiled it down to a few lines of suspicious RTL for Richard.

** Away next week.
[ACTIVITY] 2011-02-25
== ffi ==
* Sent the variadic patch for libffi to libffi-discuss
* Worked through some suggestions from Chung-Lin; need to do some rework

== string routines ==
* memchr & strchr patch sent for inclusion in the Ubuntu packages
* Tried sqlite's benchmarks - they don't spend too much time in the C library, although a few % in memcpy and ~1% in memset (also seem to have found an sqlite test case failure on ARM, filed as bug 725052)

== porting jam ==
* There wasn't much traffic on #linaro related to the jam during this
* I closed bug 635850 (fastdep FTBFS), which was already fixed with an explicit fix for ARM in the changelog, and bug 492336 (eglibc's tst-eintr1 failing), which seems to work now, though it's not clear when it was fixed.
* Looking at eglibc's test log, there seem to be a bunch of others that are failing and may well be worth investigating.
* Bug 372121 (qemu/xargs stack/number-of-arguments limit) seems to work OK; however the reporter did say it was quite a fragile test; that needs more investigation to see whether the original cause has actually been fixed.

== misc ==
* Swapping notes with Peter on the PBX SD card investigation

Dave
Re: Getting linaro toolchain binaries
On 2 March 2011 18:44, John Rigby wrote:
> FWIW, Michael's recipe for building here
> https://code.launchpad.net/~michaelh1/+junk/cross-build has worked
> well for me.

It's OK for people familiar with toolchains; the problem is where you have a subject specialist who knows how to write C or ARM assembler but really doesn't know anything about toolchains or wrangling build tools; they just need something that works out of the box.

Dave
[ACTIVITY] 2011-03-04
* Investigated and fixed the sqlite3 testsuite failure on ARM (bug 725052)
* Discussing libffi API changes with the maintainer; hopefully he's going to send out his comments today.
* Looking at how to upstream the string routine changes
* Need to look at big endian testing
* Testing a QEmu pre-release for Peter; looking very nice.

Dave
[ACTIVITY] 2011-03-10
== hard-float ==
* Updated the libffi variadic patch and sent it to the ffi mailing list.

== String routines ==
* Got a big endian build environment going
* Patched up memchr and strlen for big endian; it turned out to be a very small change in the end. Tested on qemu-armeb - note that it didn't work on an older version, but did on a newer one; I'll assume the newer one is correct.
* Fixed a couple of build issues in the cortex-strings test harness

== Other ==
* Kicked off a SPEC2006 train run on canis using the 2011.03 compilers

I'm on holiday tomorrow (Friday) and Monday.

Dave
[ACTIVITY] 2011-03-17
Short week.
* libffi patch accepted upstream
* eglibc integration of the string routine changes - I have something that works, but it's more complex than I'd like (to get it to fall back to the C code for the cases I haven't optimised).
* Trying a Neon memchr; tried a really simple 8-bytes-per-loop version - it's quite slow on both A8 and A9; branching on the result of comparisons done in Neon is not simple.
* Porting jam: bug 735877, chromium using d32 float; it was passing vfpv3 rather than using the default when configured without Neon.

On holiday tomorrow (Friday).

Dave
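A sketch of the branching problem in C with NEON intrinsics (hedged - the version I tried is assembler; this just shows the shape): to branch on a NEON comparison, the result has to come back to a core register, and that NEON-to-ARM transfer is where the time goes:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <string.h>

    /* Compare 16 bytes at a time, then branch on whether anything
     * matched. Build with -mfpu=neon. */
    void *memchr_neon_sketch(const void *src, int c, size_t n)
    {
        const unsigned char *p = src;
        uint8x16_t vc = vdupq_n_u8((unsigned char)c);

        while (n >= 16) {
            uint8x16_t eq = vceqq_u8(vld1q_u8(p), vc);

            /* The awkward bit: moving the comparison result out of
             * NEON so the ARM side can branch on it. */
            uint8x8_t any = vorr_u8(vget_low_u8(eq), vget_high_u8(eq));
            if (vget_lane_u64(vreinterpret_u64_u8(any), 0)) {
                for (int i = 0; i < 16; i++)      /* find the byte */
                    if (p[i] == (unsigned char)c)
                        return (void *)(p + i);
            }
            p += 16;
            n -= 16;
        }
        return memchr(p, c, n);   /* handle the tail bytewise */
    }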
[ACTIVITY] 2011-03-25
== String routines ==
* Wrote a Thumb optimised strchr - as expected it's got nice performance for longer runs, but at sizes <16 bytes it's slower, and a lot of strchr calls are very short, so it's probably not of benefit in most cases ( https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr?action=AttachFile&do=get&target=panda-01-strchr-git44154ec-strchr-abs.png )
* Wrote a Neon memcpy - as previously found with memset, it performs well on A8 but poorly on A9. It does, however, handle the case where the source/destination isn't aligned quite well, even on A9; the vld1 unaligned case works with relatively little penalty. (It performs comparably to the Bionic implementation - mine is a bit faster on shorter calls, Bionic is better on longer ones; I think that's because they've made some careful use of preloads where I have so far got none.)

I'm on holiday up to and including 5th April.

Dave
Tegra2 errata bug
https://bugs.launchpad.net/ubuntu/+source/eglibc/+bug/739374 is the bug relating to the Tegra errata mentioned today.

Dave
[ACTIVITY] 2011-04-08
- Back from holiday, short week.

== Porting jam ==
* We seem to have picked up a lot of FTBFSes in the last couple of weeks - which is unfortunate because it may well be too close to the Natty release to do anything about them
* Bug 745843 is a repeatable segfault in part of the build process of a package called vtk that is used by a few other things; I've got this down to a particular call of one function - although gdb is getting rather confused (r0 & r1 changing as I single-step across a branch)
* Bug 745861: petsc build failure; I'm getting one of two different link errors depending on which mood it's in - MPI related?
* Bug 745873 - a meta package that just didn't have a list of packages to build for armel; easy to do a simple fix (provided a branch that built), but the maintainer says it's too late for Natty anyway and some more thought is needed.

== Other ==
* Reading over some optimisation documents
* Tested the weekly release on the Beagle C4 (still no OTG USB and hence no networking for me)
* Also a simple boot test on the Panda; not much time for a more thorough test (seems to work).

Dave
[ACTIVITY] 2011-04-15
== Bug triaging ==
* Bug 745843 (vtk ftbfs): got it down to a bad ARM/Thumb transition - identified as a linker error and handed off to RichardS
* Bug 758082 (augeas ftbfs): tracked it down to the overwrite of a parameter in a variadic function before it got stacked; identified by Ramana as another instance of the shrink-wrap bug.
* Bug 745861 (petsc ftbfs): isolated the collection of different MPI related problems this is hitting; really need to find an MPI expert for this
* Bugs 745863 & 745891 (ftbfs's) - both were compilations that timed out; verified this was due to using lots of RAM, and that they also use lots of RAM on x86 (> ~500MB) - marked as invalid until the build farm grows more RAM
* Bug 757427: gconf segfault - failed to reproduce under various tests (although Michael has now managed to catch it in the act)

== Optimisation ==
* Neon memcpy tweaking: added prefetches and unrolled the core loop - now comparable perf to the Bionic memcpy in most cases (slower on misaligned destinations, faster in the other cases)
* Tweaked latrace to print the address/length of argument strings so I can get some stats on routine usage.

Dave
[ACTIVITY] 2011-04-21
== String and Memory routines ==
* Profiled denbench with perf and produced a set of stats showing which programs spent how much time in libc and how much time was spent in each routine. While some of the benchmarks are good (like aes) and spend almost no time in libc, some of the others (the MPEG codecs especially) seem to spend significant time in libc.
* Ran all of denbench through latrace to generate sets of library calls; post-processed them to extract the section between the clock() calls (and hence in the timed portion) and analysed the hot library calls. I've looked at some of the output but not all of it yet; I get output like:

    Memcpy stats (dst align/src align/length/number of occurrences/total size copied)
    memcpy: 0,0,1      , 1588520, 1588520
    memcpy: 16,28,4096 ,       1, 4096
    memcpy: 4,20,16384 ,     855, 14008320

This shows that a bunch of tests do an inordinate number of 1-byte memcpy's, and a few hundred larger memcpy's with a destination address %32 of 4 (and a source %32 of 20) - so not aligned, but at least equally misaligned.
* Started writing up a report on some of the stats
* Also started to try to extract the same stuff from SPEC2k6

== QEMU ==
* Tested Peter's QEmu release earlier in the week (on Lucid, so didn't hit his Natty bug)
* Wrote up a couple of specs (one for TrustZone and the other for Device Tree integration)
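One hedged implication of counts like these (a sketch, not one of the actual routines): with 1-byte calls dominating by count, the entry overhead matters more than the big aligned loop, so a memcpy wants a cheap small-size path before it pays for any alignment analysis:

    #include <stddef.h>

    void *memcpy_sketch(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        /* Tiny path first: a plain byte loop, no alignment checks,
         * because traces like the above show these calls dominate. */
        if (n < 4) {
            while (n--)
                *d++ = *s++;
            return dst;
        }

        /* The alignment analysis and wide-transfer main loop would
         * only start here; a byte loop stands in for it. */
        while (n--)
            *d++ = *s++;
        return dst;
    }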
[ACTIVITY] 2011-05-06
== Bug fighting ==
* Tracked bug 774175 (apt segfault on armel on Oneiric) down to the Cortex-A8 branch erratum bug that we found as part of the bug jam a few weeks ago (affecting the more obscure vtk package) - Richard's existing binutils fix should fix this.

== String routines ==
* Struggled to get 'perf' to produce sane results from profiling SPEC; some of the samples are obviously being associated with the wrong process somewhere along the line (e.g. it's showing significant samples in the sh process, but in a library that's used by the actual benchmark).
* latrace on SPEC still running on ursa2
* Wrote a non-Neon memcpy; as expected, its aligned performance is very similar to libc/kernel - it's a bit faster in some places but slower in some odd ones (e.g. n*32+1 bytes is a lot slower for some reason). It's also really bad in the misaligned cases; I tried to take advantage of v7's ability to do misaligned loads - but they really are quite slow.

Dave
Re: [ACTIVITY] 2011-05-06
On 8 May 2011 13:55, Hakehuang wrote:
> Can there be something using pragma option to disable neon for each function?

I don't think there is a pragma like that for ARM at the moment; GCC does seem to have a #pragma GCC target and also function attributes for target options, but at the moment these are documented as only being usable on x86 (where they are used to turn things like SSE on and off). What is your use case?

Dave
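For reference, the x86 form looks something like this (GCC-specific, and at the time of writing the ARM backend doesn't accept the equivalent):

    /* Per-function target options - x86 only at the time of writing. */
    __attribute__((target("no-sse3")))
    void scalar_version(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * 2.0f;
    }

    #pragma GCC target("sse4.1")
    void vector_version(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * 2.0f;
    }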
[ACTIVITY] 2011-05-13
== String routines ==
* Gave up on perf on silverbell and redid it on ursa2; I now have a full set of perf figures and have updated the workload report to show the SPEC binaries that spend significant time in libc and the routines they spend it in; a handful of tests spend very significant amounts of time in libm.
* Have ltrace results from about 75% of SPEC - some of the others are fighting back a bit
* Optimised the non-Neon memcpy; it's now quite respectable except in one or two cases (2 bytes misaligned; and, for some odd reason, source offset by 8 bytes and destination by 12 is way down on any other combination) (Current result graphs here: https://wiki.linaro.org/Internal/People/DaveGilbert?action=AttachFile&do=get&target=results-2011-05-13-panda-69321a21.pdf )

Dave
[ACTIVITY] 2011-05-20
* Still profiling SPEC 2k6; about 3/4 of the latrace files are generated, but some of them take hand-holding (e.g. one benchmark makes millions of calls to a library function we're not interested in, generating a huge log, so that function needs excluding).
* Working through the ones I have with analysis scripts and writing the interesting things up.
* Submitted an ARM test suite fix for latrace (an unsigned char-ism)
* Verified that Richard's binutils fix in natty-proposed fixed the vtk FTBFS
* Blueprint for 64-bit sync primitives.

Dave
[ACTIVITY] 2011-05-27
== String routines ==
* Finally finished the ltrace analysis of the whole of SPEC 2k6 and have written it up - I'll proofread it next week and then send it out to the benchmark list.
* Ran memset and memcpy benchmarks at larger-than-cache sizes on A9
* memcpy at larger-than-cache sizes (or probably mainly cache-miss data) does come back to Neon winning over ARM; my suspicion is that with cache hits we run out of bandwidth with Neon, but that doesn't happen in the cache-miss case; why it's faster in that case I'm not sure yet.
* memset is still not faster with Neon, even at large sizes where the destination isn't in the cache.

== Other ==
* Started looking at 64-bit atomics
* Looking at the pot of QEmu work with Peter.

Dave
[ACTIVITY] 2011-06-03
== String routines ==
* Wrote a hybrid ARM/Neon memcpy - it uses Neon for non-aligned cases or large (>=128k) ones
* Polished up and sent out the write-up of the workload analysis of denbench and SPEC
* Ran denbench with all the memcpy and memset variants and graphed the results
- SPEC 2k6 is now cooking with the memcpy set - it'll take all weekend.

== 64 bit atomics ==
* Started looking through the existing non-64-bit atomic code in the GCC backend; I need to understand how registers work in DI mode and what's going to be needed in terms of temporaries.

Dave
[ACTIVITY] 2011-06-10
== String Routines ==
* Completed gathering the SPEC2k6 memcpy results, graphed them, sent them out
* Gathered the SPEC2k6 memset results, graphed them, sent them out

== 64bit Atomics ==
* Modified the gcc backend to do 64-bit atomic ops - the code looks good, but I've not done much testing yet.

== Other ==
* Upstreamed a small ltrace patch

Next week: the plan is to get the gcc tests done and attack libgcc for the pre-v7 fallbacks (the tricky bit there is deciding at runtime what to use). Also run SPEC and denbench for strlen and some other string routines.
Re: Interesting failure of vasprintf test from postler
On 16 June 2011 19:37, Marcin Juszkiewicz wrote:
> On czw, 2011-06-16 at 20:31 +0200, Marcin Juszkiewicz wrote:
>> There is a ftbfs on armel bug for postler:
>> https://bugs.launchpad.net/ubuntu/+source/postler/+bug/791319
>>
>> Attached test compiles fine on amd64 but fails on armel:
>>
>> 20:16 hrw@malenstwo:postler-0.1.1$ gcc _build_/.conf_check_0/test.c
>> _build_/.conf_check_0/test.c: In function ‘main’:
>> _build_/.conf_check_0/test.c:5:1: error: incompatible type for argument
>> 3 of ‘vasprintf’
>> /usr/include/stdio.h:396:12: note: expected ‘__gnuc_va_list’ but
>> argument is of type ‘char *’
>>
>> Can someone explain me why this happens?
>>
>> Ubuntu armel and armhf cross compilers, CSL 2011.03-42 have same
>> problem.
>
> Compiled fine with gcc 4.3 from Poky toolchain. Same with Emdebian
> gcc-4.3 cross compiler.

It looks more to me like the old compilers are the ones with the problem and the new one is correctly throwing an error - vasprintf is supposed to take a va_list there; why would it take a string?

Dave
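For reference, the shape a correct caller has to have (vasprintf is a GNU extension; it consumes a va_list, so callers taking '...' must convert with va_start):

    #define _GNU_SOURCE
    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Wrap the '...' into a va_list; passing a char* as the third
     * argument is what the failing test above was doing. */
    static char *xasprintf(const char *fmt, ...)
    {
        char *out = NULL;
        va_list ap;

        va_start(ap, fmt);
        if (vasprintf(&out, fmt, ap) < 0)
            out = NULL;
        va_end(ap);
        return out;
    }

    int main(void)
    {
        char *s = xasprintf("%s-%d", "test", 42);
        puts(s ? s : "(allocation failed)");
        free(s);
        return 0;
    }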
[ACTIVITY] 2011-06-17
== 64 bit Atomics ==
* Wrote more test cases; I now have a nice 3-thread test that passes - and, more importantly, fails if I replace one of the atomic ops with a non-atomic equivalent.
* Modified the existing atomic helper code in libgcc to do 64-bit
* Added an init function to the 64-bit atomic helper to detect the presence of a new enough kernel and fail if an old one is present. That last one is a bit of a pain: it now correctly exits and aborts on existing kernels, but qemu user mode segfaults because the access to the kernel helper version address is uncaught. So the first thing I need to do is try the early kernel patch Nicolas sent around, and then I really need to see if qemu can be firmly persuaded to run it.

== String routines ==
* Ran denbench with the sets of strlen; started running some SPEC as well.

== QEmu ==
* Tested Peter's pre-release tarball in user mode and a bunch of system emulations - successfully managed to say hello to #linaro from an emulated Overo board using a USB keyboard.

== Other ==
* Booked the 4th July week off.

Dave
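A minimal sketch of that kind of test (hedged - the real test cases travel with the patch set; this one uses the __sync builtins the patches wire up for 64-bit types):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS   3
    #define PER_THREAD 1000000

    static long long counter;   /* 64-bit: ldrexd/strexd territory on ARM */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < PER_THREAD; i++)
            __sync_add_and_fetch(&counter, 1);
        /* Replacing the atomic op above with a plain 'counter++'
         * makes the final check fail, which shows the test really is
         * exercising the atomic path. */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        long long expected = (long long)NTHREADS * PER_THREAD;

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        printf("%lld (expected %lld)\n", counter, expected);
        return counter == expected ? 0 : 1;
    }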
[ACTIVITY] 2011-06-24
== Atomics ==
* Testing the libgcc fallback code with Nicolas's kernel patch - and then fixing my initialisation code to use init_arrays (thanks Richard for the hint)
* Tidying stuff up after a review of my patch by Richard - the sync.md is now smaller than the original was before I started.
* Discussing sync semantics with Michael Edwards - he's spotted that the gcc ARM sync routines need to move their final memory barrier for the compare-exchange case where the compare fails.
* Looking at valgrind; it looks like it should be OK with the commpage changes, but it doesn't currently support ldrexd and strexd; there is a patch for ARM mode but nothing for Thumb yet.

Dave
[ACTIVITY] 2011-07-01
== 64 bit atomics ==
* Submitted patches to the gcc patch list - one comment back already, asking if we should really change ARM to have a VDSO to make checks of the user helper version easier
* Added Thumb ldrexd/strexd to valgrind; patch added to the bug in their bugtracker (KDE bug 266035)
* Came to the conclusion eglibc doesn't actually need any changes
- It's got a placeholder for a 64-bit routine, but it's unused and isn't exposed to libc users
- Note that neither eglibc nor glibc built cleanly from trunk on ARM.
* Started digging into QEmu a bit more to find out how to solve the helper problem

== String routines ==
* Added the SPEC2k6 string routine results to my charts; while most stuff is in the noise, it seems the bionic routine is a bit slower overall than everything else, and my absolutely trivially simple ~5 instruction loop ties for fastest with my smarter 4-bytes-per-loop version using uadd.

== Next week ==
* Sleep, rest, relaxation, getting older
* (Will be polling email for any more follow-ups on my gcc patches)

Dave
[ACTIVITY] 2011-07-15
== String routines ==
* Sent a patch to libc-ports with modified configure scripts to add subdirectories for architecture-specific ARM code, plus the memchr.S from cortex-strings.

== 64 bit atomics ==
* Working through comments on my patches and the set of discussions about the kernel interface for the helper case - not really sure which way that's going to go.

== QEmu ==
* Looking at how tracing works; considering adding tracing to the SD card code to help track down some of the SD card issues.

Dave
[ACTIVITY] 2011-07-22
== 64 bit atomics ==
* Updated the gcc patches as per comments from Ramana and Joseph; a build is currently cooking on the Panda

== Qemu ==
* Testing Peter's pre-release; found a bug on Beagle (which he tracked down to an x-loader change)
* Found the cause of the occasional SD card errors I was seeing ('SD: CMD12 in a wrong state'); I'll cut a patch next week, but the bug is that writing the last sector throws an error and also leaves the card in the wrong state
* Added a bunch of tracing code to the SD card layer
* With the tracing code, and the other bug fixed, I'm starting to understand how it works - and half a dozen reasons the emulation is really slow; whether that's the cause of the reported recoverable lock-ups under load is an interesting question; I plan to fix the obvious problems and see how it goes.

Dave
[ACTIVITY] 2011-07-29
== 64 bit atomics ==
* Sent the updated set of 64-bit atomic patches to the gcc list with fixes from the previous review
* Started hunting for users of 64-bit atomics other than membase; jemalloc, SDL and Boost lockfree look like possibilities, but I've not looked at them hard yet

== QEmu ==
* Released a fix for the last-SD-card-block access error
- Vincent Palatin released a bunch of SD card fixes a few hours later that included a fix for the same bug; it does look like he has a bunch of other stuff we should keep sync'd with.
* Changing the caching mode to writeback on the block layer fixes bug 732223 (hangs on heavy IO) - it goes from 130KB/s to 8MB/s on vexpress
- Asked the mailing list whether that's reasonable to make the default for SD
* Looking at the path from CPU to MMC/SD card - the DMA on OMAP is pretty inefficiently emulated, but the soc_dma code has an unused special case for DMAing to hardware; it looks promising, but I need to figure out how to use it and whether it works.
* Comparing Vincent's SD card patch with the earlier MeeGo patches; partial overlap.

== Other ==
* Pinged libc-ports for comments on my optimised memchr patch
* Image testing

Next week I intend to be in Camborne on the afternoons of Monday, Wednesday and Friday.

Dave
[ACTIVITY] 2011-08-05
== QEMU ==
* After discussion with Peter, started writing the QEMU fixup for the 64-bit atomic helper version location.
* Sent fixes for the soc-dma code to the qemu list
* Trying to understand just how much of omap_dma's code is needed.

== Other ==
* Travelling to/from Connect
* Wanted to dial into some of the sessions in the Corpus and Magdalene rooms, but the remote audio from them was unusable.

Dave
[ACTIVITY] 2011-08-12
== QEMU ==
* Finished off a first cut of the 64-bit helper patch to QEMU
- Gave it to Peter and have reworked most of the things he commented on
* This also led into a bit of a rabbit hole of finding various generic QEMU threading issues
* Tested Peter's 11.08 QEMU release (I used linaro-fetch-image-ui for the first time to grab the release images; quite nice - hit a couple of issues, but much nicer than crawling around the site to find where the hwpacks are).

== Other ==
* Pinged the gcc-patches list for more comments on the 64-bit atomic patch

I'm on holiday the week of the 22nd (i.e. the week after next).

Dave
Re: Is the Linaro toolchain useful on x86/x86_64?
On 17 August 2011 12:09, Bernhard Rosenkranzer wrote:
> Hi,
> is the Linaro toolchain (esp. gcc) useful on x86/x86_64, or is an
> attempt to use the Linaro toolchain with such a target just asking for
> trouble?
> (No, I'm not secretly an Intel spy ;) Just trying to have some fun
> with my desktop machine ;) )

I believe the idea is that while we don't work to improve x86(_64), we shouldn't break it.

Dave
[ACTIVITY] 2011-08-19
== String routines ==
* Working through updating my eglibc patch for memchr; I think I'm nearly there - it took way too long to persuade an eglibc 'make check' to work cross (can't get a native build happy).

== QEMU ==
* Sent a new version of my QEMU patch for the atomic helpers to Peter.
* Tested the Android beagle image on a real Beagle - it fails in pretty much the same way as the QEMU run.

== Other ==
* Had a brief look at bug 825711 (scribus ftbfs on ARM) - this is Qt being built to define qreal as float on ARM when it's double on most other things, scribus having a qreal variable and something else defined as a double, and then passing both to a template that requires two arguments of the same type; not really sure which one is to blame here!

I'm on holiday next week.

Dave
[ACTIVITY] 2011-09-02
== QEmu ==
* Sent the 64-bit atomic helper fix upstream
* Basic boot time and simple benchmarks versus the Panda board
* Tested the prebuilt images and Peter's latest post-merge QEmu tree
- The full Ubuntu desktop on an emulated Overo is a bit slow - it's rather short on RAM
- The full Ubuntu desktop on an emulated VExpress isn't bad; it's got the full 1GB (with a particularly grim line of awk to mount vexpress images, based on Peter's suggestion of using 'file')

== String routines ==
* Pushed memcpy and memset up to the cortex-strings bzr
* Working through the memset issue with Michael
- Made my code a little less sensitive to initial alignment

== Hard float ==
* Testing libffi 3.0.11rc1 - it still hasn't got the variadic patch in, but I'm hoping it will land later in the cycle.

== Other ==
* Excavating my inbox after the week off.
* Built LMbench and kicked a run off on the Panda. (It got stuck in some heuristics under emulation.)

Dave
Re: Benchmarking / justifying cortex-strings
On 5 September 2011 04:21, Michael Hope wrote:
> On Fri, Sep 2, 2011 at 4:08 PM, Michael Hope wrote:
>> Hi Dave. I've been hacking away and have checked in a couple of
>> benchmarking and plotting scripts to lp:cortex-strings. The current
>> results are at:
>> http://people.linaro.org/~michaelh/incoming/strings-performance/
>>
>> All are done on an A9. The results are very incomplete due to how
>> long things take to run. I'll leave ursa3 doing these over the
>> weekend which should flesh this out for the other routines.
>
> Right, that's done. The new graphs are up at:
> http://people.linaro.org/~michaelh/incoming/strings-performance/
>
> The original data is at:
> http://people.linaro.org/~michaelh/incoming/strings-performance/epic.txt
>
> Here's the relative performance for all routines with eight byte
> aligned data and 128 byte blocks:
> http://people.linaro.org/~michaelh/incoming/strings-performance/top-000128.png
>
> memchr, memcpy, strcpy, and strlen all look good at this block size.

Good.

> Here's the speed versus block size for eight byte aligned data:
> http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-memchr-08.png

Nice; odd dip between 8 and 16 chars - I don't switch to the smarter stuff until 16 bytes.

> http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-memset-08.png

Hmm, yes, the short ones could be a bit faster - I always tended to use log X scales :-) The really small ones I wouldn't worry too much about; the interesting stuff is 32-512, where I'd have expected it to have got its act in gear.

> http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strchr-08.png

The version of strchr that's in there is the simple-as-possible strchr; it's byte-at-a-time. I also have a version that uses similar code to memchr, which goes fast at large sizes but is slower for small matches - see: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr?action=AttachFile&do=get&target=panda-01-strchr-git44154ec-strchr-abs.png I'd made the call that performance on smaller strings was probably more important.

> http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strcmp-08.png

Huh? I haven't written a strcmp - that looks like newlib's?

> http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strcpy-08.png

Ditto.

> http://people.linaro.org/~michaelh/incoming/strings-performance/sizes-strlen-08.png

That's very nice - although quite bizarre; even the lower end of the steps is suitably fast, so not really anything to worry about; but it would be great to understand where the 1500-cycle difference is going at the large end.

Dave
Re: Benchmarking / justifying cortex-strings
On 5 September 2011 17:40, Christian Robottom Reis wrote:
> On Mon, Sep 05, 2011 at 03:21:49PM +1200, Michael Hope wrote:
>> memchr is good. memset could be better for blocks of less than 1k.
>> strchr gets second place but is eclipsed by newlib's version. strcmp
>> need work. strcmp is good.
>
> It's strcpy which is good in this last sentence, though it basically
> matches newlib's version.

I think that's because it IS newlib's version - I've not done a strcpy.

> I'm curious about the "political" side of cortexstrings -- is there
> active interest by the library maintainers in picking up our versions?

There is interest from partners in having optimised versions; I think the library maintainers are happy to take them if you can convince them that they are improvements.

Dave
[ACTIVITY] 2011-09-09
== String routines ==
* Trying to understand the strlen behaviour Michael identified
- Found lots of ways of making the faster case slower, but none of making the slower case faster!
- perf not being available on the Panda (bugs 702999/843628) made it difficult to dig down
* Fixing standards corner cases for strchr/memchr - the input match character needs to be truncated to a char (fixes bugs 842258 & 791274)
* Tidying up the formatting for a cortex-strings release
* Looking at eglibc integration again - getting confused by what has to happen in config.sub, and how other users of it cope with triplets like armv7 even though it's not in config.sub

== QEMU ==
* Testing Peter's QEMU release - all good
- Lost a few hours due to the broken version of l-i-f-ui in Oneiric - the PPA version works OK
* A little bit of perf profiling

== Other ==
* Managed to get hold of a nice fast build machine
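The corner case being fixed, illustrated (this follows the C standard's wording: memchr converts its search character to unsigned char, and strchr to char, before comparing):

    #include <assert.h>
    #include <string.h>

    int main(void)
    {
        const char buf[] = "abc";

        /* 0x100 + 'a' must match 'a': an implementation that compares
         * the full int would wrongly return NULL here. */
        assert(memchr(buf, 0x100 + 'a', 3) == buf);
        assert(strchr(buf, 0x100 + 'b') == buf + 1);
        return 0;
    }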
eglibc and fun with config.sub
As mentioned on the standup call this morning, I've been trying to get my head around the way different parts of the toolchain use the config scripts and the triplets. I'd appreciate some thoughts on what the right thing to do is, especially since there was some unease at some of the ideas.

My aim here is to add an armv7 specific set of routines to the eglibc ports and get this picked up only when eglibc is built for armv7; but it's getting a bit tricky.

eglibc shares with gcc and binutils a script called config.sub (which lives in a separate repository) that munges the triplet into a $basic_machine and validates it against a set of known triplets. So for example it has the (shell) pattern:

    arm | arm[bl]e | arme[lb] | armv[2345] | armv[345][lb]

to recognise triplets of the form arm-, armbe-, armle-, armel-, armeb-, armv5-, armv5l- or armv5b-. It also knows more obscure things, such as that if you're configuring for a netwinder it's an armv4l- system running Linux - but frankly most of that type of thing is a decade or two out of date. Note it doesn't yet know about armv6 or armv7.

eglibc builds a search path that at the moment includes a path under the 'ports' directory of the form arm/eabi/$machine, where $machine is typically the first part of your triplet; however at the moment eglibc doesn't have any ARM version specific subdirectories. If I just added a ports/sysdeps/arm/eabi/armv7 directory it wouldn't be used, because eglibc searches arm/eabi/arm if configured with the triplet arm-linux-gnueabi. --with-cpu sets $submachine (NOT $machine) - so if you pass --with-cpu=armv7 it ends up searching arm/eabi/arm/armv7 with the triplet arm-linux-gnueabi. If you had a triplet like armel- then I think it would be searching arm/eabi/armel/armv7.

So my original patch ( http://old.nabble.com/-ARM--architecture-specific-subdirectories,-optimised-memchr-and-some-questions-td32070289.html ) did the following:

* Modified the paths searched to be arm/eabi (rather than arm/eabi/$machine)
* If $submachine hadn't been set by --with-cpu, autodetected it from gcc's #defines

which meant that it ignored the start of the triplet and let you specify --with-cpu=armv7.

After some discussion with Joseph Myers, he's convinced me that isn't what eglibc is expecting (see later in the thread linked above); what it should be doing is that $machine should be armv7, and $submachine should be used if we wanted, say, a cortex-a8 or cortex-a9 specific version. My current patch:

* adds armv6 and armv7 to config.sub
* adds arm/eabi/armv7 and arm/eabi/armv6t2 and one assembler routine in there
* if $machine is just 'arm', autodetects from gcc's #defines
* else if $machine is armv* then that's still used as $machine

So if you use:

* a triplet like arm-linux-gnueabi, it looks at gcc, and if that's configured for armv7-a it searches arm/eabi/armv7
* a triplet like armv7-linux-gnueabi, it searches arm/eabi/armv7 irrespective of what gcc was configured for
* a triplet like armv7-linux-gnueabi and --with-cpu=cortex-a9, it searches arm/eabi/armv7/cortex-a9 then arm/eabi/armv7

As far as I can tell, gcc ignores the first part of the triplet other than noting it's arm and spotting if it ends with b for big endian (i.e. configuring gcc with armv4-linux-gnueabi and armv7-linux-gnueabi ends up with the same compiler).
binutils also mostly ignores the first part of the triplet - although it's a bit of a mess, with different parts parsing it differently (it seems to spot arm9e for some odd reason); as far as I can tell, gold will accept armbe* for big-endian whereas ld takes arm*b!

If you're still reading, then the questions are:
1) Does the approach I've suggested make sense - in particular, that the machine directory chosen is based either on the triplet or, where the triplet doesn't specify it, on the configuration of gcc? That's my interpretation of what Joseph is suggesting.
2) Doing (1) would seem to suggest I should give config.sub armv6t2 and some of the other complex names.

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
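To make the autodetection idea concrete, here is a hypothetical probe of the sort a configure script could run: feed a fragment through the preprocessor and read off which word survives. The __ARM_ARCH_7A__ and __ARM_ARCH_6T2__ macros are real gcc predefines; the probe file itself is invented for illustration and is not the actual patch:

    /* probe.c - try: gcc -E probe.c | grep -v '^#' */
    #if defined(__ARM_ARCH_7A__)
    armv7
    #elif defined(__ARM_ARCH_6T2__)
    armv6t2
    #else
    arm
    #endif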
Re: eglibc and fun with config.sub
OK, so we seem to have agreement here that what we want is autodetection for eglibc, forgetting about the triplet; technically that probably makes my life easier, and I don't think it's too hard a sell.

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
[ACTIVITY] 2011-09-16
== String routines ==
* Tidying up bits of cortex-strings for the release process
* Nailing down the behaviour of config.sub and the config systems in gcc, binutils and eglibc

== Other ==
* A discussion on synchronisation primitives on various CPUs that started on the gcc list
  - looking at http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
  - pointing out the 64-bit instructions
  - asking why they used ISBs when neither the kernel nor gcc uses them (answer: DMBs should be fine as well, though there is some debate over which is quicker - and DMBs are turned into slower DSBs on most A9s due to an erratum); one DMB-based mapping is sketched after this message
* Looking for docs on the non-core bits of current SoCs
* Extracting some denbench stats from a few months back for Ramana

About a day of non-Linaro IBM stuff.

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
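For reference, a minimal sketch of one DMB-based mapping from that table: a C/C++0x load-acquire realised as a plain load followed by a DMB. This assumes gcc inline asm and an ARMv7 target (compile with -march=armv7-a); it illustrates the mapping rather than reproducing code from the discussion:

    /* Load-acquire on ARMv7: plain load, then a data memory barrier.
       The "memory" clobber stops the compiler reordering around it. */
    static inline int load_acquire(const int *p)
    {
        int v = *(const volatile int *)p;
        __asm__ volatile("dmb" ::: "memory");
        return v;
    }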
Re: eglibc and fun with config.sub
On 19 September 2011 00:48, Michael K. Edwards wrote:
> Please coordinate with Jon Masters at RedHat/Fedora and Adam Conrad at
> Ubuntu/Debian on this. (Cc'ing the cross-distro list, through which the
> recent ARM summit at Linux Plumbers was organized.)

OK, let me summarise for the new people on the cc. I'm looking at adding a few ARM-optimised string routines to eglibc (memchr initially); they require ARMv6T2 or newer, hence I started looking at how eglibc finds architecture-specific code. At the moment eglibc uses the architecture from the first part of the triplet as part of the search path in the source tree, so configuring for armv5-linux-gnueabi will end up looking in an arm/eabi/armv5 directory.

What I was going to do (after some discussion with Joseph Myers: http://old.nabble.com/-ARM--architecture-specific-subdirectories,-optimised-memchr-and-some-questions-td32070289.html ) was:
1) Add armv6/v7 to config.sub
2) If the version wasn't specified in the first part of the triplet, use gcc's #defines to autodetect what we're building for
3) If the version was specified in the triplet, use it

A bit of digging, however, showed that neither gcc nor binutils seems to use the version in the triplet to determine anything, and the discussion on the linaro-toolchain list that started this thread has come to the conclusion that people would actually prefer to always ignore the triplet (like binutils and gcc) and just autodetect from gcc's #defines.

Thoughts?

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
[ACTIVITY] 2011-09-23
== String routines ==
* Having got agreement on ignoring the triplet when picking routines, I'm just testing a patch, but fighting a qemu setup.
* Found the binfmt binding for armeb was wrong (it runs the little-endian version); filed a bug with the fix included.

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
Re: Use of memcpy() in libpng
On 27 September 2011 14:16, Christian Robottom Reis wrote:
> On Tue, Sep 27, 2011 at 09:47:33AM +0100, Ramana Radhakrishnan wrote:
>> On 26 September 2011 21:51, Michael Hope wrote:
>> > Saw this on the linaro-multimedia list:
>> > http://lists.linaro.org/pipermail/linaro-multimedia/2011-September/74.html
>> >
>> > libpng spends a significant amount of time in memcpy(). This might
>> > tie in with Ramana's investigation or the unaligned access work by
>> > allowing more memcpy()s to be inlined.
>>
>> It's the unaligned access and the change / improvements to the memcpy
>> that *might* help in this case. But that ofcourse depends on the
>> compiler knowing when it can do such a thing. Ofcourse what might be
>> more interesting is the kind of workload analysis that Dave's done in
>> the past with memcpy to know what the alignment and size of the buffer
>> being copied is.
>
> If you guys could take a look at this there is a potential requirement
> for the MMWG around libpng optimization; we could fit this in along with
> other work (possible vectorizing, etc) on that component.

It wouldn't take long to analyse the memcpy calls - life would be easier if we had the test program and some details, such as the sizes of the images used in these benchmarks. (A sketch of the kind of interposer such an analysis could use follows this message.)

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
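For what it's worth, a sketch of such an interposer: an LD_PRELOAD shim that logs the size and alignment of every memcpy() call. All names here are illustrative, and the recursion guard is needed because fprintf() may itself call memcpy():

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    void *memcpy(void *dst, const void *src, size_t n)
    {
        static void *(*real)(void *, const void *, size_t);
        static __thread int busy;

        if (!real)
            real = (void *(*)(void *, const void *, size_t))
                       dlsym(RTLD_NEXT, "memcpy");
        if (!busy) {
            busy = 1;
            /* Log length and the low 3 bits of each pointer. */
            fprintf(stderr, "memcpy n=%zu dst&7=%u src&7=%u\n", n,
                    (unsigned)((uintptr_t)dst & 7),
                    (unsigned)((uintptr_t)src & 7));
            busy = 0;
        }
        return real(dst, src, n);
    }

Built with 'gcc -shared -fPIC -o cpylog.so cpylog.c -ldl' and run under LD_PRELOAD=./cpylog.so, this writes one line per call to stderr, which is enough to histogram sizes and alignments afterwards.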
[ACTIVITY] 2011-09-28
== String routines ==
* Got the eglibc testing setup happy at last
  - Note that -O3 builds generally seem to give a few more errors, which are probably worth looking at
  - -march=armv6 -mthumb hit some non-Thumb-1 instructions (typically using non-lo registers); again worth looking at
  - Cross-testing to QEMU user mode often stalls, mostly on nptl tests that abort/fail when run in system mode or natively
* Sent a new version of the eglibc/memchr patch upstream
* Now have a working newlib test setup and reference set - the next step is to try adding my memchr there

== Other ==
* Testing a QEMU patch with Peter
* Looking at bug 861296 (difference in mmap layouts)
* Adding a few suggestions to the set of cpu hotplug tests
* Dealing with the Manchester lab cold

Short week; back on Monday

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
[ACTIVITY] 2011-10-07
== String Routines ==
* Built and tested a newlib with my memchr in - ready to go after a bit of tidy-up.
* Followed up on my eglibc patch submission, where a comment suggested using --with-cpu, by pointing back at the previous discussion.

== 64 Bit atomics ==
* Updated the gcc patch based on Ramana's comments, retested, and posted the new version (a minimal example of what it enables follows this message).
  - Lost half a day to a failing SD card in our Panda.

== QEMU ==
* Posted a patch making one variable thread-local using __thread, which fixes multi-threaded user-mode ARM programs (e.g. Firefox); on the list this seems to have mutated into a patch for more general thread-local support.

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
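For context, a minimal example of what the patch enables: gcc's existing __sync builtins applied to a 64-bit type on ARM. The builtins and their signatures are standard gcc; the example assumes a toolchain with the patch applied:

    #include <stdio.h>

    static long long counter;   /* 64-bit on a 32-bit target */

    int main(void)
    {
        __sync_fetch_and_add(&counter, 1LL);               /* counter = 1 */
        __sync_val_compare_and_swap(&counter, 1LL, 100LL); /* counter = 100 */
        printf("%lld\n", counter);
        return 0;
    }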
Screwy panda timing?
I've just tried rerunning some benchmarks on my Panda, which I reinstalled recently, and am getting some odd behaviour. The kernel is 3.0.0-1404-linaro-lt-omap. For example:

simple_strlen: ,102400, loops of ,62, bytes=6.054688 MB, transferred in ,20324707.00 ns, giving, 297.897898 MB/s
simple_strlen: ,102400, loops of ,32, bytes=3.125000 MB, transferred in ,7904053.00 ns, giving, 395.366782 MB/s
simple_strlen: ,102400, loops of ,16, bytes=1.562500 MB, transferred in ,7354736.00 ns, giving, 212.448142 MB/s
simple_strlen: ,102400, loops of ,8, bytes=0.781250 MB, transferred in ,91553.00 ns, giving, 8533.308575 MB/s
simple_strlen: ,102400, loops of ,4, bytes=0.390625 MB, transferred in ,1495361.00 ns, giving, 261.224547 MB/s
simple_strlen: ,102400, loops of ,2, bytes=0.195312 MB, transferred in ,1983643.00 ns, giving, 98.461518 MB/s

Note the 8-byte case is apparently 40 times faster. And for true oddness:

smarter_strlen_ldrd: ,102400, loops of ,62, bytes=6.054688 MB, transferred in ,3936768.00 ns, giving, 1537.984331 MB/s
smarter_strlen_ldrd: ,102400, loops of ,32, bytes=3.125000 MB, transferred in ,0.00 ns, giving, inf MB/s
smarter_strlen_ldrd: ,102400, loops of ,16, bytes=1.562500 MB, transferred in ,4180909.00 ns, giving, 373.722557 MB/s

Now, while I like infinite transfer rates, I suspect they're wrong. Anyone else seeing this? (A sketch of a timer sanity check follows this message.)

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
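A sketch of the sort of timer sanity check that can separate a broken clock from a slow machine: time a fixed workload repeatedly and flag zero or wildly varying deltas. This assumes CLOCK_MONOTONIC as the timestamp source (link with -lrt on older glibc); it is not the actual benchmark harness:

    #include <stdio.h>
    #include <time.h>

    static long long now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main(void)
    {
        volatile unsigned sink = 0;
        long long t0, dt;
        int run;
        unsigned i;

        for (run = 0; run < 10; run++) {
            t0 = now_ns();
            for (i = 0; i < 10 * 1000 * 1000; i++)
                sink += i;                  /* fixed workload */
            dt = now_ns() - t0;
            printf("run %d: %lld ns%s\n", run, dt,
                   dt <= 0 ? "  <-- suspicious" : "");
        }
        return 0;
    }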
[ACTIVITY] 2011-10-14
== 64 bit atomics ==
* Thanks to Ramana for OKing my gcc patches, and Richard for committing them
  - I've backported these to the gcc-linaro branch and pushed it - hopefully they will pass OK!

== String routines ==
* Sent my memchr patch to upstream newlib; received comments, tweaked, and resent
* Sent the strlen patch to upstream newlib
* Spent some time getting confused by timing issues on our Panda; it was reinstalled with 11.09 a few weeks ago and is now showing some odd behaviours. In particular I'm seeing some tests complete in 0ns (and my code isn't -that- fast!), and others where the times vary wildly - it's almost as if a timer interrupt is delayed or missing; the same test binary works fine on one of Michael's Ursas running an older install.

== QEMU ==
* Tested Peter's QEMU image for release

== Other ==
* Spent an afternoon reading through the System Trace docs

On holiday next week; I'll poll email occasionally.

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
[ACTIVITY] 2011-10-28
== 64 bit atomics ==
* I've been building and testing membase
* The 1.7.1.1 source builds OK (after turning off -Werror due to some of their curious type naming)
* The git version fails to build - and not consistently
* 1.7.1.1 passes simple tests, but three tests in its test suite intermittently fail on ARM and seem solid on x86 (there are also some that just need timeouts increased due to the relatively slow machine)
* t/issue_163.t turned out to be a timing race in the test itself, made worse by running on a relatively slow machine and probably by the Panda's odd idea of timing. I reported it to them with a breakdown, and upstream has fixed the test ( http://code.google.com/p/memcached/issues/detail?id=230 )
* t/issue_67.t is proving tougher; once in a while memcached will lock up during init in thread_init, and there is one particular point where adding a printf makes it work apparently reliably. I've got one or two ideas, but I need to check my understanding of pthread_cond_wait first (the canonical pattern is sketched after this message)
* There's an assert I've seen triggered once - not looked at that yet

== String routines ==
* While I was off last week, my memchr and strlen were accepted into newlib
* Joseph has responded to my eglibc mail, with a couple of small queries

== Other ==
* Wrote a more detailed test case for bug 873453 (odd timing behaviour on the Panda); it's quite odd - I can get a timing discrepancy of more than ~80ms, so it's not a clock-granularity issue
* Replicated a QEMU crash for Peter

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
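For reference, the canonical pthread_cond_wait() pattern relevant to a hang like this one: the predicate must be re-checked in a loop under the mutex, because waits can wake spuriously and a signal sent before the wait starts is otherwise lost. The names here are illustrative, not memcached's actual code:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int threads_ready;   /* the predicate carries the state */

    void wait_for_threads(int wanted)
    {
        pthread_mutex_lock(&lock);
        while (threads_ready < wanted)      /* a loop, never a bare if */
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
    }

    void announce_ready(void)
    {
        pthread_mutex_lock(&lock);
        threads_ready++;
        pthread_cond_signal(&cond);         /* signal with the lock held */
        pthread_mutex_unlock(&lock);
    }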
[ACTIVITY] 2011-11-04
== 64 bit atomics ==
* I got the race in membase down to a futex issue; asking dmart pointed me at a kernel bug affecting recent kernels, for which a fix had gone in about a month ago. That was a nasty one!
* I've still got a few bugs left; most are turning out to be timing races in the test code - e.g. one that times out after 2 seconds when the code takes around 1.7 seconds, so if something else gets in it trips over the line; and another where the test did a recv_from on a socket but only got the start of a message, presumably because the sender had used multiple sends (the general fix is sketched after this message). It's tricky going because the tests are a combination of most scripting languages (Perl, Python, Ruby, with a splash of Erlang). I've so far found no bugs in the atomic code.
* I looked at apr and SDL-1.3; both use atomics but end up not using 64-bit ones - the tendency is to ensure atomics work on long and on void*, both of which are 32-bit for us.

== String routines ==
* I've got the newlib A15-optimised memcpy running in a test harness at the moment for comparison.

== Listening to Connect ==
* I listened in to a few Connect sessions each day; the first day or so was three-quarters lost to audio systems that didn't work (I'm especially annoyed at not being able to hear the QEMU for A15/KVM session and the toolchain support for the kernel one). The Rypple session was rather lost through the lack of any screen share or slides.

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
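The general fix for that short-read race is worth spelling out: TCP is a byte stream, so one recv() need not correspond to one send(), and a reader has to loop until it has a complete message. An illustrative helper, not the test suite's code:

    #include <sys/types.h>
    #include <sys/socket.h>

    /* Read exactly len bytes, looping over partial receives. */
    ssize_t recv_all(int fd, void *buf, size_t len)
    {
        size_t got = 0;

        while (got < len) {
            ssize_t n = recv(fd, (char *)buf + got, len - got, 0);
            if (n <= 0)
                return n;       /* error, or peer closed the socket */
            got += n;
        }
        return (ssize_t)got;
    }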
[ACTIVITY] 2011-11-11
== 64 bit atomics ==
* Nailed one more of the membase tests; again this was a test-harness race condition (which I've reported here: http://code.google.com/p/moxi/issues/detail?id=2&thanks=2&ts=1321037460 ). In this case the server performed two calls to write(), yet the test client performed a single read and compared the result to what it was expecting; it got lucky on x86, and about half the time on ARM, in that all the server's data managed to get read by the first read. I think this leaves one more case - one that I've seen rarely.

== QEMU ==
* Tested Peter's 11.11 pre-release; ran into a couple of issues (vexpress without sound causing hangs, and the Linaro 11.10 Beagle and Overo images not running X). Also filed a couple of bugs in l-i-f-ui that I tripped over while testing it.

== String routines ==
* The new newlib A15-optimised memcpy is slower on an A9 than my routines; posted to the newlib list asking what the normal way of dealing with a bunch of different routines is. Would it make sense to get gcc to define a GCC_ARM_TUNE_CORTEX_A-whatever? (A sketch of what such a define could drive follows this message.)

== Other ==
* Watched the YouTube video of the kernel/toolchain discussion - for those who didn't attend, I'd encourage a look at the YouTube videos; they're pretty nicely done.
* Got pulled away on non-Linaro work for about half the week.

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
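A hypothetical sketch of the build-time dispatch being asked about. None of these names exist today - GCC_ARM_TUNE_CORTEX_* is exactly the kind of define the question proposes, and the memcpy_* variants are imagined tuned routines:

    #include <stddef.h>

    void *memcpy_a15(void *, const void *, size_t);     /* imagined */
    void *memcpy_a9(void *, const void *, size_t);      /* imagined */
    void *memcpy_generic(void *, const void *, size_t); /* imagined */

    /* Pick the routine at build time from a per-core tune macro. */
    #if defined(GCC_ARM_TUNE_CORTEX_A15)
    # define memcpy_best memcpy_a15
    #elif defined(GCC_ARM_TUNE_CORTEX_A9)
    # define memcpy_best memcpy_a9
    #else
    # define memcpy_best memcpy_generic
    #endif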
[ACTIVITY] 2011-11-18
== 64 bit atomics ==
* Still fighting membase
* Cleaned up a bunch of other issues, but I'm back at an 'expiry' issue, where the test stores some data with a fixed expiry time, waits until after it should have expired, and checks that it has. Except on ARM it sometimes doesn't expire quickly enough. I've got enough debug now to see that the server process's view of time (which it updates via an event about every second) is sometimes well behind gettimeofday()'s view of time - and I have a small test for it (an illustrative reduction follows this message). This doesn't seem to happen on x86. The good part is that it's now a much smaller test; the bad part is that it fails rarely - somewhere between 1/1000 and 1/100 depending on its mood.
* Looked at a few other things to see if they might use 64-bit atomics:
  - spice's (the VNC-like protocol) FAQ said it needed 64-bit atomics and didn't work on 32-bit machines because of that; but the source appears to have been fixed for 32-bit.
  - Looked at Boost lock-free; it does have an implementation using gcc's __sync primitives, but for ARM it uses a hand-coded set that is missing the 64-bit operations - and the contributor of the ARM code said the Boost lock-free author preferred not to use the gcc primitives.

== Other ==
* Testing the latest libffi rc
  - It had most of my varargs-for-hard-float fix in (one part of a test had been missed)
* 1 day of non-Linaro work

I'm on holiday next week.

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
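An illustrative reduction of the small drift test mentioned above (the real test lives alongside membase; everything here is invented): wake up about once a second, the way the server's timer event does, and report how late each wakeup is against gettimeofday():

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(void)
    {
        struct timeval tv;
        long long prev, now;
        int i;

        gettimeofday(&tv, NULL);
        prev = tv.tv_sec * 1000000LL + tv.tv_usec;

        for (i = 0; i < 30; i++) {
            sleep(1);                      /* stands in for the ~1s timer event */
            gettimeofday(&tv, NULL);
            now = tv.tv_sec * 1000000LL + tv.tv_usec;
            if (now - prev > 1200000)      /* fired more than ~200ms late */
                printf("tick %d: %lldus late\n", i, now - prev - 1000000);
            prev = now;
        }
        return 0;
    }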
[ACTIVITY] 2011-12-02
== String routines ==
* Sent an updated memchr to the eglibc list

== 64 bit atomics ==
* Ran a set of timing-consistency tests that a colleague had sent me while I was off; the Panda passed, so time doesn't appear to be going backwards or anything - that's not the problem with membase.
* Pushed the code into linaro-gcc.

== QEMU ==
* Tested Peter's pre-release - all good.
* Started looking at the issues for running in TCG mode on ARM.

== Other ==
* Read through the ARMv8 instruction docs that landed on arm.com; quite interesting. Note that multiple-instruction IT blocks are listed as deprecated for 32-bit mode on v8 (they still work, but the core can be put into a mode that faults on them, to make the uses easy to find).
* Some debugging of the Panda's odd timing issue with Paul McKenney.

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
Re: Release notes for GCC 4.6
On 7 December 2011 20:36, Andrew Stubbs wrote:
> Hi all,
>
> I've copied all those who made commits to GCC 4.6 this month.
>
> Could you please give me a sentence or two for the release notes?

Support for 64-bit __sync* primitives on ARM.

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
[ACTIVITY] 2011-12-09
== QEMU ==
* Wrote a fix for bug 883133 (code buffer/libc conflict); spent some time testing it because I wasn't sure whether the crash I was seeing afterwards was my fix being incomplete or actually bug 893208.
* Got it to boot with -cpu 486; without that it triple-faults in a divide just after a load of timestamp reads, which makes me suspicious that 893208 is a timer problem.
* (It also fails when used with VNC graphics, but works in SDL and curses - I'll leave that bug for another time.)

== String routines ==
* With one more tweak to my memchr, it finally made it into eglibc.

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
[ACTIVITY] 2011-12-16
== General ==
* Tidying things up and updating my list of statuses

== String routines ==
* Adding strchr and strlen to eglibc; tests running at the moment.

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
[ACTIVITY] 2011-12-23 - and goodbye
== QEMU ==
* Wrote the context routines for eglibc, including those that QEMU uses. These pass all the context tests I could find, including QEMU's coroutine tests, and with them QEMU seems to boot OK. I've got a full eglibc test run going at the moment, but I don't think anything else uses them. I posted them with comments and a question to libc-ports; I'll try and chase follow-ups. (A minimal example of the API follows this message.)

== String routines ==
* I posted the strchr and strlen routines to eglibc (libc-ports).
* On strchr, the question came up of whether it was worth using the longer version that's faster for longer strings (but slower for shorter ones). I posted some stats and observations, and the discussion is still ongoing.
* For strlen, rth noted the same trick of a quicker end-of-string sequence using clz that I'd originally seen in newlib (and which RichardS and Ramana had also suggested). I'd avoided it because I'd first seen it in newlib and didn't want to copy it; but since three people have independently suggested it, it would seem worth using.

== Goodbye! ==
Thank you all for a fun & interesting year! I'm sure many of us will meet online again in the future. I'll try and follow my linaro.org address while it's still live to check for any replies to patches/comments etc. Feel free to mail me at david...@uk.ibm.com (work) or d...@treblig.org (home); for Linaro people I've also added some more contact methods at: https://wiki.linaro.org/Internal/People/DaveGilbert/Contact

Thanks again!

Dave

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
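As a postscript, a minimal round trip through the context API those routines implement - QEMU's coroutines do essentially this. These are the standard ucontext calls, not the eglibc test code:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, co_ctx;

    static void coroutine(void)
    {
        puts("in coroutine");
        swapcontext(&co_ctx, &main_ctx);    /* yield back to main */
        puts("coroutine resumed");
    }                                       /* returning follows uc_link */

    int main(void)
    {
        static char stack[64 * 1024];

        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = stack;
        co_ctx.uc_stack.ss_size = sizeof stack;
        co_ctx.uc_link = &main_ctx;         /* where to go when it returns */
        makecontext(&co_ctx, coroutine, 0);

        swapcontext(&main_ctx, &co_ctx);    /* prints "in coroutine" */
        puts("back in main");
        swapcontext(&main_ctx, &co_ctx);    /* prints "coroutine resumed" */
        puts("done");
        return 0;
    }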