Re: FDO and LTO on ARM
On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka wrote:
>>> Did you try using FDO with -Os?  FDO should make hot code parts
>>> optimized similar to -O3 but leave other pieces optimized for size.
>>> Using FDO with -O3 gives you the opposite, cold portions optimized
>>> for size while the rest is optimized for speed.
>
> FDO with -Os still optimizes for size, even in hot parts.

I don't think so.  Or at least that would be a bug.  Shouldn't 'hot'
BBs/functions be optimized for speed even at -Os?  Hm, I see predict.c
indeed always returns false for optimize_size :(

I thought only the parts that are neither cold nor hot were optimized
according to optimize_size.

> So to get reasonable speedups you need -O3+FDO.  -O3+FDO effectively
> defaults to -Os in cold portions of the program.

Well, but unless your training coverage is 100%, all parts with no coverage
get optimized with -O3 instead of -Os.  And I bet coverage for mozilla
isn't even close to 100%.  Thus I think recommending -O3 for FDO is
usually a bad idea.

So - did you try FDO with -O2? ;)

> Still -Os+FDO should be somewhat faster than -Os alone, so a slowdown is
> a bug.  It is not tested very thoroughly since it is not really used in
> practice.
>
>>> Also do you get any warnings on profile mismatches?  Perhaps something
>>> is wrong to the degree that the relevant part of the profile gets
>>> misapplied.
>>
>> I don't get any warning on profile mismatches.  I only get a "few"
>> missing gcda files warning, but that's expected.
>
> Perhaps you could compile one of the less trivial files you are sure are
> covered by the train run and send me the -fdump-tree-all-blocks
> -fdump-ipa-all dumps of the compilation so I can double check the profile
> seems sane.  This could be a good start to rule out something stupid.
>
> Honza
Re: [named address] rejects-valid: error: initializer element is not computable at load time
Georg-Johann Lay wrote:
> Ulrich Weigand wrote:
> > This is pretty much working as expected.  "pallo" is a string literal
> > which (always) resides in the default address space.  According to the
> > named address space specification (TR 18037) there are no string literals
> > in non-default address spaces ...
>
> The intention of TR 18037 is to supply an "Extension to Support Embedded
> Systems", and these are systems where every byte counts -- and it counts
> *where* the byte will be placed.
>
> Basically, named AS provide something like target specific qualifiers, and
> if GCC, maybe under the umbrella of GNU-C, would actually implement a
> feature like target specific qualifiers, that would be a great gain and
> much more appreciated than -- users will perceive it that way -- being
> more catholic than the pope ;-)

The problem with all language extensions is that you really need to be careful that the new feature you want to add is fully specified, in all its potential interactions with the rest of the (existing) language features.  If you don't, some of those ambiguities are certain to cause you problems later on -- in fact, that's just what has happened time and again with GCC extensions that were added early on ...

This is why these days, extensions usually are accepted only if they *are* fully specified (which usually means providing a "diff" to the C standard text that would add the feature to the standard).  This is a non-trivial task.  One of the reasons why we decided to follow the TR 18037 spec when implementing the __ea extension for SPU is that this task had already been done for us.  If you want to deviate from that existing spec, you're back to doing this work yourself.

> For example, you can have any combination of qualifiers like const, restrict
> or volatile, but it is not possible for named AS.  That's clear as long as
> named AS is as strict as TR 18037.  However, users want features to write
> down their code in a comfortable, type-safe way and not as it is at the
> moment, i.e. by means of dreaded inline assembler macros (for my specific
> case).

A named AS qualifier *can* be combined with other qualifiers like const.  It cannot be combined with *other* named AS qualifiers, because that doesn't make sense in the semantics underlying the address space concept of TR 18037.  What would you expect a combination of two AS qualifiers to mean?

> > The assignment above would therefore need to convert a pointer to the
> > string literal in the default space to a pointer to the __pgm address
> > space.  This might be impossible (depending on whether __pgm encloses
> > the generic space), and even if it is possible, it is not guaranteed
> > that the conversion can be done as a constant expression at compile time.
>
> The backend can tell.  It likes to implement features to help users.
> It knows about the semantics and whether it's legal or not.
>
> And even if it's all strict under TR 18037, the resulting error messages
> are *really* confusing to users because to them, a string literal's
> address is known.

It would be possible to extend the named AS implementation to allow AS pointer conversions in initializers in those cases where the back-end knows this can be done at load time.  (Since this is all implementation-defined anyway, it would cause no issues with the standard.  We simply didn't do it because on the SPU, it is not easily possible.)  Of course, that still wouldn't place the string literal into the non-generic address space, it just would convert its address.

> > What I'd do to get a string placed into the __pgm space is to explicitly
> > initialize an *array* in __pgm space, e.g. like so:
> >
> >    const __pgm char pallo[] = "pallo";
> >
> > This is defined to work according to TR 18037, and it does actually
> > work for me on spu-elf.
>
> Ya, but it's different from the line above.

Sure, because it allocates only the string data, and not in addition a pointer to it as your code did ...

> Was just starting with the work and it worked some time ago, so I wondered.

I think some time in the past, there was a bug where initializers like the one in your original line were silently accepted but then incorrect code was generated (i.e. the pointer would just be initialized to an address in the generic address space, without any conversion).

> And I must admit I am not familiar with all the dreaded restrictions
> TR 18037 imposes to render it less functional :-(

It's not a restriction so much; it's simply that TR 18037 does not say anything about string literals at all, so they keep working as they do in standard C.

> Do you think a feature like "target specific qualifier" would be reasonable?
> IMO it would be greatly appreciated by users.
> Should not be too hard atop the work already being done for named addresses.

As I said, any further extension would need to be carefully specified ...  In any case, whether this would then be accepted would be up to the front-end maintainers, of course.
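A small sketch pulling the two points above together -- a named AS qualifier combining with ordinary qualifiers, and the array form actually placing the string data -- using the __pgm qualifier from this thread (it exists only with the AVR named-address-space patch; the annotations are illustrative, not exact diagnostics):

    /* A named AS qualifier combines freely with const/volatile ...  */
    const __pgm char table[] = "pallo";     /* string data placed in __pgm   */
    const __pgm char * volatile cursor;     /* volatile RAM pointer to __pgm */

    /* ... but a pointer initializer does not move the literal: "pallo"
       stays in the generic space, and its address would have to be
       converted to __pgm as a load-time constant -- which is what the
       "initializer element is not computable at load time" error above
       complains about.  */
    const char __pgm *p = "pallo";          /* rejected / problematic        */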
Re: [named address] ice-on-valid: in postreload.c:reload_cse_simplify_operands
Georg-Johann Lay wrote: > Ulrich Weigand wrote: > > I'd be happy to bring this up to date if you're willing to work with > > me to get this tested on a target that needs this support ... > > Just attached a patch to bugzilla because an AVR user wanted to play > with the AS support and asked me to supply my changes. It's still in > a mess but you could get a more reasonable base than on a target where > all named addresses vanish at expand. > > The patch is fresh and attached to the enhancement PR49868, just BE stuff. > There is also some sample code. OK, I'll have a look. Looking at your definition of the routines avr_addr_space_subset_p and avr_addr_space_convert, they appear to imply that any generic address can be used without conversion as a __pgm address and vice versa. That is: avr_addr_space_subset_p says that __pgm is both a superset and a subset of the generic address space, so they must be co-extensive. avr_addr_space_convert then says that the addresses can be converted between the two without changing their value. Is this really true? If so, why have a distinct __pgm address space in the first place? Bye, Ulrich -- Dr. Ulrich Weigand GNU Toolchain for Linux on System z and Cell BE ulrich.weig...@de.ibm.com
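For reference, the two hooks under discussion have the following shape; this is only a sketch of the behaviour Ulrich describes (treating __pgm and the generic space as co-extensive and freely convertible), not the actual code from the PR49868 patch:

    /* "Co-extensive": every generic address is also a valid __pgm address
       and vice versa ...  */
    static bool
    avr_addr_space_subset_p (addr_space_t subset, addr_space_t superset)
    {
      return true;
    }

    /* ... and converting between the two spaces leaves the value unchanged.  */
    static rtx
    avr_addr_space_convert (rtx op, tree from_type, tree to_type)
    {
      return op;
    }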
Re: [named address] ice-on-valid: in postreload.c:reload_cse_simplify_operands
Ulrich Weigand schrieb:
> Georg-Johann Lay wrote:
>> Ulrich Weigand wrote:
>>> I'd be happy to bring this up to date if you're willing to work with
>>> me to get this tested on a target that needs this support ...
>>
>> Just attached a patch to bugzilla because an AVR user wanted to play
>> with the AS support and asked me to supply my changes.  It's still in
>> a mess but you could get a more reasonable base than on a target where
>> all named addresses vanish at expand.
>>
>> The patch is fresh and attached to the enhancement PR49868, just BE stuff.
>> There is also some sample code.
>
> OK, I'll have a look.
>
> Looking at your definition of the routines avr_addr_space_subset_p and
> avr_addr_space_convert, they appear to imply that any generic address
> can be used without conversion as a __pgm address and vice versa.

There is a conversion HI (Pmode) <-> PHI (__pgm).  These two modes resp. insns have the same arithmetic but differ in the instructions they emit and in reloading.

> That is: avr_addr_space_subset_p says that __pgm is both a superset and
> a subset of the generic address space, so they must be co-extensive.
> avr_addr_space_convert then says that the addresses can be converted
> between the two without changing their value.
>
> Is this really true?  If so, why have a distinct __pgm address space in
> the first place?
>
> Bye,
> Ulrich

AVR hardware has basically three address spaces:

  Memory   Physical       Mode      Instruction   Holds
  -------------------------------------------------------------------------
  RAM      0, 1, 2, ...   HI=Pmode  LD*, ST*      .data, .rodata, .bss
  Flash    0, 1, 2, ...   PHI       LPM           .text, .progmem.data
  EEPROM   0, 1, 2, ...   --        via SFR       .eeprom

Devices have just some KB of RAM, and constants are put into .progmem.data via attribute progmem and read via inline asm.

AVR has three address registers X, Y and Z that can access memory.  SP is fixed and can just do push/pop on RAM.  The addressing capabilities follow; only byte access is supported by the hardware:

  RAM:    constant address
          *X, *--X, *X++
          *Y, *--Y, *Y++, *(Y+offset)    (Y is the frame pointer)
          *Z, *--Z, *Z++, *(Z+offset)    (offset in [0, 63])
  Flash:  *Z, *Z++

Of course, RAM and Flash are not subsets of each other when regarded as physical memory, but they are subsets when regarded as numbers.  This led to my mistake of defining RAM and Flash as no subsets of each other:
   http://gcc.gnu.org/ml/gcc/2010-11/msg00170.html

In a typical AVR program the user knows at compile time how to access a variable and uses an appropriate pointer like int* or const __pgm int*.  In a current program he uses inline asm for the second case.  However, there are situations like the following where you'd like to take the decision at runtime:

char cast_3 (char in_pgm, void * p)
{
    return in_pgm ? (*((char __pgm *) p)) : (*((char *) p));
}

The numeric value of p will stay exactly the same; just the mode and thus the access instruction changes, like

  if (in_pgm)
    r = LPM Z     (PHI:Z)
  else
    r = LD Z      (HI:Z, or LD X+, or whatever)

Linearizing the address space at compiler level is not wanted because that would lead to bulky, slow code and reduce the effective address space available for Flash, which might be up to 64kWords.  An address in X, Y, Z is 16 bits wide; these regs occupy 2 hard regs.

Johann
Re: [named address] ice-on-valid: in postreload.c:reload_cse_simplify_operands
Georg-Johann Lay wrote: > AVR hardware has basically three address spaces: [snip] OK, thanks for the information! > Of course, RAM and Flash are no subsets of each other when regarded as > physical memory, but they are subsets when regarded as numbers. This > lead to my mistake to define RAM and Flash being no subsets of each other: >http://gcc.gnu.org/ml/gcc/2010-11/msg00170.html Right, in your situation those are *not* subsets according to the AS rules, so your avr_addr_space_subset_p routine needs to always return false (which of course implies you don't need a avr_addr_space_convert routine). Getting back to the discussion in the other thread, this also means that pointer conversions during initialization cannot happen either, so this discussion is basically moot. > However, there are situations like the following where you like to take > the decision at runtime: > > char cast_3 (char in_pgm, void * p) > { > return in_pgm ? (*((char __pgm *) p)) : (*((char *) p)); > } That's really an abuse of "void *" ... if you have an address in the Flash space, you should never assign it to a "void *". Instead, if you just have a address and you don't know ahead of time whether it refers to Flash or RAM space, you ought to hold that number in an "int" (or "short" or whatever integer type is most appropriate), and then convert from that integer type to either a "char *" or a "char __pgm *". > Linearizing the address space at compiler level is not wanted because > that lead to bulky, slow code and reduced the effective address space > available for Flash which might be up to 64kWords. I guess to simplify things like the above, you might have a third address space (e.g. "__far") that is a superset of both the default address space and the __pgm address space. Pointers in the __far address space might be e.g. 3 bytes wide, with the low 2 bytes holding the address and the high byte identifying whether the address is in Flash or RAM. Then a plain dereference of a __far pointer would do the equivalent of your cast_3 routine above. Bye, Ulrich -- Dr. Ulrich Weigand GNU Toolchain for Linux on System z and Cell BE ulrich.weig...@de.ibm.com
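A rough sketch of what such a __far pointer and its dereference could look like if open-coded today (the struct layout and names are invented for illustration; __pgm is the flash space from this thread, and the integer-to-pointer casts rely on GCC's implementation-defined, value-preserving conversion discussed later in the thread):

    struct far_ptr
    {
      unsigned int addr;          /* low 16 bits: the address           */
      unsigned char in_flash;     /* high byte: Flash (1) or RAM (0)    */
    };

    static char
    far_deref (struct far_ptr p)
    {
      if (p.in_flash)
        return *(const char __pgm *) p.addr;   /* LPM through Z */
      return *(const char *) p.addr;           /* LD via X/Y/Z  */
    }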
Re: [named address] ice-on-valid: in postreload.c:reload_cse_simplify_operands
Hi, On Fri, 5 Aug 2011, Ulrich Weigand wrote: > > However, there are situations like the following where you like to take > > the decision at runtime: > > > > char cast_3 (char in_pgm, void * p) > > { > > return in_pgm ? (*((char __pgm *) p)) : (*((char *) p)); > > } > > That's really an abuse of "void *" ... if you have an address in the > Flash space, you should never assign it to a "void *". > > Instead, if you just have a address and you don't know ahead of time > whether it refers to Flash or RAM space, you ought to hold that number > in an "int" (or "short" or whatever integer type is most appropriate), > and then convert from that integer type to either a "char *" or a > "char __pgm *". That would leave standard C. You aren't allowed to construct pointers out of random integers. I'd rather choose to abuse "void*" to be able to point into a yet-unspecified address spaces, which becomes specified once the void* pointer is converted into a non-void pointer (which it must be because you can't dereference a void pointer, hence it does no harm to leave its address space unspecified). That would point to a third address space, call it "undef" :) It would be superset of default and pgm, conversions between undef to {default,pgm} are allowed freely (and value preserving, i.e. trivial). Conversion into undef could be rejected. If they are allowed too, then also conversions between default and pgm are possible (via an intermediate step over undef), at which point the whole excercise seems a bit pointless and one could just as well allow conversions between default and pgm. Ciao, Michael.
Re: [named address] ice-on-valid: in postreload.c:reload_cse_simplify_operands
Michael Matz wrote: > On Fri, 5 Aug 2011, Ulrich Weigand wrote: > > Instead, if you just have a address and you don't know ahead of time > > whether it refers to Flash or RAM space, you ought to hold that number > > in an "int" (or "short" or whatever integer type is most appropriate), > > and then convert from that integer type to either a "char *" or a > > "char __pgm *". > > That would leave standard C. You aren't allowed to construct pointers out > of random integers. C leaves integer-to-pointer conversion *implementation-defined*, not undefined, and GCC has always chosen to implement this by (usually) keeping the value unchanged: http://gcc.gnu.org/onlinedocs/gcc-4.6.1/gcc/Arrays-and-pointers-implementation.html This works both for default and non-default address spaces. Of course, my suggested implementation would therefore rely on implementation-defined behaviour (but by simply using the __pgm address space, it does so anyway). > That would point to a third address space, call it "undef" :) It would be > superset of default and pgm, conversions between undef to {default,pgm} > are allowed freely (and value preserving, i.e. trivial). That would probably violate the named AS specification, since two different entities in the undef space would share the same pointer value ... Bye, Ulrich -- Dr. Ulrich Weigand GNU Toolchain for Linux on System z and Cell BE ulrich.weig...@de.ibm.com
Re: FDO and LTO on ARM
Am Fri 05 Aug 2011 09:32:05 AM CEST schrieb Richard Guenther : On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka wrote: Did you try using FDO with -Os? FDO should make hot code parts optimized similar to -O3 but leave other pieces optimized for size. Using FDO with -O3 gives you the opposite, cold portions optimized for size while the rest is optimized for speed. FDO with -Os still optimize for size, even in hot parts. I don't think so. Or at least that would be a bug. Shouldn't 'hot' BBs/functions be optimized for speed even at -Os? Hm, I see predict.c indeed returns always false for optimize_size :( It was outcome of discussion held some time ago. I think it was Mark promoting point that users opitmize for size when they use -Os period. I thought we had just the neither cold or hot parts optimized according to optimize_size. I originally wanted to have attribute HOT to overwrite -Os, so the well annotateed sources (i.e. kernel) could compile with -Os by default and explicitely declare the hot parts hot and get them compiled appropriately. With profile feedback however the current logic is binary - i.e. blocks are either hot since their count is bigger than the threshold or cold. We don't really have "I don't really know" state there. In some cases it would make sense - i.e. there are optimizations that we want to do only in the hottest parts of code, but we don't have any logic for that. My plan is to extend ipa-profile to do better hot/cold partitioning first: at the moment we decide on fixed fraction of maximal count in the program. This is unnecesarily conservative for programs with not terribly flat profiles. At IPA level we could collect histogram of counts of instructions (i.e. figure out how much time we spend on instructions executed N times) and then figure out where is the threshold so 99% of executed instructions belongs to hot region. This should give noticeably smaller binaries. I thought we had just the neither cold or hot parts optimized according to optimize_size. So to get resonale speedups you need -O3+FDO. -O3+FDO effectively defaults to -Os in cold portions of program. Well, but unless your training coverage is 100% all parts with no coverage get optimized with -O3 instead of -Os. And I bet coverage for mozilla isn't even close to 100%. Thus I think recommending -O3 for FDO is usually a bad idea. Code with no coverage is cold in our model (as is code executed once or so) and thus optimized for -Os even at -O3+FDO. This is bit aggressive on optimizing for size side. We might consider changing this policy, but so far I didn't see any complains on this... Honza So - did you try FDO with -O2? ;) Still -Os+FDO should be somewhat faster than -Os alone, so a slowdown is bug. It is not very thoroughly since it is not really used in practice. Also do you get any warnings on profile mismatches? Perhaps something is wrong to the degree that the relevant part of profile gets misapplied. I don't get any warning on profile mismatches. I only get a "few" missing gcda files warning, but that's expected. Perhaps you could compile one of less trivial files you are sure that are covered by train run and send me -fdump-tree-all-blocks -fdump-ipa-all dumps of the compilation so I can double check the profile seems sane. This could be good start to rule out something stupid. Honza Cheers, Mike
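The histogram idea can be illustrated with a small stand-alone function (purely a toy model, not GCC code; all names are made up):

    #include <stdint.h>

    /* One bucket per execution count: 'insns' static instructions were
       each executed 'count' times in the train run; buckets are sorted
       by descending count.  */
    struct bucket { uint64_t count; uint64_t insns; };

    /* Return the smallest execution count whose "at or above" region
       still covers 99% of all dynamically executed instructions;
       counts at or above the result would be considered hot.  */
    static uint64_t
    hot_count_cutoff (const struct bucket *hist, int n)
    {
      uint64_t total = 0, covered = 0;
      for (int i = 0; i < n; i++)
        total += hist[i].count * hist[i].insns;
      for (int i = 0; i < n; i++)
        {
          covered += hist[i].count * hist[i].insns;
          if (covered * 100 >= total * 99)
            return hist[i].count;
        }
      return 0;
    }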
gcc-python-plugin finds its first bug in itself
gcc-python-plugin [1] now provides a gcc-with-cpychecker harness that runs gcc with an additional pass that checks CPython API calls (internally, it's using the gcc python plugin to run a python script that does the work).

I tried rebuilding the plugin using
  make CC=../other-build/gcc-with-cpychecker
and it found a genuine bug in itself: within this code:

  350  PyObject *
  351  gcc_Pass_get_by_name(PyObject *cls, PyObject *args, PyObject *kwargs)
  352  {
  353      const char *name;
  354      char *keywords[] = {"name",
  355                          NULL};
  356      struct opt_pass *result;
  357
  358      if (!PyArg_ParseTupleAndKeywords(args, kwargs,
  359                                       "s|get_by_name", keywords,
  360                                       &name)) {
  361          return NULL;
  362      }
  363  [...snip...]

it found this problem:

  gcc-python-pass.c: In function ‘gcc_Pass_get_by_name’:
  gcc-python-pass.c:358:37: error: unknown format char in "s|get_by_name": 'g' [-fpermissive]

It turned out that I'd typo-ed the format code: I was erroneously using "|" (signifying that optional args follow), when I meant to use ":" (signifying that the rest of the string is the name of the function, for use in error messages) [2].

Fixed in git; there are a few false positives, which I'm working on fixing now.

I'm in two minds about whether this (minor) milestone is one I should mention in public, but I guess it's proof that having a static checker for this kind of mistake is worthwhile :)

Dave

[1] https://fedorahosted.org/gcc-python-plugin/
[2] fwiw, the API that it's checking is here: http://docs.python.org/c-api/arg.html
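For anyone hitting the same message: the fix is simply to change that one character, so that everything after the colon is treated as the function name used in error reports (corrected version of the call quoted above):

    if (!PyArg_ParseTupleAndKeywords(args, kwargs,
                                     "s:get_by_name", keywords,
                                     &name)) {
        return NULL;
    }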
Re: FDO and LTO on ARM
On Fri, Aug 5, 2011 at 7:40 AM, Jan Hubicka wrote: > Am Fri 05 Aug 2011 09:32:05 AM CEST schrieb Richard Guenther > : > >> On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka wrote: > > Did you try using FDO with -Os? FDO should make hot code parts > optimized similar to -O3 but leave other pieces optimized for size. > Using FDO with -O3 gives you the opposite, cold portions optimized > for size while the rest is optimized for speed. >>> >>> FDO with -Os still optimize for size, even in hot parts. >> >> I don't think so. Or at least that would be a bug. Shouldn't 'hot' >> BBs/functions >> be optimized for speed even at -Os? Hm, I see predict.c indeed returns >> always false for optimize_size :( > > It was outcome of discussion held some time ago. I think it was Mark > promoting point that users opitmize for size when they use -Os period. > > I thought we had just the neither cold or hot parts optimized according > to optimize_size. I originally wanted to have attribute HOT to overwrite > -Os, so the well annotateed sources (i.e. kernel) could compile with -Os by > default and explicitely declare the hot parts hot and get them compiled > appropriately. > > With profile feedback however the current logic is binary - i.e. blocks are > either hot since their count is bigger than the threshold or cold. We don't > really have "I don't really know" state there. In some cases it would make > sense - i.e. there are optimizations that we want to do only in the hottest > parts of code, but we don't have any logic for that. For profile summary at function/cgraph_node level, there are three states: hot, unlikely, and normal. At BB/EDGE level, there are three states too, but implementation turns it into 2 states (by querying only 'maybe_hot_bb'): hot and not hot --- instead of 'hot', 'not hot nor cold', and 'cold'. David > > My plan is to extend ipa-profile to do better hot/cold partitioning first: > at the moment we decide on fixed fraction of maximal count in the program. > This is unnecesarily conservative for programs with not terribly flat > profiles. At IPA level we could collect histogram of counts of instructions > (i.e. figure out how much time we spend on instructions executed N times) > and then figure out where is the threshold so 99% of executed instructions > belongs to hot region. This should give noticeably smaller binaries. >> >> I thought we had just the neither cold or hot parts optimized according >> to optimize_size. > > >> >>> So to get resonale >>> speedups you need -O3+FDO. -O3+FDO effectively defaults to -Os in cold >>> portions of program. >> >> Well, but unless your training coverage is 100% all parts with no coverage >> get optimized with -O3 instead of -Os. And I bet coverage for mozilla >> isn't even close to 100%. Thus I think recommending -O3 for FDO is >> usually a bad idea. > > Code with no coverage is cold in our model (as is code executed once or so) > and thus optimized for -Os even at -O3+FDO. This is bit aggressive on > optimizing for size side. We might consider changing this policy, but so far > I didn't see any complains on this... > > Honza >> >> So - did you try FDO with -O2? ;) >> >>> Still -Os+FDO should be somewhat faster than -Os alone, so a slowdown is >>> bug. It is not very thoroughly since it is not really used in practice. >>> > Also do you get any warnings on profile mismatches? Perhaps something > is wrong to the degree that the relevant part of profile gets > misapplied. I don't get any warning on profile mismatches. 
I only get a "few" missing gcda files warning, but that's expected. >>> >>> Perhaps you could compile one of less trivial files you are sure that are >>> covered by train run and send me -fdump-tree-all-blocks -fdump-ipa-all >>> dumps >>> of the compilation so I can double check the profile seems sane. This >>> could >>> be good start to rule out something stupid. >>> >>> Honza Cheers, Mike >>> >>> >>> >> > > >
Re: FDO and LTO on ARM
On Fri, Aug 5, 2011 at 12:32 AM, Richard Guenther wrote: > On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka wrote: Did you try using FDO with -Os? FDO should make hot code parts optimized similar to -O3 but leave other pieces optimized for size. Using FDO with -O3 gives you the opposite, cold portions optimized for size while the rest is optimized for speed. >> >> FDO with -Os still optimize for size, even in hot parts. > > I don't think so. Or at least that would be a bug. Shouldn't 'hot' > BBs/functions > be optimized for speed even at -Os? Hm, I see predict.c indeed returns > always false for optimize_size :( That is function level query. At the BB/EDGE level, the condition is refined: The BB (or instruction expansion) will be optimized for size if the bb is not 'hot'. This logic here is probably not ideal. It means that without specifying -Os, only the hot BBs are optimized for speed --> the 'righter' way is 'without -Os, only cold BBs are optimize for size' -- i.e., the lukewarm bbs are also optimize for speed. This will match the function level logic. David > > I thought we had just the neither cold or hot parts optimized according > to optimize_size. > >> So to get resonale >> speedups you need -O3+FDO. -O3+FDO effectively defaults to -Os in cold >> portions of program. > > Well, but unless your training coverage is 100% all parts with no coverage > get optimized with -O3 instead of -Os. And I bet coverage for mozilla > isn't even close to 100%. Thus I think recommending -O3 for FDO is > usually a bad idea. > > So - did you try FDO with -O2? ;) > >> Still -Os+FDO should be somewhat faster than -Os alone, so a slowdown is >> bug. It is not very thoroughly since it is not really used in practice. >> Also do you get any warnings on profile mismatches? Perhaps something is wrong to the degree that the relevant part of profile gets misapplied. >>> >>> I don't get any warning on profile mismatches. I only get a "few" >>> missing gcda files warning, but that's expected. >> >> Perhaps you could compile one of less trivial files you are sure that are >> covered by train run and send me -fdump-tree-all-blocks -fdump-ipa-all dumps >> of the compilation so I can double check the profile seems sane. This could >> be good start to rule out something stupid. >> >> Honza >>> >>> Cheers, >>> >>> Mike >>> >> >> >> >
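For readers following along in the sources, the BB-level refinement described here composes roughly as follows (paraphrased from the 4.6-era predict.c; check the tree for the exact code):

    /* A block is size-optimized if the whole function is, or if the
       block is not known to be hot; speed is simply the complement.  */
    bool
    optimize_bb_for_size_p (const_basic_block bb)
    {
      return optimize_function_for_size_p (cfun) || !maybe_hot_bb_p (bb);
    }

    bool
    optimize_bb_for_speed_p (const_basic_block bb)
    {
      return !optimize_bb_for_size_p (bb);
    }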
[RFC PATCH, i386]: Allow zero_extended addresses (+ problems with reload and offsetable address, "o" constraint)
Hello! Attached patch introduces generation of addr32 prefixed addresses, mainly intended to merge ZERO_EXTRACTed LEA calculations into address. After fixing various inconsistencies with "o" constraints, the patch works surprisingly well (in its current form fixes all reported problems in the PR [1]), but one problem remains w.r.t. handling of "o" constraint. Patched gcc ICEs on gcc.dg/torture/pr47744-2.c with: $ ~/gcc-build-fast/gcc/cc1 -O2 -mx32 -std=gnu99 -quiet pr47744-2.c pr47744-2.c: In function ‘matmul_i16’: pr47744-2.c:40:1: error: insn does not satisfy its constraints: (insn 116 66 67 4 (set (reg:TI 0 ax) (mem:TI (zero_extend:DI (plus:SI (reg:SI 4 si [orig:114 ivtmp.26 ] [114]) (reg:SI 5 di [orig:101 dest_y ] [101]))) [6 MEM[base: dest_y_18, index: ivtmp.26_53, offset: 0B]+0 S16 A128])) pr47744-2.c:34 60 {*movti_internal_rex64} (nil)) pr47744-2.c:40:1: internal compiler error: in reload_cse_simplify_operands, at postreload.c:403 Please submit a full bug report, ... ... due to the fact that the address is not offsetable, and plus ((zero_extend (...)) (const_int ...)) gets rejected from ix86_legitimate_address_p. However, the section "16.8.1 Simple Constraints" of the documentation claims: --quote-- * A nonoffsettable memory reference can be reloaded by copying the address into a register. So if the constraint uses the letter `o', all memory references are taken care of. --/quote-- As I read this sentence, the RTX is forced into a temporary register, and reload tries to satisfy "o" constraint with plus ((reg ...) (const_int ...)), as said at the introduction of "o" constraint a couple of pages earlier. Unfortunately, this does not seem to be the case. Is there anything wrong with my approach, or is there something wrong in reload? 2011-08-05 Uros Bizjak PR target/49781 * config/i386/i386.c (ix86_decompose_address): Allow zero-extended SImode addresses. (ix86_print_operand_address): Handle zero-extended addresses. (memory_address_length): Add length of addr32 prefix for zero-extended addresses. * config/i386/predicates.md (lea_address_operand): Reject zero-extended operands. Patch is otherwise bootstrapped and tested on x86_64-pc-linux-gnu {,-m32} without regressions. [1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49781 Thanks, Uros. Index: config/i386/predicates.md === --- config/i386/predicates.md (revision 177456) +++ config/i386/predicates.md (working copy) @@ -801,6 +801,10 @@ struct ix86_address parts; int ok; + /* LEA handles zero-extend by itself. */ + if (GET_CODE (op) == ZERO_EXTEND) +return false; + ok = ix86_decompose_address (op, &parts); gcc_assert (ok); return parts.seg == SEG_DEFAULT; Index: config/i386/i386.c === --- config/i386/i386.c (revision 177456) +++ config/i386/i386.c (working copy) @@ -11146,6 +11146,14 @@ ix86_decompose_address (rtx addr, struct ix86_addr int retval = 1; enum ix86_address_seg seg = SEG_DEFAULT; + /* Allow zero-extended SImode addresses, + they will be emitted with addr32 prefix. */ + if (TARGET_64BIT + && GET_CODE (addr) == ZERO_EXTEND + && GET_MODE (addr) == DImode + && GET_MODE (XEXP (addr, 0)) == SImode) +addr = XEXP (addr, 0); + if (REG_P (addr)) base = addr; else if (GET_CODE (addr) == SUBREG) @@ -14163,9 +14171,13 @@ ix86_print_operand_address (FILE *file, rtx addr) } else { - /* Print DImode registers on 64bit targets to avoid addr32 prefixes. */ - int code = TARGET_64BIT ? 'q' : 0; + int code = 0; + /* Print SImode registers for zero-extended addresses to force +addr32 prefix. Otherwise print DImode registers to avoid it. 
*/ + if (TARGET_64BIT) + code = (GET_CODE (addr) == ZERO_EXTEND) ? 'l' : 'q'; + if (ASSEMBLER_DIALECT == ASM_ATT) { if (disp) @@ -21776,7 +21788,8 @@ assign_386_stack_local (enum machine_mode mode, en } /* Calculate the length of the memory address in the instruction - encoding. Does not include the one-byte modrm, opcode, or prefix. */ + encoding. Includes addr32 prefix, does not include the one-byte modrm, + opcode, or other prefixes. */ int memory_address_length (rtx addr) @@ -21803,8 +21816,10 @@ memory_address_length (rtx addr) base = parts.base; index = parts.index; disp = parts.disp; - len = 0; + /* Add length of addr32 prefix. */ + len = (GET_CODE (addr) == ZERO_EXTEND); + /* Rule of thumb: - esp as the base always wants an index, - ebp as the base always wants a displacement,
Re: FDO and LTO on ARM
Am Fri 05 Aug 2011 07:49:49 PM CEST schrieb Xinliang David Li :
> On Fri, Aug 5, 2011 at 12:32 AM, Richard Guenther wrote:
>> On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka wrote:
>>>> Did you try using FDO with -Os?  FDO should make hot code parts
>>>> optimized similar to -O3 but leave other pieces optimized for size.
>>>> Using FDO with -O3 gives you the opposite, cold portions optimized
>>>> for size while the rest is optimized for speed.
>>>
>>> FDO with -Os still optimizes for size, even in hot parts.
>>
>> I don't think so.  Or at least that would be a bug.  Shouldn't 'hot'
>> BBs/functions be optimized for speed even at -Os?  Hm, I see predict.c
>> indeed always returns false for optimize_size :(
>
> That is a function-level query.  At the BB/EDGE level, the condition is
> refined:

Well, we summarize the function profile as:
 1) hot
 2) normal
 3) executed once
 4) unlikely

We summarize the BB profile as:
 1) maybe_hot
 2) probably_cold (equivalent to !maybe_hot)
 3) probably_never_executed

Except for "executed once", which is a special thing for functions fed by discovery of main() and static ctors/dtors, there is a 1-1 correspondence between the BB and function predicates.  With profile feedback a function is hot if it contains a BB that is maybe_hot (with feedback it is also probably hot), it is normal if it contains a BB that is !probably_never_executed, and unlikely if all BBs are probably_never_executed.  So with profile feedback the function profile summaries are no more refined than the BB ones.

Without profile feedback things are more messy, and the names of the BB settings were more or less invented based on what a static profile estimate can tell you.  Lacking a function-level profile estimate, we generally consider functions "normal" unless told otherwise in a few special cases.  We also never autodetect probably_never_executed even though it would make a lot of sense to do so for EH/paths to exit.  As I mentioned, I think we should start doing so.

Finally, optimize_size comes into the game; it is independent of the summaries above, and it is why I added the optimize_XXX_for_size/speed predicates.  By default -Os implies optimizing everything for size, and -O123 optimizes for speed everything that is maybe_hot (i.e. not quite reliably proven otherwise).

In a way I like the current scheme since it is simple, and extending it should IMO have some good reason.  We could refine the -Os behaviour without changing the current predicates to optimize for speed in
 a) functions declared as "hot" by the user, and BBs in them that are not proved cold;
 b) based on profile feedback - i.e. we could have two thresholds; BBs with very large counts will be probably hot, BBs in between will be maybe hot/normal, and BBs with low counts will be cold.
This would probably motivate the introduction of a probably_hot predicate that summarizes the above.

If we want to refine things, we could also re-consider how we want to treat BBs with 0 coverage.  I.e. whether we want to
 a) consider them "normal" and let the presence of -Os/-O123 decide whether they are size/speed optimized,
 b) consider them "cold" since they are not executed at all,
 c) consider them "cold" in functions that are otherwise covered by the test run and "normal" in case the function is not covered at all (i.e. training the X server on a particular set of hardware may not convince GCC to optimize for size all the other drivers not covered by the train run).

We currently implement b) and it sort of works well, since users usually train for what matters for them and are happy to see binaries smaller.

What I don't like about a) & c) is a bit of inconsistency with small counts.  I.e. count 1 will imply optimizing for size, but a roundoff error to 0 will cause it to be optimized for speed, which is weird.  Of course, also flipping the default here would cause significant growth of FDO binaries, and users are already unhappy that FDO binaries are too large.

Honza
Re: [named address] ice-on-valid: in postreload.c:reload_cse_simplify_operands
Was this reproducible for m32c also? I can test it if so...
Re: [named address] ice-on-valid: in postreload.c:reload_cse_simplify_operands
DJ Delorie wrote: > Was this reproducible for m32c also? I can test it if so... The patch simply passes the destination address space through to MODE_CODE_BASE_REG_CLASS and REGNO_MODE_CODE_OK_FOR_BASE_P, to allow targets to make register allocation decisions based on address space. As long as m32c doesn't implement those, just applying the patch wouldn't change anything. But if that capability *would* be helpful on your target, it would certainly be good if you could try it out ... Bye, Ulrich -- Dr. Ulrich Weigand GNU Toolchain for Linux on System z and Cell BE ulrich.weig...@de.ibm.com
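As an illustration of what a back end could do with that information, something along the following lines becomes possible once the macro takes an address-space argument (the parameter list here is hypothetical -- see the actual patch for the real signature -- and the AVR register classes are used only as an example, since flash reads go through Z only):

    /* Restrict base registers for __pgm (flash) accesses to Z, since LPM
       cannot use X or Y; plain RAM accesses keep the usual base classes.  */
    #define MODE_CODE_BASE_REG_CLASS(MODE, AS, OUTER_CODE, INDEX_CODE) \
      ((AS) == ADDR_SPACE_PGM ? POINTER_Z_REGS : BASE_POINTER_REGS)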
Re: [named address] ice-on-valid: in postreload.c:reload_cse_simplify_operands
Nope, I don't use those :-)
Re: FDO and LTO on ARM
> > In a way I like the current scheme since it is simple and extending it > should IMO have some good reason. We could refine -Os behaviour without > changing current predicates to optimize for speed in > a) functions declared as "hot" by user and BBs in them that are not proved > cold. > b) based on profile feedback - i.e. we could have two thresholds, BBs with > very arge counts wil be probably hot, BBs in between will be maybe > hot/normal and BBs with low counts will be cold. > This would probably motivate introduction of probably_hot predicate that > summarize the above. Introducing a new 'probably_hot' will be very confusing -- unless you also rename 'maybe_hot', but this leads to finer grained control: very_hot, hot, normal, cold, unlikely which can be hard to use. The three state partition (not counting exec_once) seems ok, but 1) the unlikely state does not have controllable parameter 2) hot_bb_count_fraction parameter which is used to determine maybe_hotness is shared for all FDO related passes. It is much more flexible (in terms of tuning) to allow each pass (such as inlining) to define its own thresholds. > > If we want to refine things, we could also re-consider how we want to behave > to BBs with 0 coverage. I.e. if we want to > a) consider them "normal" and let the presence of -Os/-O123 to decide > whether they are size/speed optimized, > b) consider them "cold" since they are not executed at all, > c) consider them "cold" in functions that are otherwise covered by the test > run and "normal" in case the function is not covered at all (i.e. training X > server on particular set of hardware may not convince GCC to optimize for > size all the other drivers not covered by the train run). > > We currently implement B and it sort of work well since users usually train > for what matters for them and are happy to see binaries smaller. Yes -- we assume user will do his best to find representative training data to avoid bad optimizations, so b) should be fine. David > > What I don't like about the a&c is bit of inconsistency with small counts. > I.e. count 1 will imply optimizing for size, but roundoff error to 0 will > cause it to be optimized for speed that is weird. > Of course also flipping the default here would cause significant grown of > FDO binaries and users are already unhappy that FDO binaries are too large. > > Honza > >
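The parameter in point 2) is hot-bb-count-fraction; the count test behind maybe_hot is roughly the following (paraphrased from memory of the 4.6-era predict.c, so treat the details as approximate):

    /* A count is "maybe hot" when it exceeds a fixed fraction of the
       largest count seen in the whole program -- the same threshold
       for every pass that asks.  */
    static inline bool
    maybe_hot_count_p (gcov_type count)
    {
      if (profile_status != PROFILE_READ)
        return true;
      return count > profile_info->sum_max
                       / PARAM_VALUE (HOT_BB_COUNT_FRACTION);
    }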
The Linux binutils 2.21.53.0.2 is released
This is the beta release of binutils 2.21.53.0.2 for Linux, which is based on binutils 2011 0804 in CVS on sourceware.org plus various changes. It is purely for Linux. All relevant patches in patches have been applied to the source tree. You can take a look at patches/README to see what have been applied and in what order they have been applied. Starting from the 2.21.51.0.3 release, you must remove .ctors/.dtors section sentinels when building glibc or other C run-time libraries. Otherwise, you will run into: http://sourceware.org/bugzilla/show_bug.cgi?id=12343 Starting from the 2.21.51.0.2 release, BFD linker has the working LTO plugin support. It can be used with GCC 4.5 and above. For GCC 4.5, you need to configure GCC with --enable-gold to enable LTO plugin support. Starting from the 2.21.51.0.2 release, binutils fully supports compressed debug sections. However, compressed debug section isn't turned on by default in assembler. I am planning to turn it on for x86 assembler in the future release, which may lead to the Linux kernel bug messages like WARNING: lib/ts_kmp.o (.zdebug_aranges): unexpected non-allocatable section. But the resulting kernel works fine. Starting from the 2.20.51.0.4 release, no diffs against the previous release will be provided. You can enable both gold and bfd ld with --enable-gold=both. Gold will be installed as ld.gold and bfd ld will be installed as ld.bfd. By default, ld.bfd will be installed as ld. You can use the configure option, --enable-gold=both/gold to choose gold as the default linker, ld. IA-32 binary and X64_64 binary tar balls are configured with --enable-gold=both/ld --enable-plugins --enable-threads. Starting from the 2.18.50.0.4 release, the x86 assembler no longer accepts fnstsw %eax fnstsw stores 16bit into %ax and the upper 16bit of %eax is unchanged. Please use fnstsw %ax Starting from the 2.17.50.0.4 release, the default output section LMA (load memory address) has changed for allocatable sections from being equal to VMA (virtual memory address), to keeping the difference between LMA and VMA the same as the previous output section in the same region. For .data.init_task : { *(.data.init_task) } LMA of .data.init_task section is equal to its VMA with the old linker. With the new linker, it depends on the previous output section. You can use .data.init_task : AT (ADDR(.data.init_task)) { *(.data.init_task) } to ensure that LMA of .data.init_task section is always equal to its VMA. The linker script in the older 2.6 x86-64 kernel depends on the old behavior. You can add AT (ADDR(section)) to force LMA of .data.init_task section equal to its VMA. It will work with both old and new linkers. The x86-64 kernel linker script in kernel 2.6.13 and above is OK. The new x86_64 assembler no longer accepts monitor %eax,%ecx,%edx You should use monitor %rax,%ecx,%edx or monitor which works with both old and new x86_64 assemblers. They should generate the same opcode. The new i386/x86_64 assemblers no longer accept instructions for moving between a segment register and a 32bit memory location, i.e., movl (%eax),%ds movl %ds,(%eax) To generate instructions for moving between a segment register and a 16bit memory location without the 16bit operand size prefix, 0x66, mov (%eax),%ds mov %ds,(%eax) should be used. It will work with both new and old assemblers. The assembler starting from 2.16.90.0.1 will also support movw (%eax),%ds movw %ds,(%eax) without the 0x66 prefix. 
Patches for 2.4 and 2.6 Linux kernels are available at http://www.kernel.org/pub/linux/devel/binutils/linux-2.4-seg-4.patch http://www.kernel.org/pub/linux/devel/binutils/linux-2.6-seg-5.patch The ia64 assembler is now defaulted to tune for Itanium 2 processors. To build a kernel for Itanium 1 processors, you will need to add ifeq ($(CONFIG_ITANIUM),y) CFLAGS += -Wa,-mtune=itanium1 AFLAGS += -Wa,-mtune=itanium1 endif to arch/ia64/Makefile in your kernel source tree. Please report any bugs related to binutils 2.21.53.0.2 to hjl.to...@gmail.com and http://www.sourceware.org/bugzilla/ Changes from binutils 2.21.53.0.1: 1. Update from binutils 2011 0804. 2. Add Intel K1OM support. 3. Allow R_X86_64_64 relocation for x32 and check x32 relocation overflow. PR ld/13048. 4. Support direct call in x86-64 assembly code. PR gas/13046. 5. Add ia32 Google Native Client support. 6. Add .debug_macro section support. 7. Improve gold. 8. Improve VMS support. 9. Improve arm support. 10. Improve hppa support. 11. Improve mips support. 12. Improve mmix support. 13. Improve ppc support. Changes from binutils 2.21.52.0.2: 1. Update from binutils 2011 0716. 2. Fix LTO linker bugs. PRs 12982/12942. 3. Fix rorx support in x86 assembler/disassembler for AVX Programming Reference (June, 2011). 4. Fix an x86-64 ELFOSABI linker regression. 5. Update ELFOSABI_GNU support. PR 12913.
Re: FDO and LTO on ARM
> > > > In a way I like the current scheme since it is simple and extending it > > should IMO have some good reason. We could refine -Os behaviour without > > changing current predicates to optimize for speed in > > a) functions declared as "hot" by user and BBs in them that are not proved > > cold. > > b) based on profile feedback - i.e. we could have two thresholds, BBs with > > very arge counts wil be probably hot, BBs in between will be maybe > > hot/normal and BBs with low counts will be cold. > > This would probably motivate introduction of probably_hot predicate that > > summarize the above. > > Introducing a new 'probably_hot' will be very confusing -- unless you > also rename 'maybe_hot', but this leads to finer grained control: > very_hot, hot, normal, cold, unlikely which can be hard to use. The > three state partition (not counting exec_once) seems ok, but OK, I also preffer to have fewer stages than more ;) > > 1) the unlikely state does not have controllable parameter Well, it is defined as something that is not likely to be executed, so the requirement on count to be less than 1/(number_of_test_runs*2) is very natural and don't seem to need to be tuned. > 2) hot_bb_count_fraction parameter which is used to determine > maybe_hotness is shared for all FDO related passes. It is much more > flexible (in terms of tuning) to allow each pass (such as inlining) to > define its own thresholds. Some people call towards fewer parameters, other towards more, it is always matter of some compromise. So before forking the notion of hotness for individual passes we would need to have some good reasoning on why this is very important. > > > > If we want to refine things, we could also re-consider how we want to behave > > to BBs with 0 coverage. I.e. if we want to > > a) consider them "normal" and let the presence of -Os/-O123 to decide > > whether they are size/speed optimized, > > b) consider them "cold" since they are not executed at all, > > c) consider them "cold" in functions that are otherwise covered by the test > > run and "normal" in case the function is not covered at all (i.e. training X > > server on particular set of hardware may not convince GCC to optimize for > > size all the other drivers not covered by the train run). > > > > We currently implement B and it sort of work well since users usually train > > for what matters for them and are happy to see binaries smaller. > > Yes -- we assume user will do his best to find representative training > data to avoid bad optimizations, so b) should be fine. I also think so, one notable exception are however the hardware drivers where it is inherently hard to test all possible combinations in common use. However I guess one should avoid FDO compiling those for this reason. Honza
gcc-4.6-20110805 is now available
Snapshot gcc-4.6-20110805 is now available on ftp://gcc.gnu.org/pub/gcc/snapshots/4.6-20110805/ and on various mirrors, see http://gcc.gnu.org/mirrors.html for details. This snapshot has been generated from the GCC 4.6 SVN branch with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_6-branch revision 177489 You'll find: gcc-4.6-20110805.tar.bz2 Complete GCC MD5=7b55daa94de9a1269d5fe5ea3bacff2f SHA1=c373614567b284dab7efb8b3d1b3ebcba4774b8d Diffs from 4.6-20110729 are available in the diffs/ subdirectory. When a particular snapshot is ready for public consumption the LATEST-4.6 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
Building C++ with --enable-languages=c,fortran
Hello world,

I just noticed that C++ now appears to be built by default, even when only C and Fortran are specified.  The configure line

  ../trunk/configure --prefix=$HOME --enable-languages=c,fortran --with-mpc=/usr/local --with-mpfr=/usr/local

leads to the output

  checking for version 0.11 (revision 0 or later) of PPL... no
  The following languages will be built: c,c++,fortran,lto

I see recent changes by Ian in this area, but nothing in the ChangeLog suggests to me that this was intentional.

Any ideas?
Re: Building C++ with --enable-languages=c,fortran
On Sat, Aug 06, 2011 at 12:52:02AM +0200, Thomas Koenig wrote: > Hello world, > > I just noticed that C++ now appears to be built by default, even when > only the C and fortran are specified. The configure line > > > ../trunk/configure --prefix=$HOME --enable-languages=c,fortran > --with-mpc=/usr/local --with-mpfr=/usr/local > > leads to the message > > checking for version 0.11 (revision 0 or later) of PPL... no > > The following languages will be built: c,c++,fortran,lto > > I see recent changes by Ian in this area, but nothing in the ChangeLog > suggests to me that this was intentional. > > Any ideas? It appears the original thread starts here http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01304.html -- Steve
Please replace/augment buildstat entry
Please replace or augment the alphaev68-dec-osf5.1a Test results: 4.4.6, in http://gcc.gnu.org/gcc-4.4/buildstat.html from http://gcc.gnu.org/ml/gcc-testresults/2011-05/msg00074.html to http://gcc.gnu.org/ml/gcc-testresults/2011-08/msg00586.html The reason is explained at the top of the summary: "This replaces my previous entry on 4.4.6 three months ago ...differs only from that earlier submission by the libgomp section ..."
Re: Building C++ with --enable-languages=c,fortran
Thomas Koenig writes:
> I just noticed that C++ now appears to be built by default, even when
> only C and Fortran are specified.  The configure line
>
>   ../trunk/configure --prefix=$HOME --enable-languages=c,fortran --with-mpc=/usr/local --with-mpfr=/usr/local
>
> leads to the output
>
>   checking for version 0.11 (revision 0 or later) of PPL... no
>   The following languages will be built: c,c++,fortran,lto
>
> I see recent changes by Ian in this area, but nothing in the ChangeLog
> suggests to me that this was intentional.

It is intentional.  In current mainline, stages 2 and 3 are now by default built with the C++ compiler, not the C compiler.  Therefore, the C++ compiler must be built in stages 1 and 2, so that it can be used to build the stage 2 and 3 compilers.  And then of course we build the C++ compiler in stage 3 in order to compare it.

The ChangeLog entry says that if --enable-build-poststage1-with-cxx is set, C++ becomes a boot language.  That is what you are seeing.  I guess what the ChangeLog entry does not say is that --enable-build-poststage1-with-cxx is set by default.

Ian
Re: Building C++ with --enable-languages=c,fortran
On Fri, Aug 05, 2011 at 06:51:12PM -0700, Ian Lance Taylor wrote: > Thomas Koenig writes: > > > I just noticed that C++ now appears to be built by default, even when > > only the C and fortran are specified. The configure line > > > > > > ../trunk/configure --prefix=$HOME --enable-languages=c,fortran > > --with-mpc=/usr/local --with-mpfr=/usr/local > > > > leads to the message > > > > checking for version 0.11 (revision 0 or later) of PPL... no > > > > The following languages will be built: c,c++,fortran,lto > > > > I see recent changes by Ian in this area, but nothing in the ChangeLog > > suggests to me that this was intentional. > > It is intentional. In current mainline stages 2 and 3 are now by > default built with the C++ compiler, not the C compiler. Therefore, the > C++ compiler must be built in stages 1 and 2, in order to use to build > the stages 2 and 3 compiler. And then of course we build the C++ > compiler in stage 3 in order to compare it. > > The ChangeLog entry says that if --enable-build-poststage1-with-cxx is > set, C++ becomes a boot language. That is what you are seeing. I guess > what the ChangeLog entry does not say is that > --enable-build-poststage1-with-cxx is set by default. > What are the additional resource requirements? Some of us have old hardware and limited $. -- Steve
Re: Building C++ with --enable-languages=c,fortran
Steve Kargl writes: >> The ChangeLog entry says that if --enable-build-poststage1-with-cxx is >> set, C++ becomes a boot language. That is what you are seeing. I guess >> what the ChangeLog entry does not say is that >> --enable-build-poststage1-with-cxx is set by default. >> > > What are the additional resource requirements? Some of > us have old hardware and limited $. The main additional resource requirement is building libstdc++ in stage 1 (and stages 2 and 3 if you were previously not building the C++ compiler at all). The C++ compiler proper is fairly small by comparison. At present you can use --disable-build-poststage1-with-cxx. However, in the future, I would like to change gcc to always build with C++. Yes, this will take more resources. Ian