[0/5] Improvements to vldN and vstN intrinsics
I've just submitted a merge request for the vldN and vstN intrinsic improvements. There are five related patches, so I thought it might be easier to review the merge if I posted the individual changes here.

See:

    http://www.mail-archive.com/linaro-toolchain@lists.linaro.org/msg00969.html

for an example of how this helps.

Richard

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
[1/5] Improve output of vld3q and vld4q
This first patch optimises the output for vld3q and vld4q functions. These functions expand into two individual vld3 and vld4 instructions, with each instruction setting one (interleaved) half of the output register. The problem was that both instructions treated the output register as an input, whereas only the second one needs to. We therefore treated the output register as being live before the vldNq and generated unnecessary spill code. E.g.:

    #include <arm_neon.h>

    void
    foo (uint32_t *a, uint32_t *b, uint32_t *c)
    {
      uint32x4x3_t x, y;

      x = vld3q_u32 (a);
      y = vld3q_u32 (b);
      x.val[0] = vaddq_u32 (x.val[0], y.val[0]);
      x.val[1] = vaddq_u32 (x.val[1], y.val[1]);
      x.val[2] = vaddq_u32 (x.val[2], y.val[2]);
      vst3q_u32 (a, x);
    }

gave:

    stmfd   sp!, {r3, fp}
    ldr     r2, .L2
    add     fp, sp, #4
    vldmia  r2, {d16-d21}
    sub     sp, sp, #112
    vmov    q11, q8  @ ti
    vmov    q12, q9  @ ti
    vmov    q13, q10  @ ti
    ...

where the vldmia is loading the x and y "inputs" to the two vld3q_u32s from the corresponding stack slots.

The patch is a backport of:

    http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01634.html

which has been applied to 4.7. No changes were needed for 4.5.

Richard

gcc/
    Backport from mainline:

    2011-03-30  Richard Sandiford
                Ramana Radhakrishnan

    PR target/43590
    * config/arm/neon.md (neon_vld3qa, neon_vld4qa): Remove
    operand 1 and reshuffle the operands to match.
    (neon_vld3, neon_vld4): Update accordingly.
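As a rough illustration of the point above, here is a plain-C model (not NEON code and not part of the patch) of a vld3q load built from two half loads. The type and function names are hypothetical. Each half reads six consecutive words and fills two lanes of each of the three vectors, and the second half starts 24 bytes after the first, matching the (const_int 24) in the pattern. The first half only writes its lanes, which is why it has no need to treat the destination as an input.

```c
#include <assert.h>
#include <stdint.h>

/* Plain-C stand-in for a group of three 4-lane vectors.  This is a
   hypothetical model type, not the real uint32x4x3_t.  */
typedef struct
{
  uint32_t val[3][4];
} model_u32x4x3;

/* Model of one vld3 half: read six consecutive words from SRC and write
   lanes 2*HALF and 2*HALF+1 of each of the three vectors.  The other
   lanes of OUT are not read and not written.  */
static void
model_vld3_half (model_u32x4x3 *out, const uint32_t *src, int half)
{
  assert (half == 0 || half == 1);
  for (int lane = 0; lane < 2; lane++)
    for (int vec = 0; vec < 3; vec++)
      out->val[vec][2 * half + lane] = src[3 * lane + vec];
}

/* Model of vld3q: two half loads.  The second one starts six words
   (24 bytes) after the first.  */
static model_u32x4x3
model_vld3q (const uint32_t *src)
{
  model_u32x4x3 out;

  model_vld3_half (&out, src, 0);      /* Sets lanes 0 and 1.  */
  model_vld3_half (&out, src + 6, 1);  /* Sets lanes 2 and 3.  */
  return out;
}
```

With the words 0..11 in memory, this model de-interleaves them so that val[0] holds 0, 3, 6, 9, val[1] holds 1, 4, 7, 10 and val[2] holds 2, 5, 8, 11.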
Index: gcc/config/arm/neon.md
===
--- gcc/config/arm/neon.md 2011-04-19 13:55:04.0 +
+++ gcc/config/arm/neon.md 2011-04-19 13:55:04.0 +
@@ -4925,8 +4925,7 @@ (define_expand "neon_vld3"
     (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
   "TARGET_NEON"
 {
-  emit_insn (gen_neon_vld3qa (operands[0], operands[0],
-                              operands[1], operands[1]));
+  emit_insn (gen_neon_vld3qa (operands[0], operands[1], operands[1]));
   emit_insn (gen_neon_vld3qb (operands[0], operands[0],
                               operands[1], operands[1]));
   DONE;
@@ -4934,12 +4933,11 @@ (define_expand "neon_vld3"
 
 (define_insn "neon_vld3qa"
   [(set (match_operand:CI 0 "s_register_operand" "=w")
-        (unspec:CI [(mem:CI (match_operand:SI 3 "s_register_operand" "2"))
-                    (match_operand:CI 1 "s_register_operand" "0")
+        (unspec:CI [(mem:CI (match_operand:SI 2 "s_register_operand" "1"))
                     (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VLD3A))
-   (set (match_operand:SI 2 "s_register_operand" "=r")
-        (plus:SI (match_dup 3)
+   (set (match_operand:SI 1 "s_register_operand" "=r")
+        (plus:SI (match_dup 2)
                  (const_int 24)))]
   "TARGET_NEON"
 {
@@ -4948,7 +4946,7 @@ (define_insn "neon_vld3qa"
   ops[0] = gen_rtx_REG (DImode, regno);
   ops[1] = gen_rtx_REG (DImode, regno + 4);
   ops[2] = gen_rtx_REG (DImode, regno + 8);
-  ops[3] = operands[2];
+  ops[3] = operands[1];
   output_asm_insn ("vld3.\t{%P0, %P1, %P2}, [%3]!", ops);
   return "";
 }
@@ -5217,8 +5215,7 @@ (define_expand "neon_vld4"
     (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
   "TARGET_NEON"
 {
-  emit_insn (gen_neon_vld4qa (operands[0], operands[0],
-                              operands[1], operands[1]));
+  emit_insn (gen_neon_vld4qa (operands[0], operands[1], operands[1]));
   emit_insn (gen_neon_vld4qb (operands[0], operands[0],
                               operands[1], operands[1]));
   DONE;
@@ -5226,12 +5223,11 @@ (define_expand "neon_vld4"
 
 (define_insn "neon_vld4qa"
   [(set (match_operand:XI 0 "s_register_operand" "=w")
-        (unspec:XI [(mem:XI (match_operand:SI 3 "s_register_operand" "2"))
-                    (match_operand:XI 1 "s_register_operand" "0")
+        (unspec:XI [(mem:XI (match_operand:SI 2 "s_register_operand" "1"))
                     (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VLD4A))
-   (set (match_operand:SI 2 "s_register_operand" "=r")
-        (plus:SI (match_dup 3)
+   (set (match_operand:SI 1 "s_register_operand" "=r")
+        (plus:SI (match_dup 2)
                  (const_int 32)))]
   "TARGET_NEON"
 {
@@ -5241,7 +5237,7 @@ (define_insn "neon_vld4qa"
   ops[1] = gen_rtx_REG (DImode, regno + 4);
   ops[2] = gen_rtx_REG (DImode, regno + 8);
   ops[3] = gen_rtx_REG (DImode, regno + 12);
-  ops[4] = operands[2];
+  ops[4] = operands[1];
   output_asm_insn ("vld4.\t{%P0, %P1, %P2, %P3}, [%4]!", ops);
   return "";
 }
[2/5] Remodel the vldN and vstN patterns
The patterns for the Neon vld and vst intrinsics used the following sort of construct to refer to memory:

    (mem:FOO (match_operand:SI X "register_operand" "r"))

This patch changes them to use:

    (match_operand:FOO' X "neon_struct_operand" "(=)Um")

instead. This allows the loads to use post-increment addresses as well as bare registers, and also matches the form that the vec_load_lanes and vec_store_lanes optabs need. (Those optabs will be in a later autovectorisation merge.)

The patch is a backport of:

    http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01996.html

which has been applied to 4.7. There are three differences in the 4.5 version:

* Our 4.5 code prints alignments as "[rN, :ALIGN]" rather than "[rN:ALIGN]". I've fixed that here. The initial commit to FSF trunk used the correct form, so there isn't a separate fix that could be backported.

* 4.5 doesn't have MEM_REF, so neon_dereference_pointer uses an INDIRECT_REF instead.

* 4.5 defines the mode attributes in neon.md rather than in a separate iterators.md.

Richard

gcc/
    Backport from mainline:

    2011-04-12  Richard Sandiford

    * config/arm/arm.c (arm_print_operand): Use MEM_SIZE to get the
    size of a '%A' memory reference.
    (T_DREG, T_QREG): New neon_builtin_type_bits.
    (arm_init_neon_builtins): Assert that the load and store operands
    are neon_struct_operands.
    (locate_neon_builtin_icode): Provide the neon_builtin_type_bits.
    (NEON_ARG_MEMORY): New builtin_arg.
    (neon_dereference_pointer): New function.
    (arm_expand_neon_args): Add a neon_builtin_type_bits argument.
    Handle NEON_ARG_MEMORY.
    (arm_expand_neon_builtin): Update after above interface changes.
    Use NEON_ARG_MEMORY for loads and stores.
    * config/arm/predicates.md (neon_struct_operand): New predicate.
    * config/arm/neon.md (V_two_elem): Tweak formatting.
    (V_three_elem): Use BLKmode for accesses that have no associated mode.
    (neon_vld1, neon_vld1_dup)
    (neon_vst1_lane, neon_vst1, neon_vld2)
    (neon_vld2_lane, neon_vld2_dup, neon_vst2)
    (neon_vst2_lane, neon_vld3, neon_vld3_lane)
    (neon_vld3_dup, neon_vst3, neon_vst3_lane)
    (neon_vld4, neon_vld4_lane, neon_vld4_dup)
    (neon_vst4): Replace pointer operand with a memory operand.
    Use %A in the output template.
    (neon_vld3qa, neon_vld3qb, neon_vst3qa)
    (neon_vst3qb, neon_vld4qa, neon_vld4qb)
    (neon_vst4qa, neon_vst4qb): Likewise, but halve the width of the
    memory access. Remove post-increment.
    * config/arm/neon-testgen.ml: Allow addresses to have an alignment.

gcc/testsuite/
    Backport from mainline:

    2011-04-12  Richard Sandiford

    * gcc.target/arm/neon-vld3-1.c: New test.
    * gcc.target/arm/neon-vst3-1.c: New test.
    * gcc.target/arm/neon/v*.c: Regenerate.

Index: gcc/config/arm/arm.c
===
--- gcc/config/arm/arm.c 2011-04-20 08:29:44.0 +
+++ gcc/config/arm/arm.c 2011-04-20 09:32:44.0 +
@@ -16847,7 +16847,7 @@ arm_print_operand (FILE *stream, rtx x,
       {
         rtx addr;
         bool postinc = FALSE;
-        unsigned align, modesize, align_bits;
+        unsigned align, memsize, align_bits;
 
         gcc_assert (GET_CODE (x) == MEM);
         addr = XEXP (x, 0);
@@ -16862,12 +16862,12 @@ arm_print_operand (FILE *stream, rtx x,
            instruction (for some alignments) as an aid to the memory subsystem
            of the target. */
         align = MEM_ALIGN (x) >> 3;
-        modesize = GET_MODE_SIZE (GET_MODE (x));
+        memsize = INTVAL (MEM_SIZE (x));
 
         /* Only certain alignment specifiers are supported by the hardware. */
-        if (modesize == 16 && (align % 32) == 0)
+        if (memsize == 16 && (align % 32) == 0)
           align_bits = 256;
-        else if ((modesize == 8 || modesize == 16) && (align % 16) == 0)
+        else if ((memsize == 8 || memsize == 16) && (align % 16) == 0)
           align_bits = 128;
         else if ((align % 8) == 0)
           align_bits = 64;
@@ -16875,7 +16875,7 @@ arm_print_operand (FILE *stream, rtx x,
           align_bits = 0;
 
         if (align_bits != 0)
-          asm_fprintf (stream, ", :%d", align_bits);
+          asm_fprintf (stream, ":%d", align_bits);
 
         asm_fprintf (stream, "]");
 
@@ -18398,12 +18398,14 @@ enum neon_builtin_type_bits {
   T_V2SI  = 0x0004,
   T_V2SF  = 0x0008,
   T_DI    = 0x0010,
+  T_DREG  = 0x001F,
   T_V16QI = 0x0020,
   T_V8HI  = 0x0040,
   T_V4SI  = 0x0080,
   T_V4SF  = 0x0100,
   T_V2DI  = 0x0200,
   T_TI    = 0x0400,
+  T_QREG  = 0x07E0,
   T_EI    = 0x0800,
   T_OI    = 0x1000
 };
@@ -19049,10 +19051,9 @@ arm_init_neon_builtins (void)
           if (is_load && k == 1)
             {
               /* Neon load patterns always have
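The alignment handling in the arm_print_operand hunk can be read more easily as a stand-alone function. The sketch below is only a model of the conditions visible in that hunk, written in portable C; the function names and the fixed "rN" register spelling are assumptions made for illustration, and post-increment addresses are left out. Sizes and alignments are in bytes, as in the hunk (where MEM_ALIGN is shifted right by 3).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Return the alignment hint in bits for an access of MEMSIZE bytes whose
   known alignment is ALIGN bytes, or 0 if no hint should be printed.
   The conditions are copied from the if-chain in the hunk above.  */
static unsigned
model_align_bits (unsigned memsize, unsigned align)
{
  if (memsize == 16 && (align % 32) == 0)
    return 256;
  if ((memsize == 8 || memsize == 16) && (align % 16) == 0)
    return 128;
  if ((align % 8) == 0)
    return 64;
  return 0;
}

/* Write the address operand into BUF using the corrected "[rN:ALIGN]"
   form (the old 4.5 code printed "[rN, :ALIGN]").  Returns the result
   of snprintf.  */
static int
model_print_address (char *buf, size_t len, unsigned regno,
                     unsigned memsize, unsigned align)
{
  unsigned bits = model_align_bits (memsize, align);

  assert (buf != NULL && len > 0);
  if (bits != 0)
    return snprintf (buf, len, "[r%u:%u]", regno, bits);
  return snprintf (buf, len, "[r%u]", regno);
}
```

For example, a 16-byte access known to be 16-byte aligned would print as "[r0:128]" in this model, while an access that is only 4-byte aligned gets no hint at all.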
[3/5] Allow arrays of vectors to be stored in registers
This patch allows the target to override MAX_FIXED_MODE_SIZE for specific kinds of array. We can then give a non-BLK mode to things like uint32x2x4_t, which in turn allows them to be stored in registers.

The patch is a backport of:

    http://gcc.gnu.org/ml/gcc-patches/2011-03/msg02192.html

which Richard Guenther approved in principle, but which can't be applied yet because of 4/5. The only difference in the 4.5 version is that 4.5 still uses the old target hook definition scheme, rather than 4.7's target.def.

Richard

gcc/
    * hooks.h (hook_bool_mode_uhwi_false): Declare.
    * hooks.c (hook_bool_mode_uhwi_false): New function.
    * doc/tm.texi (TARGET_ARRAY_MODE_SUPPORTED_P): Document.
    * target.h (array_mode_supported_p): New hook.
    * target-def.h (TARGET_ARRAY_MODE_SUPPORTED_P): Define if undefined.
    (TARGET_INITIALIZER): Include it.
    * stor-layout.c (mode_for_array): New function.
    (layout_type): Use it.
    * config/arm/arm.c (arm_array_mode_supported_p): New function.
    (TARGET_ARRAY_MODE_SUPPORTED_P): Define.

Index: gcc/hooks.h
===
--- gcc/hooks.h 2011-04-19 14:14:01.0 +
+++ gcc/hooks.h 2011-04-19 16:19:06.0 +
@@ -32,6 +32,8 @@ extern bool hook_bool_const_int_const_in
 extern bool hook_bool_mode_false (enum machine_mode);
 extern bool hook_bool_mode_const_rtx_false (enum machine_mode, const_rtx);
 extern bool hook_bool_mode_const_rtx_true (enum machine_mode, const_rtx);
+extern bool hook_bool_mode_uhwi_false (enum machine_mode,
+                                       unsigned HOST_WIDE_INT);
 extern bool hook_bool_tree_false (tree);
 extern bool hook_bool_const_tree_false (const_tree);
 extern bool hook_bool_tree_true (tree);
Index: gcc/hooks.c
===
--- gcc/hooks.c 2011-04-19 14:14:01.0 +
+++ gcc/hooks.c 2011-04-19 16:19:06.0 +
@@ -86,6 +86,15 @@ hook_bool_mode_const_rtx_true (enum mach
   return true;
 }
 
+/* Generic hook that takes (enum machine_mode, unsigned HOST_WIDE_INT)
+   and returns false. */
+bool
+hook_bool_mode_uhwi_false (enum machine_mode mode ATTRIBUTE_UNUSED,
+                           unsigned HOST_WIDE_INT value ATTRIBUTE_UNUSED)
+{
+  return false;
+}
+
 /* Generic hook that takes (FILE *, const char *) and does nothing. */
 void
 hook_void_FILEptr_constcharptr (FILE *a ATTRIBUTE_UNUSED, const char *b ATTRIBUTE_UNUSED)
Index: gcc/doc/tm.texi
===
--- gcc/doc/tm.texi 2011-04-19 14:14:01.0 +
+++ gcc/doc/tm.texi 2011-04-19 16:38:08.0 +
@@ -4367,6 +4367,34 @@ insns involving vector mode @var{mode}.
 must have move patterns for this mode.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_ARRAY_MODE_SUPPORTED_P (enum machine_mode @var{mode}, unsigned HOST_WIDE_INT @var{nelems})
+Return true if GCC should try to use a scalar mode to store an array
+of @var{nelems} elements, given that each element has mode @var{mode}.
+Returning true here overrides the usual @code{MAX_FIXED_MODE} limit
+and allows GCC to use any defined integer mode.
+
+One use of this hook is to support vector load and store operations
+that operate on several homogeneous vectors. For example, ARM Neon
+has operations like:
+
+@smallexample
+int8x8x3_t vld3_s8 (const int8_t *)
+@end smallexample
+
+where the return type is defined as:
+
+@smallexample
+typedef struct int8x8x3_t
+@{
+  int8x8_t val[3];
+@} int8x8x3_t;
+@end smallexample
+
+If this hook allows @code{val} to have a scalar mode, then
+@code{int8x8x3_t} can have the same mode. GCC can then store
+@code{int8x8x3_t}s in registers rather than forcing them onto the stack.
+@end deftypefn
+
 @node Scalar Return
 @subsection How Scalar Function Values Are Returned
 @cindex return values in registers
Index: gcc/target.h
===
--- gcc/target.h 2011-04-19 14:14:01.0 +
+++ gcc/target.h 2011-04-19 16:38:08.0 +
@@ -764,6 +764,9 @@ struct gcc_target
      for further details. */
   bool (* vector_mode_supported_p) (enum machine_mode mode);
 
+  /* See tm.texi. */
+  bool (* array_mode_supported_p) (enum machine_mode, unsigned HOST_WIDE_INT);
+
   /* Compute a (partial) cost for rtx X. Return true if the complete
      cost has been computed, and false if subexpressions should be
      scanned. In either case, *TOTAL contains the cost result. */
Index: gcc/target-def.h
===
--- gcc/target-def.h 2011-04-19 14:14:01.0 +
+++ gcc/target-def.h 2011-04-19 16:38:08.0 +
@@ -553,6 +553,10 @@ #define TARGET_FIXED_POINT_SUPPORTED_P d
 #define TARGET_VECTOR_MODE_SUPPORTED_P hook_bool_mode_false
 #endif
 
+#ifndef TARGET_ARRAY_MODE_SUPPORTED_P
+
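To make the effect of the new hook concrete, here is a much simplified model in portable C. It is not the code from stor-layout.c or arm.c, neither of which appears in the part of the patch quoted above. Modes are reduced to element sizes in bytes, the Neon-style policy (arrays of two to four 8-byte or 16-byte vectors) is an assumption made for the example, and every name prefixed with model_ is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for enum machine_mode: just the element size
   in bytes.  */
typedef unsigned model_mode_bytes;

typedef bool (*model_array_hook) (model_mode_bytes, unsigned long);

/* Behaves like hook_bool_mode_uhwi_false: never override the limit.  */
static bool
model_default_array_mode_supported_p (model_mode_bytes mode,
                                      unsigned long nelems)
{
  (void) mode;
  (void) nelems;
  return false;
}

/* Assumed Neon-style policy: allow arrays of 2..4 vectors that are
   8 or 16 bytes wide.  */
static bool
model_neon_array_mode_supported_p (model_mode_bytes mode,
                                   unsigned long nelems)
{
  return (mode == 8 || mode == 16) && nelems >= 2 && nelems <= 4;
}

/* Return the size in bytes of the scalar mode chosen for the array, or
   0 to stand for BLKmode.  The size limit applies unless the hook
   asks for it to be overridden.  */
static unsigned long
model_mode_for_array (model_array_hook hook, model_mode_bytes elem,
                      unsigned long nelems, unsigned long max_fixed_size)
{
  unsigned long size = (unsigned long) elem * nelems;

  assert (hook != NULL);
  if (size <= max_fixed_size || hook (elem, nelems))
    return size;
  return 0;
}
```

With an 8-byte limit, an array of three 8-byte vectors (the int8x8x3_t case from the documentation) stays BLKmode under the default hook, but gets a 24-byte scalar mode under the Neon-style policy, which is what lets it live in registers.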
[4/5] Convert LEGITIMATE_CONSTANT_P into a hook and add a mode argument
This patch converts LEGITIMATE_CONSTANT_P into a target hook and passes along the mode of the constant. This can then be used by 5/5.

The patch is a version of:

    http://gcc.gnu.org/ml/gcc-patches/2011-04/msg00195.html

which is still pending review after two pings. It seems pretty simple though, so I think it's worth backporting now rather than waiting for upstream approval.

The backport is very much a cut-down version. Rather than convert all targets to the new hook, I've kept LEGITIMATE_CONSTANT_P around and made it the default implementation of the new hook. Only ARM defines the hook directly.

Note that the ARM definition is supposed to be identical to the old LEGITIMATE_CONSTANT_P version. Only 5/5 is meant to change it.

Richard

gcc/
    * doc/tm.texi (LEGITIMATE_CONSTANT_P): Replace with...
    (TARGET_LEGITIMATE_CONSTANT_P): ...this.
    * target.h (gcc_target): Add legitimate_constant_p.
    * target-def.h (TARGET_LEGITIMATE_CONSTANT_P): Define.
    (TARGET_INITIALIZER): Include it.
    * calls.c (precompute_register_parameters): Replace uses of
    LEGITIMATE_CONSTANT_P with targetm.legitimate_constant_p.
    (emit_library_call_value_1): Likewise.
    * expr.c (move_block_to_reg, can_store_by_pieces, emit_move_insn)
    (compress_float_constant, emit_push_insn, expand_expr_real_1): Likewise.
    * recog.c (general_operand, immediate_operand): Likewise.
    * reload.c (find_reloads_toplev, find_reloads_address_part): Likewise.
    * reload1.c (init_eliminable_invariants): Likewise.
    * targhooks.h (default_legitimate_constant_p): Declare.
    * targhooks.c (default_legitimate_constant_p): New function.
    * config/arm/arm-protos.h (arm_cannot_force_const_mem): Delete.
    * config/arm/arm.h (ARM_LEGITIMATE_CONSTANT_P): Likewise.
    (THUMB_LEGITIMATE_CONSTANT_P, LEGITIMATE_CONSTANT_P): Likewise.
    * config/arm/arm.c (TARGET_LEGITIMATE_CONSTANT_P): Define.
    (arm_legitimate_constant_p_1, thumb_legitimate_constant_p)
    (arm_legitimate_constant_p): New functions.
    (arm_cannot_force_const_mem): Make static.

Index: gcc/doc/tm.texi
===
--- gcc/doc/tm.texi 2011-04-19 16:38:08.0 +
+++ gcc/doc/tm.texi 2011-04-19 16:38:15.0 +
@@ -2642,8 +2642,8 @@ instruction for loading an immediate val
 register, so @code{PREFERRED_RELOAD_CLASS} returns @code{NO_REGS} when
 @var{x} is a floating-point constant. If the constant can't be loaded
 into any kind of register, code generation will be better if
-@code{LEGITIMATE_CONSTANT_P} makes the constant illegitimate instead
-of using @code{PREFERRED_RELOAD_CLASS}.
+@code{TARGET_LEGITIMATE_CONSTANT_P} makes the constant illegitimate instead
+of using @code{TARGET_PREFERRED_RELOAD_CLASS}.
 
 If an insn has pseudos in it after register allocation, reload will go
 through the alternatives and call repeatedly @code{PREFERRED_RELOAD_CLASS}
@@ -5628,13 +5628,13 @@ addresses. Many RISC machines have no m
 You may assume that @var{addr} is a valid address for the machine.
 @end defmac
 
-@defmac LEGITIMATE_CONSTANT_P (@var{x})
-A C expression that is nonzero if @var{x} is a legitimate constant for
-an immediate operand on the target machine. You can assume that
-@var{x} satisfies @code{CONSTANT_P}, so you need not check this. In fact,
-@samp{1} is a suitable definition for this macro on machines where
-anything @code{CONSTANT_P} is valid.
-@end defmac
+@deftypefn {Target Hook} bool TARGET_LEGITIMATE_CONSTANT_P (enum machine_mode @var{mode}, rtx @var{x})
+This hook returns true if @var{x} is a legitimate constant for a
+@var{mode}-mode immediate operand on the target machine. You can assume that
+@var{x} satisfies @code{CONSTANT_P}, so you need not check this.
+
+The default definition returns true.
+@end deftypefn
 
 @deftypefn {Target Hook} rtx TARGET_DELEGITIMIZE_ADDRESS (rtx @var{x})
 This hook is used to undo the possibly obfuscating effects of the
Index: gcc/target.h
===
--- gcc/target.h 2011-04-19 16:38:08.0 +
+++ gcc/target.h 2011-04-19 16:38:16.0 +
@@ -645,7 +645,10 @@ struct gcc_target
   /* Return true if the target supports conditional execution. */
   bool (* have_conditional_execution) (void);
 
-  /* True if the constant X cannot be placed in the constant pool. */
+  /* See tm.texi. */
+  bool (* legitimate_constant_p) (enum machine_mode, rtx);
+
+  /* True if the constant X cannot be placed in the constant pool. */
   bool (* cannot_force_const_mem) (rtx);
 
   /* True if the insn X cannot be duplicated. */
Index: gcc/target-def.h
===
--- gcc/target-def.h 2011-04-19 16:38:08.0 +
+++ gcc/target-def.h 2011-04-19 16:38:16.0 +
@@ -563,6 +563,7 @@ #define TARGE
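The cut-down approach described above (keep the old macro, make it the default implementation of the new hook, and let one port define the hook directly) can be sketched in portable C as follows. This is only an illustration of the pattern: the enum, the rtx stand-in, the macro body and all model_ names are hypothetical and none of this is GCC code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for GCC's enum machine_mode and rtx.  */
enum model_mode { MODEL_SImode, MODEL_STRUCTmode };
typedef long model_rtx;

/* Legacy single-argument macro that an unconverted port still defines.
   The condition itself is arbitrary.  */
#define MODEL_LEGITIMATE_CONSTANT_P(X) ((X) >= 0)

/* Default hook: ignores MODE and defers to the legacy macro.  */
static bool
model_default_legitimate_constant_p (enum model_mode mode, model_rtx x)
{
  (void) mode;
  return MODEL_LEGITIMATE_CONSTANT_P (x);
}

/* A port that defines the hook directly can also look at MODE.  */
static bool
model_port_legitimate_constant_p (enum model_mode mode, model_rtx x)
{
  if (mode == MODEL_STRUCTmode)
    return false;
  return MODEL_LEGITIMATE_CONSTANT_P (x);
}

/* Stand-in for the target vector.  */
struct model_target
{
  bool (*legitimate_constant_p) (enum model_mode, model_rtx);
};

/* What a converted caller does instead of using the macro.  */
static bool
model_constant_ok (const struct model_target *target,
                   enum model_mode mode, model_rtx x)
{
  assert (target != NULL && target->legitimate_constant_p != NULL);
  return target->legitimate_constant_p (mode, x);
}
```

Callers only ever go through the function pointer, so ports that have not been converted keep their old behaviour, while a port with its own hook can reject constants based on the mode.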
[5/5] Fix PR target/46329
This patch handles moves involving structure constants. It's a backport of:

    http://gcc.gnu.org/ml/gcc-patches/2011-04/msg00200.html

which Richard Earnshaw has approved, but which cannot be applied yet because it depends on 4/5.

The patch is needed because 3/5 would otherwise expose new instances of the PR.

Richard

gcc/
    PR target/46329
    * config/arm/arm.c (arm_legitimate_constant_p_1): Return false for
    all Neon struct constants.

gcc/testsuite/
    From Richard Earnshaw

    PR target/46329
    * gcc.target/arm/pr46329.c: New test.

Index: gcc/config/arm/arm.c
===
--- gcc/config/arm/arm.c 2011-04-19 16:38:16.0 +
+++ gcc/config/arm/arm.c 2011-04-20 07:54:11.0 +
@@ -140,7 +140,7 @@ static void arm_internal_label (FILE *,
 static void arm_output_mi_thunk (FILE *, tree, HOST_WIDE_INT, HOST_WIDE_INT,
                                  tree);
 static bool arm_have_conditional_execution (void);
-static bool arm_cannot_force_const_mem (enum machine_mode, rtx);
+static bool arm_cannot_force_const_mem (rtx);
 static bool arm_legitimate_constant_p (enum machine_mode, rtx);
 static bool arm_rtx_costs_1 (rtx, enum rtx_code, int*, bool);
 static bool arm_size_rtx_costs (rtx, enum rtx_code, enum rtx_code, int *);
@@ -6465,8 +6465,14 @@ arm_tls_referenced_p (rtx x)
    When generating pic allow anything. */
 
 static bool
-arm_legitimate_constant_p_1 (enum machine_mode mode ATTRIBUTE_UNUSED, rtx x)
+arm_legitimate_constant_p_1 (enum machine_mode mode, rtx x)
 {
+  /* At present, we have no support for Neon structure constants, so forbid
+     them here. It might be possible to handle simple cases like 0 and -1
+     in future. */
+  if (TARGET_NEON && VALID_NEON_STRUCT_MODE (mode))
+    return false;
+
   return flag_pic || !label_mentioned_p (x);
 }
 
Index: gcc/testsuite/gcc.target/arm/pr46329.c
===
--- /dev/null 2010-10-05 15:55:33.0 +
+++ gcc/testsuite/gcc.target/arm/pr46329.c 2011-04-19 16:38:16.0 +
@@ -0,0 +1,9 @@
+/* { dg-options "-O2" } */
+/* { dg-add-options arm_neon } */
+
+int __attribute__ ((vector_size (32))) x;
+void
+foo (void)
+{
+  x <<= x;
+}
Re: Question on compressed vmlinux .got and .bss sections
On Tue, Apr 19, 2011 at 01:33:12PM -0400, Nicolas Pitre wrote:
> On Wed, 20 Apr 2011, Shawn Guo wrote:
>
> > On Tue, Apr 19, 2011 at 04:23:09PM +0100, Dave Martin wrote:
> > > Hopefully this explains what's going on, but what are you trying
> > > to achieve exactly?
> > >
> > Thanks a ton, Dave. It does explain what I'm seeing, and your
> > explanation looks like a very good learning material.
> >
> > I'm running into a problem with John Bonies' append-dtb-to-zImage
> > patch. That is the header of dtb was overwritten by uart_base
> > value. John's patch did fix up .bss entries in .got to move them
> > behind dtb image. But as you explained, when uart_base is defined
> > as static one, its address is fixed up in pc-relative way at link
> > time, and John's patch does not help it, hence the write to
> > uart_base at runtime overwrites dtb image.
> >
> > What do you think is the right fix to this problem? Forbid the use
> > of static uninitialized variable? I'm afraid not. Is it possible
> > to fix up the cases like uart_base here at runtime?
>
> You must not use static variable in the decompressor. For one thing,
> that breaks the ability to XIP the decompressor code and move writable
> data elsewhere.
>
> So the fix is indeed to _not_ declare any global variable as static in
> this case.

After some thinking about this, I think I agree.

Having to relocate a GOT-full of addresses many of which are actually at
fixed PC-relative offsets just for this capability is a bit annoying,
but the GNU tools don't support other models very well.

We might be able to reduce the size of the GOT by building with
-fvisibility=hidden, and making judicious use of "extern" on all
data declarations/definitions:

[gcc-4.4.info]
    `extern' declarations are not affected by `-fvisibility', so a lot
    of code can be recompiled with `-fvisibility=hidden' with no
    modifications. However, this means that calls to `extern'
    functions with no explicit visibility will use the PLT, so it is
    more effective to use `__attribute ((visibility))' and/or `#pragma
    GCC visibility' to tell the compiler which `extern' declarations
    should be treated as hidden.

This only seems to work reliably for data definitions; plus the
toolchain behaviour may "evolve" with respect to obscure features
like this. So if we wanted to achieve such a thing reliably, we'd
probably need explicit visibility attributes on the affected
declarations.

The advantage is unlikely to be huge though since the GOT is small anyway;
and we wouldn't be able to throw away the GOT relocation code completely,
because of the need to relocate bss references...

Cheers
---Dave
Re: Question on compressed vmlinux .got and .bss sections
On Wed, 20 Apr 2011, Dave Martin wrote:

> On Tue, Apr 19, 2011 at 01:33:12PM -0400, Nicolas Pitre wrote:
> > You must not use static variable in the decompressor. For one thing,
> > that breaks the ability to XIP the decompressor code and move writable
> > data elsewhere.
> >
> > So the fix is indeed to _not_ declare any global variable as static in
> > this case.
>
> After some thinking about this, I think I agree.
>
> Having to relocate a GOT-full of addresses many of which are actually at
> fixed PC-relative offsets just for this capability is a bit annoying,
> but the GNU tools don't support other models very well.

You cannot relocate PC-relative offsets at run time. Those references
are spread throughout the code into literal pools. Forcing all
references to go through the GOT makes it possible for the code to
relocate selected parts of itself at run time.

> We might be able to reduce the size of the GOT by building with
> -fvisibility=hidden, and making judicious use of "extern" on all
> data declarations/definitions:
>
> [gcc-4.4.info]
>     `extern' declarations are not affected by `-fvisibility', so a lot
>     of code can be recompiled with `-fvisibility=hidden' with no
>     modifications. However, this means that calls to `extern'
>     functions with no explicit visibility will use the PLT, so it is
>     more effective to use `__attribute ((visibility))' and/or `#pragma
>     GCC visibility' to tell the compiler which `extern' declarations
>     should be treated as hidden.
>
> This only seems to work reliably for data definitions; plus the
> toolchain behaviour may "evolve" with respect to obscure features
> like this.

That doesn't solve the problem at all. In this case, we really want
_all_ data references to go through the GOT, meaning that everything
would have to be marked extern. The only references which are OK to be
PC relative are read-only references, and therefore they can just be
marked as static const.

> So if we wanted to achieve such a thing reliably, we'd
> probably need explicit visibility attributes on the affected
> declarations.

Like I said, it's about all of them.

> The advantage is unlikely to be huge though since the GOT is small anyway;
> and we wouldn't be able to throw away the GOT relocation code completely,
> because of the need to relocate bss references...

In fact, all that remains in the GOT, assuming that const data is marked
static, are .bss references. Again, for simplicity's sake, we don't
support initialized and writable global variables as in the XIP case
those would have to be copied into RAM and the GOT patched accordingly.
In practice this is not hard to achieve. To ensure that, we simply
discard the .data early in the linker script.

Nicolas
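The rules discussed in this thread can be summarised in a small C fragment. This is only an illustrative sketch, not code from the kernel decompressor: uart_base is the variable named earlier in the thread, everything prefixed with model_ is hypothetical, and whether a given reference really goes through the GOT depends on how the file is compiled (position-independent code is assumed) and on the toolchain, which is not verified here.

```c
#include <assert.h>

/* Writable and uninitialised, so it lands in .bss.  It is deliberately
   NOT static: under the assumptions above, references to it go through
   the GOT and can be fixed up when .bss is moved at run time.  */
unsigned long uart_base;

/* Read-only data may be static const.  PC-relative references are fine
   here because the data moves together with the code.  */
static const char model_banner[] = "Uncompressing Linux...";

/* What the thread says to avoid in the decompressor:

     static unsigned long uart_base;   (writable static: the reference
                                        is fixed PC-relative at link
                                        time)
     int model_counter = 1;            (initialised and writable: would
                                        live in .data, which the linker
                                        script discards)  */

/* Record the UART base address in the GOT-addressed variable.  */
void
model_set_uart_base (unsigned long base)
{
  assert (base != 0);
  uart_base = base;
}

/* Return the read-only banner.  */
const char *
model_get_banner (void)
{
  return model_banner;
}
```

Keeping every writable global uninitialised and non-static is what lets the GOT fix-up code handle all of them uniformly.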
Re: Question on compressed vmlinux .got and .bss sections
Hi,

On Wed, Apr 20, 2011 at 1:42 PM, Nicolas Pitre wrote:
> On Wed, 20 Apr 2011, Dave Martin wrote:
>
>> On Tue, Apr 19, 2011 at 01:33:12PM -0400, Nicolas Pitre wrote:
>> > You must not use static variable in the decompressor. For one thing,
>> > that breaks the ability to XIP the decompressor code and move writable
>> > data elsewhere.
>> >
>> > So the fix is indeed to _not_ declare any global variable as static in
>> > this case.
>>
>> After some thinking about this, I think I agree.
>>
>> Having to relocate a GOT-full of addresses many of which are actually at
>> fixed PC-relative offsets just for this capability is a bit annoying,
>> but the GNU tools don't support other models very well.
>
> You cannot relocate PC-relative offsets at run time. Those references
> are spread throughout the code into literal pools. Forcing all
> references to go through the GOT makes it possible for the code to
> relocate selected parts of itself at run time.

My point was that relocatability implies overhead, and the GOT
potentially contains a load of relocations for code and read-only data
which will never get moved in practice.

For writable/uninitialised data, it's different of course -- we often
will need to relocate that in real situations (as observed here). I'd
guessed that only part of the GOT in the compressed loader was
addressing such data, but actually, it seems to be pretty much all of
it, as you suggest.

So the number of useless relocations, and any associated overhead,
looks low (if any).

>
>> We might be able to reduce the size of the GOT by building with
>> -fvisibility=hidden, and making judicious use of "extern" on all
>> data declarations/definitions:
>>
>> [gcc-4.4.info]
>>     `extern' declarations are not affected by `-fvisibility', so a lot
>>     of code can be recompiled with `-fvisibility=hidden' with no
>>     modifications. However, this means that calls to `extern'
>>     functions with no explicit visibility will use the PLT, so it is
>>     more effective to use `__attribute ((visibility))' and/or `#pragma
>>     GCC visibility' to tell the compiler which `extern' declarations
>>     should be treated as hidden.
>>
>> This only seems to work reliably for data definitions; plus the
>> toolchain behaviour may "evolve" with respect to obscure features
>> like this.
>
> That doesn't solve the problem at all. In this case, we really want
> _all_ data references to go through the GOT, meaning that everything
> would have to be marked extern. The only references which are OK to be
> PC relative are read-only references, and therefore they can just be
> marked as static const.
>
>> So if we wanted to achieve such a thing reliably, we'd
>> probably need explicit visibility attributes on the affected
>> declarations.
>
> Like I said, it's about all of them.
>
>> The advantage is unlikely to be huge though since the GOT is small anyway;
>> and we wouldn't be able to throw away the GOT relocation code completely,
>> because of the need to relocate bss references...
>
> In fact, all that remains in the GOT, assuming that const data is marked
> static, are .bss references. Again, for simplicity's sake, we don't
> support initialized and writable global variables as in the XIP case
> those would have to be copied into RAM and the GOT patched accordingly.
> In practice this is not hard to achieve. To ensure that, we simply
> discard the .data early in the linker script.

Sure -- my observations were simply based around the fact that we're
using the tools to do something they don't feel well adapted to,
compared with other tools with a more embedded/bare-metal focus. So
if there were a better or more correct way to use the tools to get the
results we need, it would be worth considering.

But from the discussion it sounds like the code already does pretty
much the best thing possible anyway.

Cheers
---Dave
Re: Question on compressed vmlinux .got and .bss sections
On Wed, 20 Apr 2011, Dave Martin wrote:

> Hi,
>
> On Wed, Apr 20, 2011 at 1:42 PM, Nicolas Pitre wrote:
> > On Wed, 20 Apr 2011, Dave Martin wrote:
> >
> >> On Tue, Apr 19, 2011 at 01:33:12PM -0400, Nicolas Pitre wrote:
> >> > You must not use static variable in the decompressor. For one thing,
> >> > that breaks the ability to XIP the decompressor code and move writable
> >> > data elsewhere.
> >> >
> >> > So the fix is indeed to _not_ declare any global variable as static in
> >> > this case.
> >>
> >> After some thinking about this, I think I agree.
> >>
> >> Having to relocate a GOT-full of addresses many of which are actually at
> >> fixed PC-relative offsets just for this capability is a bit annoying,
> >> but the GNU tools don't support other models very well.
> >
> > You cannot relocate PC-relative offsets at run time. Those references
> > are spread throughout the code into literal pools. Forcing all
> > references to go through the GOT makes it possible for the code to
> > relocate selected parts of itself at run time.
>
> My point was that relocatability implies overhead, and the GOT
> potentially contains a load of relocations for code and read-only data
> which will never get moved in practice.

Sure, for code (already implicit) or ro data, using GOTOFF relocs is
perfectly fine. As long as the relevant data is marked const then there
is no issue also marking it static, at which point the same effect as
-fvisibility=hidden is achieved i.e. no GOT entries are allocated.

> For writable/uninitialised data, it's different of course -- we often
> will need to relocate that in real situations (as observed here). I'd
> guessed that only part of the GOT in the compressed loader was
> addressing such data, but actually, it seems to be pretty much all of
> it, as you suggest.

Yes, and in practice it contains only between 6 and 8 entries depending
on the config used. And all of them are references to .bss variables.
So the overhead is pretty small.

Nicolas
[OT] ti-omap-tools - script for setting up DVSDK 3 with various toolchains
Here's a script for installing TI's DVSDK 3:

    https://bitbucket.org/thayne/ti-omap-tools/src

Works with

- CodeSourcery
- OpenEmbedded
- Linaro

It will download the bazillion dependencies scattered across TI's site and make it easier to gut the DVSDK's hard-coded paths to work for your setup.

The DVSDK 4 isn't used because it is completely different from the DVSDK 3 and is much more difficult to root the hard paths and checks out of.

AJ ONeal
Linaro GCC 4.5 and 4.6 2011-04 released
The Linaro Toolchain Working Group is pleased to announce the release of both Linaro GCC 4.5 and Linaro GCC 4.6.

Linaro GCC 4.5 2011.04 is the ninth release in the 4.5 series. Based off the latest GCC 4.5.2+svn171921, it adds new optimisations, support for Android, and fixes for many of the issues found in the last month.

Linaro GCC 4.6 2011.04 is the second release in the 4.6 series. Based off the latest GCC 4.6.0+svn171921, it is the first supported release of the new series and includes a significant number of mainstreamed patches from 4.5.

Interesting changes in 4.6 include:

* Updates to 4.6.0+r171921
* Adds conditional store sinking to the vectoriser
* Brings in a significant number of the Linaro GCC 4.5 patches that are in mainline

Interesting changes in 4.5 include:

* Updates to 4.5.2+r172013
* Disables the shrink wrap optimisation by default
* Adds support for swing-modulo scheduling (SMS) on ARM
* Adds support for Android and the Bionic C library
* Optimises -fvar-tracking, greatly reducing memory used when compiling large files (seen in QEMU)

Fixes:

* 'volatile' being ignored on volatile struct members
* A potential register clobber in arm_negdi2
* An error in libgcc that prevented it being built with -Os
* Multiple shrink wrap bugs (LP: #731665, 721023, 736081, 758082, 730860, 736439, 721023)
* LP: #730440 incorrect immediate for movt (seen in Firebird)
* LP: #728315 extension elimination pass mishandles subregs of promoted variables (seen on MIPS)
* LP: #675347 volatile int causes inline assembly build failure (seen in Qt)

SMS is an optimisation that works on innermost loops and reorders the instructions by overlapping different iterations. For example, the values for the next iteration may be loaded during the current iteration, so that they are already available when the next iteration starts.

SMS is disabled by default. To try it, add the options '-fmodulo-sched -fmodulo-sched-allow-regmoves'.

The source tarballs are available from:

    https://launchpad.net/gcc-linaro/+milestone/4.5-2011.04-0
    https://launchpad.net/gcc-linaro/+milestone/4.6-2011.04-0

Downloads are available from the Linaro GCC page on Launchpad:

    https://launchpad.net/gcc-linaro

-- Michael
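For anyone who wants to try the SMS options, the function below is the kind of innermost loop the description above has in mind: each iteration does two loads, a multiply and an accumulate, so the loads for a later iteration could in principle overlap with the arithmetic of the current one. It is only an illustrative sketch with a hypothetical name; whether the scheduler actually transforms this loop depends on the target and compiler version and has not been checked.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two arrays of N ints.  The loop body is a
   load -> multiply -> accumulate chain, repeated every iteration.  */
static long
model_dot (const int *a, const int *b, size_t n)
{
  long sum = 0;

  assert (n == 0 || (a != NULL && b != NULL));
  for (size_t i = 0; i < n; i++)
    sum += (long) a[i] * b[i];
  return sum;
}
```

The two flags come from the announcement; adding them to an ordinary optimised build (for example alongside -O2, which is an assumption here) and comparing the generated assembly with and without them is one way to see whether the loop was modulo scheduled.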
Linaro GDB 7.2 2011-04 released
The Linaro Toolchain Working Group is pleased to announce the release of Linaro GDB 7.2.

Linaro GDB 7.2 2011.04 is the fifth release in the 7.2 series. Based off the latest GDB 7.2, it includes a number of ARM-focused bug fixes.

This release fixes:

* LP: #684218 Failure to backtrace out of glibc system call stubs
* LP: #667309 failed to single step over bad thumb->arm boundary
* Fix accessing "fpscr" register

The source tarball is available at:

    https://launchpad.net/gdb-linaro/+milestone/7.2-2011.04-0

More information on Linaro GDB is available at:

    https://launchpad.net/gdb-linaro

-- Michael