Re: [GSoC] How to get started with the isl code generation
Hi Tobias, I tried to incorporate all your comments in the following patch. It also contains traversing of ISL AST and its dump to a file. You can find out more about this at the following link http://romangareev.blogspot.ru/2014/05/gsoc-report-i.html -- Cheers, Roman Gareev patch Description: Binary data
RE: Reducing Register Pressure through Live range Shrinking through Loops!!
On Friday, May 23, 2014 1:46 AM Vladimir Makarov wrote: On 05/21/2014 12:25 AM, Ajit Kumar Agarwal wrote: > Hello All: > > Simpson does the Live range shrinking and reduction of register > pressure by using the computation that are not load and store but the > arithmetic computation. The computation where the operands and registers are > live at the entry and exit of the basic block but not touched inside the > block then the computation is moved at the end of the block the reducing the > register pressure inside the block by one. Extension of the Simpson work by > extending the computation not being touched inside the basic block to the > spanning of the Loops. If the Live ranges spans the Loops and live at the > entry and exit of the Loop but the computation is not being touched inside > the Loops then the computation is moved after the exit of the Loop. > > REDUCTION OF REGISTER PRESSURE THROUGH LIVE RANGE SHRINKING INSIDE THE > LOOPS > > for each Loop starting from inner to outer do the following begin > RELIEFIN(i) = null if i is the entry of the cfg. > Else > For all predecessors j RELIEFOUT(j) > RELIEFOUT(i) = RELIEFIN(i) exposed union relief > INSERT(I,j) = RELIEFOUT(i) RELIEFIN(i) Intersection > Live(i) > end > > The Simpson approach does takes the nesting depth into consideration > of placing the computation and the relieve of the register pressure. Simpson > approach doesn't takes into consideration the computation which spans > throughout the loop and the operands and results are live at the entry of the > Loop and exit of the Loop but not touched inside the Loops can be useful in > reduction of register pressure inside the Loops. This approach will be > useful in Region Based Register Allocator for Live Range Splitting at the > Region Boundaries. > > Extension of the Simpson approach is to consider the data flow > analysis with respect to the given Loop rather than having it for > entire control flow graph. This data flow analysis starts from the > inner loop and extends it to the outer loop. If the reference is not > through the nested depth or with some depth then the computation can > be placed accordingly. For register allocator by Graph coloring the > live ranges that are with respect to operands and results of the > computation are taken into consideration and for the above approach > put into the stack during simplification phase of Graph Coloring so > that there is a chance of getting such Live ranges colorable and thus > reduces the register pressure. This is extended to splitting approach > based on containment of Live ranges > > OPTIMAL PLACEMENT OF THE COMPUTATION FOR SINGLE ENTRY AND MULTIPLE > EXIT LOOPS > > The placement of the computation to reduce the register pressure for > Single Entry and Multiple exit by Simpson approach lead to unoptimal > solution. The unoptimal Solution is because of the exit node of the loop does > not post dominates all the basic block inside the Loops. Due to this the > placement of the computation just after the tail block of the Loop will lead > to incorrect results. In order to perform the Optimal Solution of the > placement of the computation, the computation needs to be placed the block > just after all the exit points of the Loop reconverge and which will post > dominates all the blocks of the Loops. This will take care of reducing the > register pressure for the Loops that are single Entry and Multiple Exit. For > irreducible Loops the optimization to convert to reducible is done before the > register allocation that reduces the register pressure and will be applicable > to structured control flow and thus reduces the register pressure. > > The Live range shrinkage reducing register pressure takes load and store into > consideration but not computation as proposed by Simpson. I am proposing to > extend in GCC for the computation to reduce register pressure and for the > Loop as given above for both Single Entry and Single Exit and Single Entry > and Multiple Exit Loops. > > >>Thanks for sharing with this. I have plans to work on a new register >>pressure relief pass too this summer (more accurately this work is in my >>company plans -- so I should work on this >>anyway). It will use Simpson's >>approach also. I prefer to do it as a separate pass not as a part of IRA >>because IRA is already complicated. It also permits to rematerialize not >>only on loop >>borders (although it is the most important points). >>It is hard for me to say what approach will be better as there are too many >>transformations even after IRA (e.g. in IRA you can think that pseudos got >>hard registers and rematerilize >>from that but LRA may spill this pseudos >>and it is more risk of this for x86/x86-64 than for other architectures as >>x86/x86-64 uses irregular reg file and has smaller number of regs). >>So we could implement two approaches and choose the best if you want. >>As for optimal placement o
Re: [GSoC] How to get started with the isl code generation
On 25/05/2014 13:12, Roman Gareev wrote: Hi Tobias, I tried to incorporate all your comments in the following patch. It also contains traversing of ISL AST and its dump to a file. You can find out more about this at the following link http://romangareev.blogspot.ru/2014/05/gsoc-report-i.html Hi Roman, thanks for this update (and expecially the nice blog post). Also the size of the patch is great. Small enough to be easy to review and still adding good features. diff --git a/gcc/common.opt b/gcc/common.opt index 5c3f834..005042d 100644 --- a/gcc/common.opt +++ b/gcc/common.opt @@ -1250,6 +1250,10 @@ fgraphite-identity Common Report Var(flag_graphite_identity) Optimization Enable Graphite Identity transformation +fgraphite-isl-code-gen +Common Report Var(flag_graphite_isl_code_gen) Optimization +Enable isl code generator in Graphite Can we use something like: -fgraphite-code-generator=[isl|cloog] This allows us to start with cloog as default and then flip the switch as soon as we are confident enough. diff --git a/gcc/graphite-clast-to-gimple.c b/gcc/graphite-clast-to-gimple.c index 9ac9b67..d16439d 100644 --- a/gcc/graphite-clast-to-gimple.c +++ b/gcc/graphite-clast-to-gimple.c @@ -28,6 +28,7 @@ along with GCC; see the file COPYING3. If not see #include #include #include +#include Can we put this into a new file? clast stands for CLOOG-AST. I would prefer to have a new file ala graphite-isl-ast-to-gimple.c or graphite-ast-generation.c #include #include #endif @@ -1657,6 +1658,221 @@ debug_generated_program (scop_p scop) print_generated_program (stderr, scop); } + +/* Integration of ISL code generator. */ + +/* TODO: eliminate this function as soon as possible */ + +void +print_indent (int indent, isl_printer *prn) +{ + int i; + for (i = 0; i < indent; i++) +{ + prn = isl_printer_print_str (prn, " "); +} +} Why is this function necessary in the first place? Are you aware of isl_printer_set_indent() and isl_printer_set_indent_prefix()? +static void +translate_isl_ast (int indent, __isl_keep isl_ast_node *node, + isl_printer *prn); + +/* Translates a isl_ast_node_for STMT to gimple. + TODO: replace output to FILE with translation + to GCC representation. */ + +static void +translate_isl_ast_node_for (int indent, __isl_keep isl_ast_node *node_for, + isl_printer *prn) +{ + isl_id *id; + const char *name; + const char *type; + type = isl_options_get_ast_iterator_type (isl_printer_get_ctx (prn)); + isl_ast_expr *for_iterator = isl_ast_node_for_get_iterator (node_for); + isl_ast_expr_free (for_iterator); + id = isl_ast_expr_get_id (for_iterator); + name = isl_id_get_name (id); + isl_id_free (id); + prn = isl_printer_print_str (prn, "\n"); + print_indent (indent, prn); + prn = isl_printer_print_str (prn, "for ("); + prn = isl_printer_print_str (prn, type); + prn = isl_printer_print_str (prn, " "); + prn = isl_printer_print_str (prn, name); + prn = isl_printer_print_str (prn, " = "); + isl_ast_expr *for_init = isl_ast_node_for_get_init (node_for); + prn = isl_printer_print_ast_expr (prn, for_init); + isl_ast_expr_free (for_init); + prn = isl_printer_print_str (prn, "; "); + isl_ast_expr *for_cond = isl_ast_node_for_get_cond (node_for); + prn = isl_printer_print_ast_expr (prn, for_cond); + isl_ast_expr_free (for_cond); + prn = isl_printer_print_str (prn, "; "); + prn = isl_printer_print_str (prn, name); + prn = isl_printer_print_str (prn, " += "); + isl_ast_expr *for_inc = isl_ast_node_for_get_inc (node_for); + prn = isl_printer_print_ast_expr (prn, for_inc); + isl_ast_expr_free (for_inc); + prn = isl_printer_print_str (prn, ")"); + isl_ast_node *for_body = isl_ast_node_for_get_body (node_for); + translate_isl_ast (indent + 2, for_body, prn); + isl_ast_node_free (for_body); +} + +/* Translates a clast guard statement STMT to gimple. + TODO: replace output to FILE with translation + to GCC representation. */ + +static void +translate_isl_ast_node_if (int indent, __isl_keep isl_ast_node *node_if, + isl_printer *prn) +{ + prn = isl_printer_print_str (prn, "\n"); + print_indent (indent, prn); + prn = isl_printer_print_str (prn, "if ("); + isl_ast_expr *if_cond = isl_ast_node_if_get_cond (node_if); + prn = isl_printer_print_ast_expr (prn, if_cond); + isl_ast_expr_free (if_cond); + prn = isl_printer_print_str (prn, ")"); + isl_ast_node *if_then = isl_ast_node_if_get_then (node_if); + translate_isl_ast (indent + 2, if_then, prn); + isl_ast_node_free (if_then); + if (isl_ast_node_if_has_else (node_if)) +{ + prn = isl_printer_print_str (prn, "else"); + isl_ast_node *if_else = isl_ast_node_if_get_else(node_if); + translate_isl_ast (indent + 2, if_else, prn); + isl_ast_node_free (if_else); +} +} + +/* Translates a clast guard statement STMT to gimple. + TODO: replace output to FILE with translation + to GCC representati
compiler output conflict
I ran across this puzzling difference between gcc and llvm today and think the specification to produce consistent output for this code should be worked out. http://stackoverflow.com/questions/15929795/llvm-and-gcc-different-output-same-code/23856132#23856132 I presented this in the Freenode #gcc chatroom too. John T http://www.mozilla.org Firefox browser, Thunderbird email, Seamonkey all-in-one, Sunbird calendar and more. Free open-source software for Windows, Linux, Mac OS and other systems
gcc-4.10-20140525 is now available
Snapshot gcc-4.10-20140525 is now available on ftp://gcc.gnu.org/pub/gcc/snapshots/4.10-20140525/ and on various mirrors, see http://gcc.gnu.org/mirrors.html for details. This snapshot has been generated from the GCC 4.10 SVN branch with the following options: svn://gcc.gnu.org/svn/gcc/trunk revision 210914 You'll find: gcc-4.10-20140525.tar.bz2Complete GCC MD5=ccf9e2bcd55ba1bee825b3c509a48d7d SHA1=6ae8bf8f0ef9d563aede862858cca0cc3c380c67 Diffs from 4.10-20140518 are available in the diffs/ subdirectory. When a particular snapshot is ready for public consumption the LATEST-4.10 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
RFA: [VAX] SUBREG of MEM with a mode dependent address
GCC 4.8 for VAX is generating a subreg:HI for mem:SI indexed address. This eventually gets caught by an assert in change_address_1. Since the MEM rtx is SI, legimate_address_p thinks it's fine. I have a change to vax.md which catches these but it's extremely ugly and I have to think there's a better way. But I have to wonder why is gcc even constructing a subreg of a mem with a mode dependent address. (gdb) call debug_rtx(insn) (insn 73 72 374 12 (set (reg/v:HI 0 %r0 [orig:29 iCol ] [29]) (subreg:HI (mem/c:SI (plus:SI (mult:SI (reg/v:SI 10 %r10 [orig:22 i ] [22]) (const_int 4 [0x4])) (reg/v/f:SI 11 %r11 [orig:101 aiCol ] [101])) [4 MEM[base: _154, offset: 0B]+0 S4 A32]) 0)) sqlite3.c:92031 13 {movhi_2} (nil)) Since this wasn't movstricthi, this could be rewritten to avoid the subreg and just treat %r0 as SI as in: (insn 73 72 374 12 (set (reg/v:SI 0 %r0 [orig:29 iCol ] [29]) (mem/c:SI (plus:SI (mult:SI (reg/v:SI 10 %r10 [orig:22 i ] [22]) (const_int 4 [0x4])) (reg/v/f:SI 11 %r11 [orig:101 aiCol ] [101]) [4 MEM[base: _154, offset: 0B]+0 S4 A32]) 0)) sqlite3.c:92031 13 {movsi_2} But even if movhi is a define_expand, as far as I can tell there's isn't enough info to know whether that is possible. At that time, how can I tell that operands[0] will be a hard reg or operands[1] will be subreg of a mode dependent memory access? I've tried using secondary_reload and it called called with (subreg:HI (reg:SI 113 [ MEM[base: _154, offset: 0B] ]) 0) but it dies in change_address_1 before invoking the code returned in sri. I've tracked this down to reload replacing (reg:SI 113) with reg_equiv_mem (133) in the rtx. However, it doesn't verify the rtx is actually valid. I added a gcc_assert to trap this and got: #1 0x0089ab87 in eliminate_regs_1 (x=0x7f7fe7b5c498, mem_mode=VOIDmode, insn=0x0, may_use_invariant=true, for_costs=true) at /u1/netbsd-HEAD/src/tools/gcc/../../external/gpl3/gcc/dist/gcc/reload1.c:2850(gdb) list 2845 && reg_equivs 2846 && reg_equiv_memory_loc (REGNO (SUBREG_REG (x))) != 0) 2847{ 2848 new_rtx = SUBREG_REG (x); 2849 rtx z = reg_equiv_memory_loc (REGNO (new_rtx)); 2850 gcc_assert (memory_address_addr_space_p (GET_MODE (x), 2851 XEXP (z, 0), 2852 MEM_ADDR_SPACE (z))); 2853} 2854 else (gdb) call debug_rtx(z) (mem:SI (plus:SI (mult:SI (reg/v:SI 22 [ i ]) (const_int 4 [0x4])) (reg/v/f:SI 101 [ aiCol ])) [4 MEM[base: _154, offset: 0B]+0 S4 A32]) (gdb) call debug_rtx(x) (subreg:HI (reg:SI 113 [ MEM[base: _154, offset: 0B] ]) 0) #2 0x0089cb31 in elimination_costs_in_insn (insn=0x7f7fe7b5bbd0) at /u1/netbsd-HEAD/src/tools/gcc/../../external/gpl3/gcc/dist/gcc/reload1.c:3751 (gdb) call debug_rtx (insn) (insn 73 72 374 12 (set (nil) (subreg:HI (reg:SI 113 [ MEM[base: _154, offset: 0B] ]) 0)) /u1/netbsd-HEAD/src/external/public-domain/sqlite/lib/../dist/sqlite3.c:92031 14 {movhi} (expr_list:REG_DEAD (reg:SI 113 [ MEM[base: _154, offset: 0B] ]) (nil))) And now I'm stymied. The limits of gcc-ness are now exceeded :) I'n looking for ideas on how to proceed. Thanks.
Re: negative latencies
On 23-May-14 01:59 PM, Bernd Schmidt wrote: On 05/23/2014 10:07 AM, shmeel gutl wrote: Exposed pipeline is not my problem. Negative latency is my problem. I don't see negative latency for c6x, not in unit reservations and not in adjust cost. Did I miss something? You just need to model it differently. Rather than saying instruction A has a negative latency relative to instruction B, you need to describe that instruction B reads its inputs later than when it is actually issued. The mechanism used in the C6X backend is the scheduler's record_delay_slot_pair function. The scheduler would see B is issued (*) A is issued, executes and writes its outputs B reads its inputs (*) The two insns marked as (*) would be such a delay pair. The first one would generate code, the second one exists only for the purposes of building the right scheduling dependencies. Bernd Okay, I think that I have the idea. But I would still need to backtrack if the enabling instruction is not issued on time. I would also need to delay the dependent instruction if I can see in advance that the producer cannot be issued on time. And, as Vladimir pointed out, I need to watch out for various passes inserting unwanted instructions. Sounds like a big project.
Re: negative latencies
On 23-May-14 05:20 PM, Vladimir Makarov wrote: On 2014-05-23, 3:49 AM, shmeel gutl wrote: On 21-May-14 06:30 PM, Vladimir Makarov wrote: I am just curious what happens when you put insn2, insn1. and insn2 uses a result of insn1 in 6 cycles and insn1 producing the result in 3 cycles, but there are not ready functional units (e.g. arithmentic units) necessary for insn1 for 4 or more cycles. It is quite not trivial to guarantee that everything will be okay in general case if you put insn2 before insn1. This is not a problem for this architecture. The units are fully pipelined and the only conflicts are in the first stage, during instruction issue. That is, the vliw must be legal. The gcc dfa handles this case fine. Another problem is that besides insn-scheduler there are a lot of optimizations which can insert some insns between the two insns after scheduling. In this case the result might be not ready for insn2. So you at least should exclude 1st insn scheduling (before RA) and make 2nd insn scheduling as the very last pass. In general, a traditional approach is to do such things on assembler level (e.g. as for older MIPS processor without hardware interlocks). There is also IRA. I would need to tweak the definition of live range.
Re: compiler output conflict
On Sun, May 25, 2014 at 03:55:38PM -0700, John wrote: > I ran across this puzzling difference between gcc and llvm today and think > the specification to produce consistent output for this code should be worked > out. > http://stackoverflow.com/questions/15929795/llvm-and-gcc-different-output-same-code/23856132#23856132 In what way you misunderstand ISO C99, 6.5.2.2/10: "The order of evaluation of the function designator, the actual arguments, and subexpressions within the actual arguments is unspecified, but there is a sequence point before the actual call." Both compilers give sensible output for the bad testcase. Please follow on (if at all needed) to gcc-help mailing list, this has nothing to do with GCC development. Jakub