Re: Regarding x86 'sete' instruction and its corresponding RTL

2014-04-04 Thread Eric Botcazou
> 
> RTL: (set (reg:QI 0 ax)
>  (eq:QI (reg:CCZ 17 flags) (const_int 0)))
> 
> Assembly: sete %al
> 
> 
> Semantics of sete instruction is (as per Intel manual):
>   if zero flag = 1, (reg:QI ax)  = 1
>   else (reg:QI ax) = 0
> 
> Whereas (I believe) the RTL semantics seem to say that:
>  - if zero flag = 0, (reg:QI ax) = 1
>    else (reg:QI ax) = 0
> 
> This is because 'eq' operator returns STORE_FLAG_VALUE when both
> operands of 'eq' are equal. Otherwise, it returns 0. This is exactly
> opposite of what assembly semantics is.

No, that's wrong, the semantics of the comparison operators applied to the CC 
register have nothing to do with STORE_FLAG_VALUE (see manual section 13.10).

 
Eric Botcazou


Re: WPA stream_out form & memory consumption

2014-04-04 Thread Martin Liška


On 04/03/2014 03:07 PM, Richard Biener wrote:

On Thu, Apr 3, 2014 at 2:07 PM, Martin Liška  wrote:

On 04/03/2014 11:41 AM, Richard Biener wrote:

On Wed, Apr 2, 2014 at 6:11 PM, Martin Liška  wrote:

On 04/02/2014 04:13 PM, Martin Liška wrote:


On 03/27/2014 10:48 AM, Martin Liška wrote:

The previous patch is wrong, I made a mistake in the name ;)

Martin

On 03/27/2014 09:52 AM, Martin Liška wrote:


On 03/25/2014 09:50 PM, Jan Hubicka wrote:

Hello,
  I've been compiling Chromium with LTO and I noticed that WPA
stream_out forks and streams in parallel:
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02621.html.

I am unable to fit in 16GB of memory: ld uses about 8GB and lto1 about
6GB. When WPA starts to fork, memory consumption increases so much that
lto1 is killed. I would appreciate a --param option to disable this
WPA fork. The number of forks is taken from the build system (-flto=9),
which is fine for the ltrans phase, because LD releases the aforementioned
8GB.

What do you think about that?

I can take a look - our measurements suggested that the WPA memory will
later be dominated by ltrans.  Perhaps Chromium does something that makes
WPA explode; that would be interesting to analyze.  I did not manage
to get through the Chromium LTO build process recently (ninja builds are not
my friends), can you send me the instructions?

Honza

Thanks,
Martin


Here are instructions for building Chromium with LTO:
1) install depot-tools and export PATH variable according to guide:
http://www.chromium.org/developers/how-tos/install-depot-tools
2) Checkout source code: gclient sync; cd src
3) Apply patch (enables system gold linker and disables LTO for a
sandbox that uses top-level asm)
4) which ld should point to ld.gold
5) ensure that ld.bfd points to ld.bfd
6) run: build/gyp_chromium -Dwerror=
7) ninja -C out/Release chrome -jX

If there are any problems, follow:
https://code.google.com/p/chromium/wiki/LinuxBuildInstructions

Martin


Hello,
Using the latest GCC trunk, I built Firefox and Chromium. Both projects
were compiled without debugging symbols and with -O2 on an 8-core machine.

Firefox:
-flto=9, peak memory usage (in LTRANS): 11GB

Chromium:
-flto=6, peak memory usage (in parallel WPA phase ): 16.5GB

For details, please see the attachment with graphs. The attachment also
contains the -fmem-report and -fmem-report-wpa output.
I think the reduced memory footprint of ~3.5GB stated here is a bit optimistic:
http://gcc.gnu.org/gcc-4.9/changes.html

Is there any way we can reduce the memory footprint?

Attachment (due to size restriction):

https://drive.google.com/file/d/0B0pisUJ80pO1bnV5V0RtWXJkaVU/edit?usp=sharing

Thank you,
Martin


The previous email presented somewhat misleading graphs (influenced by
--enable-gather-detailed-mem-stats).

Firefox:
-flto=9, WPA peak: 8GB, LTRANS peak: 8GB
-flto=4, WPA peak: 5GB, LTRANS peak: 3.5GB
-flto=1, WPA peak: 3.5GB, LTRANS peak: ~1GB

These data show that parallel WPA streaming increases the short-term memory
footprint by 4.5GB for -flto=9 (and by 1.5GB for -flto=4).
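
The increase is simply the difference between each WPA peak and the serial
(-flto=1) WPA peak. A small sketch of that arithmetic, using only the figures
listed above; the helper names are made up for illustration, and dividing by
the number of additional jobs assumes the overhead scales with the job count:

```cpp
#include <cassert>

// Extra WPA memory (in GB) relative to the serial -flto=1 run.
// Hypothetical helper; the inputs are the peaks reported in this mail.
constexpr double
wpa_overhead_gb (double wpa_peak_gb, double serial_peak_gb)
{
  return wpa_peak_gb - serial_peak_gb;
}

// The same overhead spread over the additional streaming jobs.
// Assumes (not measured) that each extra job costs about the same.
double
overhead_per_extra_job_gb (double wpa_peak_gb, double serial_peak_gb, int jobs)
{
  assert (jobs > 1);
  return wpa_overhead_gb (wpa_peak_gb, serial_peak_gb) / (jobs - 1);
}
```

With the numbers above this gives 4.5GB for -flto=9 and 1.5GB for -flto=4,
which is roughly 0.5GB per additional streaming job in both cases.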

For more details, please see the attachment.

The main overhead comes from maintaining the state during output of
the global types/decls.  We maintain somewhat "duplicate" info
here by having both the tree_ref_encoder and the streamer cache.
Eventually we can free the tree_ref_encoder pointer-map early, like with

Index: lto-streamer-out.c
===
--- lto-streamer-out.c  (revision 209018)
+++ lto-streamer-out.c  (working copy)
@@ -2423,10 +2455,18 @@ produce_asm_for_decls (void)

   gcc_assert (!alias_pairs);
 
-  /* Write the global symbols.  */
+  /* Get rid of the global decl state hash tables to save some memory.  */
   out_state = lto_get_out_decl_state ();
-  num_fns = lto_function_decl_states.length ();
+  for (int i = 0; i < LTO_N_DECL_STREAMS; i++)
+    if (out_state->streams[i].tree_hash_table)
+      {
+        delete out_state->streams[i].tree_hash_table;
+        out_state->streams[i].tree_hash_table = NULL;
+      }
+
+  /* Write the global symbols.  */
   lto_output_decl_state_streams (ob, out_state);
+  num_fns = lto_function_decl_states.length ();
   for (idx = 0; idx < num_fns; idx++)
     {
       fn_out_state =

as we do already for the fn state streams (untested).

we can also avoid re-allocating the output hashtable/vector by, after
(or in) create_output_block, allocate a bigger initial size for the
streamer_tree_cache.  Note that the pointer-set already expands if
the fill level is > 25%, and it really exponentially grows (similar to
hash_table, btw, but that grows only at 75% fill level).
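
To illustrate why pre-sizing helps, here is a toy model (not GCC's
pointer-set or hash_table code) that counts how often a doubling table has
to re-allocate while N entries are inserted, for a given maximum fill level.
The names growth_stats and simulate_growth, the doubling policy and the
initial sizes are assumptions made for this sketch:

```cpp
#include <cassert>
#include <cstddef>

struct growth_stats
{
  std::size_t capacity;    // final number of slots
  unsigned reallocations;  // how many times the table had to grow
};

// Insert N entries into a table that doubles its capacity whenever the
// fill level would exceed MAX_FILL (e.g. 0.25 or 0.75).
growth_stats
simulate_growth (std::size_t n, std::size_t initial_capacity, double max_fill)
{
  assert (initial_capacity > 0 && max_fill > 0.0 && max_fill <= 1.0);
  growth_stats s = { initial_capacity, 0 };
  for (std::size_t used = 1; used <= n; used++)
    while (used > s.capacity * max_fill)
      {
        s.capacity *= 2;
        s.reallocations++;
      }
  return s;
}
```

In this model, 1000 entries starting from 16 slots cost 8 re-allocations at a
25% fill limit and 7 at 75%, while starting from 4096 slots costs none at 25%.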

OTOH simply summing the lengths of all decl streams results in
a lower value than the actual number of output trees in the output block.
Humm.

But this is clearly the data structure that could be worth optimizing
in some way.  For example during writing we don't need the
streamer cache nodes array (we just need a counter to assign indexes).
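
A sketch of that idea, not the actual streamer_tree_cache: on the writer
side a map from node to index plus a running counter is enough, and no array
of nodes is kept.  write_cache and its members are hypothetical names, and
std::unordered_map merely stands in for GCC's own hash table:

```cpp
#include <unordered_map>

// Writer-side cache: remembers which index each node was given,
// but keeps no node array, only a counter for the next free index.
struct write_cache
{
  std::unordered_map<const void *, unsigned> index_of;
  unsigned next_index = 0;

  // Returns true if NODE was already in the cache.  *IX receives the
  // index assigned to NODE, whether it is new or was seen before.
  bool insert (const void *node, unsigned *ix)
  {
    auto res = index_of.emplace (node, next_index);
    if (res.second)
      next_index++;          // NODE is new, so consume one index
    *ix = res.first->second;
    return !res.second;
  }
};
```

The reader side would still need the index-to-node direction, so this only
applies to the output path.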

The attached is a patch that tries to d

Re: Regarding x86 'sete' instruction and its corresponding RTL

2014-04-04 Thread Niranjan Hasabnis
Hi Eric,

Thank you for your reply. I referred to section 13.10, and the description
there does not precisely specify the result of a comparison with the CC register.
Yes, you are right that, as per the description, a comparison with CC may not
have anything to do with STORE_FLAG_VALUE. But it clearly says that
when the comparison fails, the result is 0. And this seems to be the exact
opposite of the semantics of the 'sete' instruction. So the problem is still
not solved.
Am I misreading something? Please let me know.

On Fri, Apr 4, 2014 at 3:58 AM, Eric Botcazou  wrote:
>> 
>> RTL: (set (reg:QI 0 ax)
>>  (eq:QI (reg:CCZ 17 flags) (const_int 0)))
>>
>> Assembly: sete %al
>> 
>>
>> Semantics of sete instruction is (as per Intel manual):
>>   if zero flag = 1, (reg:QI ax)  = 1
>>   else (reg:QI ax) = 0
>>
>> Whereas (I believe) the RTL semantics seem to say that:
>>  - if zero flag = 0, (reg:QI ax) = 1
>>    else (reg:QI ax) = 0
>>
>> This is because 'eq' operator returns STORE_FLAG_VALUE when both
>> operands of 'eq' are equal. Otherwise, it returns 0. This is exactly
>> opposite of what assembly semantics is.
>
> No, that's wrong, the semantics of the comparison operators applied to the CC
> register have nothing to do with STORE_FLAG_VALUE (see manual section 13.10).
>
>
> Eric Botcazou



-- 
--
Regards,
Niranjan Hasabnis.


Re: Regarding x86 'sete' instruction and its corresponding RTL

2014-04-04 Thread Eric Botcazou
> Thank you for your reply. I referred to section 13.10, and the description
> there does not precisely specify the result of comparison with CC register.

Quoting section 13.10:

"There are two ways that comparison operations may be used.  The
comparison operators may be used to compare the condition codes `(cc0)'
against zero, as in `(eq (cc0) (const_int 0))'.  Such a construct
actually refers to the result of the preceding instruction in which the
condition codes were set.

[...]

 In the example above, if `(cc0)' were last set to `(compare X Y)', the
comparison operation is identical to `(eq X Y)'.  Usually only one style
of comparisons is supported on a particular machine, but the combine
pass will try to merge the operations to produce the `eq' shown in case
it exists in the context of the particular insn involved."
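
Applied to the original question, a small C++ model may help (an
illustration only, not GCC or hardware code; all function names are made up).
It assumes the flags were last set by (compare X Y), which on x86 is a cmp
that sets ZF when X - Y is zero.  Under that assumption
(eq (reg:CCZ flags) (const_int 0)) means X == Y, which is exactly the case
where sete stores 1:

```cpp
#include <cstdint>

// Model of the zero flag after comparing X with Y: ZF is set when
// X - Y is zero.
bool
zero_flag_after_cmp (std::uint32_t x, std::uint32_t y)
{
  return static_cast<std::uint32_t> (x - y) == 0;
}

// Model of sete: stores 1 if ZF is set, 0 otherwise.
std::uint8_t
sete (bool zf)
{
  return zf ? 1 : 0;
}

// Model of (eq:QI (reg:CCZ flags) (const_int 0)) when the flags hold
// (compare X Y): per the manual text quoted above this is (eq X Y).
std::uint8_t
rtl_eq_on_flags (std::uint32_t x, std::uint32_t y)
{
  return x == y ? 1 : 0;
}
```

In this model sete (zero_flag_after_cmp (x, y)) and rtl_eq_on_flags (x, y)
agree for every pair of inputs, so the RTL and the assembly do not conflict.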

-- 
Eric Botcazou