[Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.

2016-06-07 Thread Sameera Deshpande
Hi Richard,

This is with reference to our discussion at GNU Tools Cauldron 2015 regarding 
my talk titled "Improving the effectiveness and generality of GCC 
auto-vectorization." Further to our prototype implementation of the concept, we 
have started implementing this concept in GCC.

We are following incremental model to add language support in our front-end, 
and corresponding back-end (for auto-vectorizer) will be added for feature 
completion.

Looking at the complexity and scale of the project, we have divided this 
project into subtasks listed below, for ease of implementation, testing and 
review.

0. Add new pass to perform autovectorization using unified representation - 
Current GCC framework does not give complete overview of the loop to be 
vectorized : it either breaks the loop across body, or across iterations. 
Because of which these data structures can not be reused for our approach which 
gathers all the information of loop body at one place using primitive permute 
operations. Hence, define new data structures and populate them.

1. Add support for vectorization of LOAD/STORE instructions
a. Create permute order tree for the loop with LOAD and STORE instructions 
for single or multi-dimensional arrays, aggregates within nested loops.
b. Basic transformation phase to generate vectorized code for the primitive 
reorder tree generated at stage 1a using tree tiling algorithm. This phase 
handles code generation for SCATTER, GATHER, stridded memory accesses etc. 
along with permute instruction generation.

2. Implementation of k-arity promotion/reduction : The permute nodes within 
primitive reorder tree generated from input program can have any arity. 
However, the target can support maximum of arity = 2 in most of the cases. 
Hence, we need to promote or reduce the arity of permute order tree to enable 
successful tree tiling.

3. Vector size reduction : Depending upon the vector size for target, reduce 
vector size per statement and adjust the loop count for vectorized loop 
accordingly.

4. Support simple arithmetic operations :
a. Add support for analyzing statements with simple arithmetic operations 
like +, -, *, / for vectorization, and create primitive reorder tree with 
compute_op.
b. Generate vector code for primitive reorder tree generated at stage 4a 
using tree tiling algorithm - here support for complex patterns like 
multiply-add should be checked and appropriate instruction to be generated.

5. Support reduction operation :
a. Add support for reduction operation analysis and primitive reorder tree 
generation. The reduction operation needs special handling, as the finish 
statement should COLLAPSE the temporary reduction vector TEMP_VAR into original 
reduction variable.
b. The code generation for primitive reorder tree does not need any 
handling - as reduction tree is same as tree generated in 4a, with only 
difference that in 4a, the destination is MEMREF (because of STORE operation) 
and for reduction it is TEMP_VAR. At this stage, generate code for COLLAPSE 
node in finish statements.

6. Support other vectorizable statements like complex arithmetic operations, 
bitwise operations, type conversions etc.
a. Add support for analysis and primitive reorder tree generation.
b. Vector code generation.

7. Cost effective tree tiling algorithm : Till now, the tree tiling is 
happening without considering cost of computation. However, there can be 
multiple target instructions covering the tree - hence, instead of picking 
first matched largest instruction cover, select the instruction cover based on 
cost of instruction given in .md for the target.

8. Optimizations on created primitive reorder tree : This stage is open ended, 
and depending upon perf analysis, the scope of it can be defined.

The current patch I have attached herewith handles stage 0 and 1a : Adds new 
pass to perform autovectorization using unified representation, defines new 
data structures to cater to this requirement and creates primitive reorder tree 
for LOAD/STORE instructions within the loop.

The whole loop is represented using the ITER_NODE, which have information about
- The preparatory statements for vectorization to be executed before entering 
the loop (like initialization of vectors, prepping for reduction operations, 
peeling etc.)
- Vectorizable loop body represented as PRIMOP_TREE (primitive reordering tree)
- Final statements (For peeling, variable loop bound, COLLAPSE operation for 
reduction etc.)
- Other loop attributes (loop bound, peeling needed, dependences, etc.)

Memory accesses within a loop have definite repetitive pattern which can be 
captured using primitive permute operators which can be used to  determine 
desired permute order for the vector computations.  The PRIMOP_TREE is AST 
which records all computations and permutations required to store  destination 
vector into continuous memory at the end of all iterations of the  loop. It can 
have 

gcc-5-20160607 is now available

2016-06-07 Thread gccadmin
Snapshot gcc-5-20160607 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/5-20160607/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 5 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-5-branch 
revision 237189

You'll find:

 gcc-5-20160607.tar.bz2   Complete GCC

  MD5=ac965defb36cfd93cdf01faa9aa6d716
  SHA1=b8f8615d34f743ea5b88cebbb04f26b12aca58ed

Diffs from 5-20160524 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-5
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


_Bool and trap representations

2016-06-07 Thread Alexander Cherepanov

Hi!

If a variable of type _Bool contains something different from 0 and 1 
its use amounts to UB in gcc and clang. There is a couple of examples in 
[1] ([2] is also interesting).


[1] https://github.com/TrustInSoft/tis-interpreter/issues/39
[2] https://github.com/TrustInSoft/tis-interpreter/issues/100

But my question is about the following example:

--
#include 

int main()
{
  _Bool b;
  *(char *)&b = 123;
  printf("%d\n", *(char *)&b);
}
--

Results:

--
$ gcc -std=c11 -pedantic -Wall -Wextra test.c && ./a.out
123

$ gcc -std=c11 -pedantic -Wall -Wextra -O3 test.c && ./a.out
1
--

gcc version: gcc (GCC) 7.0.0 20160604 (experimental)

It seems that padding in _Bool is treated as permanently unspecified. Is 
this behavior intentional? What's the theory behind it?


One possible explanations is C11, 6.2.6.2p1, which reads: "The values of 
any padding bits are unspecified." But it's somewhat a stretch to 
conclude from it that the values of padding bits cannot be specified 
even with explicit assignment.


Another possible approach is to refer to Committee Response for Question 
1 in DR 260 which reads: "Values may have any bit-pattern that validly 
represents them and the implementation is free to move between alternate 
representations (for example, it may normalize pointers, floating-point 
representations etc.). [...] the actual bit-pattern may change without 
direct action of the program."


Is similar behavior expected from other types of padding (padding in 
long double, padding bytes/bits in structs/unions) in the future? Maybe 
even normalization of pointers (randomly aligning misaligned pointers)?


--
Alexander Cherepanov