Offloading GSOC 2015

2015-03-03 Thread guray ozen
Hi all,

I finished my master's at the Barcelona Supercomputing Center and have
started a PhD. My master's thesis was on code generation of OpenMP 4.0
for GPU accelerators, and I am still working on it.

Last year I presented my research compiler MACC, which is based on the
OmpSs runtime (http://pm.bsc.es/), at IWOMP'14. You can find my slides
and the related paper here:
http://portais.fieb.org.br/senai/iwomp2014/presentations/Guray_Ozen.pptx
http://link.springer.com/chapter/10.1007%2F978-3-319-11454-5_16

As far as I know, GCC 5 will come with OpenMP 4.0 and OpenACC
offloading. Are there any GSoC 2015 projects related to code generation
for offloading? When I checked the todo list about offloading, I
couldn't come across any. What am I supposed to do to get started?

Güray Özen
~grypp


Re: Offloading GSOC 2015

2015-03-12 Thread guray ozen
Hi Thomas,

How can I proceed with submitting an official proposal? Which topics is
GCC interested in?

So far, I have been trying to influence the evolution of the OpenMP 4.0
accelerator model. A summary of my small achievements so far:

- Using shared memory in an efficient way
--- I allowed array privatization for the private/firstprivate clauses
of the teams and distribute directives.
--- However, it is not possible to use private/firstprivate for big arrays.
--- That's why I added dist_private([CHUNK] var-list) and
dist_firstprivate([CHUNK] var-list) clauses, which use shared memory
for big arrays. Briefly, instead of putting the whole array into shared
memory, each thread block puts only its own chunk of the array into
shared memory and works on that chunk (a minimal CUDA sketch of this
chunking is shown below, after this list item group).
--- I also added dist_lastprivate([CHUNK] var-list). lastprivate does
not exist for these constructs in the OpenMP 4.0 standard, since there
is no way to synchronize across GPU thread blocks. This clause,
however, does not need that synchronization because it works on CHUNKs,
so I can collect the data back from shared memory. (You can see an
animation of this on slides 11-12.)
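
(To make the chunking idea concrete, here is a hand-written CUDA sketch
of the mapping; the kernel name, the CHUNK size and the trivial
computation are placeholders, not actual MACC output.)

// Sketch: each thread block stages its own CHUNK of a big array in
// shared memory, works on it, and (for the dist_lastprivate case)
// copies it back.
#define CHUNK 256

__global__ void dist_private_sketch(float *big, int n)
{
  __shared__ float tile[CHUNK];

  int base = blockIdx.x * CHUNK;   // this block's chunk of the big array
  int i = threadIdx.x;

  if (base + i < n)
    tile[i] = big[base + i];       // dist_firstprivate: copy chunk in
  __syncthreads();

  if (base + i < n)
    tile[i] = tile[i] * 2.0f;      // work only on the private chunk

  __syncthreads();
  if (base + i < n)
    big[base + i] = tile[i];       // dist_lastprivate: copy chunk back,
                                   // no inter-block synchronization needed
}

// Launch example: dist_private_sketch<<<(n + CHUNK - 1) / CHUNK, CHUNK>>>(d_big, n);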

- Extension of the device clause
--- I treat the target directive as a task; since my implementation is
based on OmpSs, the OmpSs runtime can manage these tasks.
--- The user no longer has to pass an integer to the device() clause;
the runtime automatically manages multiple GPUs (the OmpSs runtime
already does this).
--- Device-to-device data transfers also became available (normally
there is no way to do this in OpenMP). A rough host-side sketch of this
kind of multi-GPU handling follows below.
(You can see an animation of this on slide 10.)
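
(This is only an illustrative CUDA runtime sketch of what the generated
multi-GPU code could look like; it is not the OmpSs runtime itself, and
the buffer size is a placeholder.)

// Sketch: discover the devices and do a device-to-device copy without
// staging the data through the host.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  if (ndev < 2) { printf("this sketch needs two GPUs\n"); return 0; }

  size_t bytes = 1024 * sizeof(float);
  float *d0, *d1;

  cudaSetDevice(0);
  cudaMalloc(&d0, bytes);
  cudaSetDevice(1);
  cudaMalloc(&d1, bytes);

  cudaMemcpyPeer(d1, 1, d0, 0, bytes);   // device 0 -> device 1 directly

  cudaSetDevice(0); cudaFree(d0);
  cudaSetDevice(1); cudaFree(d1);
  return 0;
}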


Additionally, I am currently working on two topics:

1 - How to take advantage of dynamic parallelism.
--- Here I am comparing dynamic parallelism against creating extra
threads in advance instead of launching a new kernel, because DP causes
overhead and sometimes needs communication between child and parent
threads (for example when a reduction occurs, and the only way to
communicate is through global memory).

2 - Trying to find some advantages of dynamic (runtime) compilation. On
the OpenCL side this is already available; on the NVIDIA side it was
just announced with the CUDA 7.0 NVRTC runtime compilation library. (A
minimal sketch follows below.)
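
(This is only an illustrative NVRTC sketch of what I mean by runtime
compilation on the NVIDIA side; error checking is omitted and the tiny
kernel string is a placeholder.)

// Sketch: compile a kernel string to PTX at run time with NVRTC (CUDA 7.0+).
#include <nvrtc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const char *src =
    "extern \"C\" __global__ void scale(float *x, float a, int n) {\n"
    "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
    "  if (i < n) x[i] *= a;\n"
    "}\n";

  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, src, "scale.cu", 0, NULL, NULL);

  const char *opts[] = { "--gpu-architecture=compute_35" };
  nvrtcCompileProgram(prog, 1, opts);

  size_t ptx_size;
  nvrtcGetPTXSize(prog, &ptx_size);
  char *ptx = (char *) malloc(ptx_size);
  nvrtcGetPTX(prog, ptx);          // the PTX can then be loaded with the driver API

  printf("%.80s...\n", ptx);
  nvrtcDestroyProgram(&prog);
  free(ptx);
  return 0;
}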

Best Regards,
Güray Özen
~grypp



2015-03-11 13:53 GMT+01:00 Thomas Schwinge :
> Hi!
>
> On Tue, 3 Mar 2015 16:16:21 +0100, guray ozen  wrote:
>> I finished my master's at the Barcelona Supercomputing Center and have
>> started a PhD. My master's thesis was on code generation of OpenMP 4.0
>> for GPU accelerators, and I am still working on it.
>>
>> Last year I presented my research compiler MACC, which is based on the
>> OmpSs runtime (http://pm.bsc.es/), at IWOMP'14. You can find my slides
>> and the related paper here:
>> http://portais.fieb.org.br/senai/iwomp2014/presentations/Guray_Ozen.pptx
>> http://link.springer.com/chapter/10.1007%2F978-3-319-11454-5_16
>>
>> As far as I know, GCC 5 will come with OpenMP 4.0 and OpenACC
>> offloading. Are there any GSoC 2015 projects related to code generation
>> for offloading? When I checked the todo list about offloading, I
>> couldn't come across any. What am I supposed to do to get started?
>
> The idea that you propose seems like a fine project for GSoC --
> definitely there'll be enough work to be done.  ;-)
>
> Somebody from the GCC side needs to step up as a mentor.
>
>
> Regards,
>  Thomas


Re: Offloading GSOC 2015

2015-03-20 Thread guray ozen
Hi all,

I've started to prepare my GSoC proposal for GCC's OpenMP support for
GPUs. However, I'm a little bit confused about which of the ideas I
mentioned in my last mail I should propose, and which of them is
interesting for GCC. I'm willing to work on data clauses to enhance the
use of shared memory, or maybe it would be interesting to work on the
OpenMP 4.1 draft version. What do you think I should propose?

Thanks
Güray Özen
~grypp


Re: Offloading GSOC 2015

2015-03-23 Thread guray ozen
Hi Kirill,

Thread hierarchy management and creation policy is a very interesting
topic for me as well. I came across that paper a couple of weeks ago.
Creating more threads at the start and applying a busy-waiting or
if-master scheme generally works better than dynamic parallelism
because of the overhead of DP. Moreover, the compiler may have to
disable some optimizations when DP is enabled. The CUDA-NP paper [1] is
also interesting with respect to managing threads, and its idea is very
close to creating more threads in advance instead of using dynamic
parallelism. On the other hand, DP sometimes gives better performance
since it allows creating a new thread hierarchy.

To clarify, I prepared two examples comparing dynamic parallelism with
creating more threads in advance:
* (1st example) Dynamic parallelism gives the better result.
* (2nd example) Creating more threads in advance gives the better result.

1st example: https://github.com/grypp/gcc-proposal-omp4/tree/master/prop0
* (prop0.c) has 4 nested loops.
* (prop0.c:10) puts a small array into shared memory.
* The iteration counts of the first two loops are expressed explicitly;
even if they only become known at run time, the generated PTX/SPIR can
be adapted.
* The iteration counts of the last two loops are dynamic and depend on
the induction variables of the first two loops.
* (prop0.c:24-28) the arrays are accessed in a very inefficient
(non-coalesced) way.
- If we put a #pragma omp parallel for at (prop0.c:21):
-* it will create another kernel (prop0_dynamic.cu:34), and
-* the array access pattern will change (prop0_dynamic.cu:48-52).

Basically, the advantages of using dynamic parallelism at this point are:
1 - the array access pattern becomes coalesced, and
2 - we can get rid of the 3rd and 4th for loops, since we can create as
many threads as there are iterations (a small advantage in terms of
thread divergence).
(A generic sketch of this parent/child kernel structure follows below;
it is not the actual prop0_dynamic.cu.)
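
(Hand-written for illustration only; it needs sm_35+ and -rdc=true and
is not MACC output. The parent kernel launches one child grid per outer
iteration so the child threads access the array in a coalesced way.)

// Child kernel: one thread per inner iteration, coalesced accesses.
__global__ void child(float *row, int m)
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j < m)
    row[j] += 1.0f;
}

// Parent kernel: one thread per outer iteration; the inner trip count m
// becomes known here, so a child grid is launched from the device.
__global__ void parent(float *a, int n, int m)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    child<<<(m + 127) / 128, 128>>>(a + (size_t) i * m, m);
}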

2nd example: https://github.com/grypp/gcc-proposal-omp4/tree/master/prop1
* It has 2 nested loops.
* The innermost loop has a reduction.
* I put up 3 possible generated CUDA code variants:
* 1 - prop1_baseline.cu: only CUDA-izes prop1.c:8 and does not take
prop1.c:12 into account.
* 2 - prop1_createMoreThread.cu: creates more threads for the innermost
loop, does the reduction with the extra threads, and communicates
through shared memory.
* 3 - prop1_dynamic.cu: creates a child kernel and communicates through
global memory, but allocates the global memory in advance at
prop1_dynamic.cu:5.

The full version of prop1 calculates an n-body simulation. I
benchmarked it with my research compiler [2] and put the results here:
https://github.com/grypp/gcc-proposal-omp4/blob/master/prop1/prop1-bench.pdf
. As can be seen from that figure, the 2nd kernel has the best
performance. (A rough sketch of the 2nd variant's pattern follows
below.)
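
(This sketch assumes a fixed block size and is not the actual
prop1_createMoreThread.cu; one block handles one outer iteration and
the extra threads do the inner reduction through shared memory.)

#define BS 128

// One block per outer iteration i; the block's threads cooperatively
// reduce the inner loop through shared memory (no global sync needed).
__global__ void outer_with_inner_reduction(const float *x, float *out, int m)
{
  __shared__ float sdata[BS];
  int i = blockIdx.x;                        // outer iteration
  float sum = 0.0f;

  for (int j = threadIdx.x; j < m; j += BS)  // extra threads created in advance
    sum += x[(size_t) i * m + j];

  sdata[threadIdx.x] = sum;
  __syncthreads();

  for (int s = BS / 2; s > 0; s >>= 1) {     // tree reduction in shared memory
    if (threadIdx.x < s)
      sdata[threadIdx.x] += sdata[threadIdx.x + s];
    __syncthreads();
  }

  if (threadIdx.x == 0)
    out[i] = sdata[0];
}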


Comparing these two examples, my rough idea is that it might be worth
implementing an inspector, based on compiler analysis, that decides
whether dynamic parallelism should be used or not. That way we can also
avoid the extra slowdown caused by the compiler disabling optimizations
when DP is enabled. Besides, there are other cases where we can take
advantage of DP, such as recursive algorithms. Streams can also be
used, even though concurrency is not guaranteed (and they add
overhead). In addition to this, I can work on the if-master or
busy-waiting logic. (A very rough sketch of such a decision heuristic
follows below.)
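
(The thresholds and inputs here are made up purely for illustration; a
real inspector would get them from compiler analysis.)

// Sketch: prefer dynamic parallelism only when the inner work is large
// and irregular enough to pay for the launch overhead and for the
// global-memory communication it implies.
static int use_dynamic_parallelism(long inner_trip_count,
                                   int inner_has_reduction,
                                   int inner_trip_is_irregular)
{
  const long MIN_INNER_WORK = 4096;  /* made-up threshold */

  if (inner_has_reduction)           /* parent/child must talk via global memory */
    return 0;
  if (!inner_trip_is_irregular)      /* regular loops: pre-create threads instead */
    return 0;
  return inner_trip_count >= MIN_INNER_WORK;
}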

I am really willing to work on thread hierarchy management and
creation policy. If this is interesting for GCC, how can I make
progress on this topic?


By the way, I haven't worked on #pragma omp simd yet. It could be
mapped to warps (if there are no dependences among loop iterations). On
the NVIDIA side, since threads in the same warp can read each other's
data with __shfl, data clauses might be used to enhance performance
(I'm not sure). (A small sketch of this follows below.)
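
(A per-warp reduction with __shfl_down as it exists in CUDA 7.0; newer
CUDA versions use __shfl_down_sync instead. This is just an
illustration, not generated code.)

// Each warp reduces its 32 values without shared memory or __syncthreads().
__inline__ __device__ float warp_reduce_sum(float val)
{
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down(val, offset);  // read the value from the lane 'offset' away
  return val;                         // lane 0 ends up with the warp's sum
}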

[1] - http://people.engr.ncsu.edu/hzhou/ppopp_14_1.pdf
[2] - http://link.springer.com/chapter/10.1007%2F978-3-319-11454-5_16
Güray Özen
~grypp



2015-03-20 15:47 GMT+01:00 Kirill Yukhin :
> Hello Güray,
>
> On 20 Mar 12:14, guray ozen wrote:
>> I've started to prepare my GSoC proposal for GCC's OpenMP support for GPUs.
> I think there is a wide range for exploration here. As you know, OpenMP 4
> contains vectorization pragmas (`pragma omp simd') which do not perfectly
> suit GPGPUs.
> Another problem is how to create threads dynamically on a GPGPU. As far as
> we understand it, there are two possible solutions:
>   1. Use the dynamic parallelism available in recent APIs (launch a new
>   kernel from the target region)
>   2. Estimate the maximum thread number on the host and start them all from
>   the host, making unused threads busy-wait
> There are papers which investigate both approaches [1], [2].
>
>> However, I'm a little bit confused about which of the ideas I mentioned in
>> my last mail I should propose, and which of them is interesting for GCC.
>> I'm willing to work on data clauses to enhance the use of shared memory,
>> or maybe it would be interesting to work on the OpenMP 4.1 draft version.
>> What do you think I should propose?
> We're going to work on OpenMP 4.1 offloading features.
>
> [1] - http://openmp.org/sc14/Booth-Sam-IBM.pdf
> [2] - http://dl.acm.org/citation.cfm?id=2688364
>
> --
> Thanks, K


Re: Offloading GSOC 2015

2015-03-28 Thread guray ozen
Hi All,

I submitted my proposal via the GSoC platform as a tiny-many project.
Based on Kirill's reply, I decided to work on the thread hierarchy
manager. The PDF version of the proposal can be found here: [1]. In
short, my proposal consists of combining dynamic parallelism, extra
thread creation in advance, and kernel splitting when generating code
for GPUs. Comments and suggestions are welcome.

Regards.

[1]: 
https://raw.githubusercontent.com/grypp/gcc-proposal-omp4/master/gsoc-gurayozen.pdf
Güray Özen
~grypp

OpenACC or OpenMP 4.0 target directives

2013-11-18 Thread guray ozen
Hello,

I'm doing my master's at the Polytechnic University of Catalonia
(BarcelonaTech) and I have started my master's thesis. My topic is code
generation for hardware accelerators in OmpSs. OmpSs is being developed
by the Barcelona Supercomputing Center, and it has a runtime for GPUs
that can manage kernel invocation, multi-GPU execution, data transfers,
asynchronous kernel invocation, and so on. That's why I'm using OmpSs:
I want to focus only on code generation and optimizations, and I'm
still new to this work. Right now I support the "target", "teams",
"distribute", and "distribute parallel for" directives, but of course I
can only generate rather naive kernels :( I'm looking for optimization
techniques. (A rough illustration of that naive mapping follows below.)
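
(Written by hand for illustration; the kernel and host wrapper names
are placeholders and the real generated code differs in details.)

/* Source (OpenMP 4.0 accelerator directives):
 *
 *   #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
 *   for (int i = 0; i < n; i++)
 *     a[i] = a[i] * 2.0f;
 *
 * Naive CUDA mapping: one thread per iteration, no further optimization.
 */
#include <cuda_runtime.h>

__global__ void target_region_0(float *a, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // teams * threads
  if (i < n)
    a[i] = a[i] * 2.0f;
}

/* Host side: copy in, launch, copy back (the map(tofrom: ...) clause). */
static void launch_target_region_0(float *a, int n)
{
  float *d_a;
  cudaMalloc(&d_a, n * sizeof(float));
  cudaMemcpy(d_a, a, n * sizeof(float), cudaMemcpyHostToDevice);
  target_region_0<<<(n + 255) / 256, 256>>>(d_a, n);
  cudaMemcpy(a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_a);
}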

I came across news that GCC will support the OpenACC/OpenMP target
directives. How can I download this version? I also want to ask about
optimization: which optimization techniques have you applied? Do you
have any suggestions for my thesis (papers, algorithms, and so on)?

Regards,

Güray Özen
~grypp


About gsoc 2014 OpenMP 4.0 Projects

2014-02-25 Thread guray ozen
Hello,

I'm a master's student in high-performance computing at the Barcelona
Supercomputing Center, and I'm working on my thesis on implementing the
OpenMP accelerator model in our compiler (OmpSs). I have almost
finished implementing all the new directives to generate CUDA code, and
by my design the corresponding OpenCL implementation should not take
much extra effort. But I haven't yet tried Intel MIC, APUs, or other
hardware accelerators :) Now I'm benchmarking the kernel code generated
by my compiler. Although the output kernels are generally naive, the
speedups are not bad at all: compared with the HMPP OpenACC 3.2.x
compiler, the speedups are almost the same, and in some cases my
results are slightly better. That's why, in this term, I am going to
work on compiler-level and runtime-level optimizations for GPUs.

When I looked at the GCC OpenMP 4.0 project, I couldn't see anything
about code generation. Are you going to announce something later, or
should I apply to GSoC with my own idea about code generation and
device code optimizations?

Güray Özen
~grypp


Re: About gsoc 2014 OpenMP 4.0 Projects

2014-02-27 Thread guray ozen
Hi Evgeny,

As I said, I'm working on source-to-source generation for my master's
thesis, but my compiler currently transforms C to CUDA, not to PTX :)
For further information, I have uploaded my documents and code samples
regarding my master's thesis to https://github.com/grypp/macc-omp4. I
also added my benchmark results, which cover comparisons between CAPS
OpenACC and MACC. For now, my compiler MACC gets better results than
CAPS OpenACC for a Jacobi application and for the CG application from
the NAS parallel benchmarks.

Actually, I had never thought about translating to an intermediate
language, but it is a great idea for generating optimized code. As far
as I know, though, no NVIDIA architecture supports a SPIR backend yet,
right?

What I understood is that GCC is currently working on SPIR code
generation to support OpenMP 4.0. So do you have any future plans to
generate PTX? The SPIR backend is very new (it was announced almost a
month ago), and I think your team has more experience with SPIR than I
do. Therefore I'm asking: is there any project for a PTX
implementation?

By the way, I couldn't see any specific project regarding OpenMP 4.0 at
http://gcc.gnu.org/wiki/openmp . For GSoC, which area am I supposed to
focus on?

Regards.
Güray Özen
~grypp



2014-02-26 8:20 GMT+01:00 Evgeny Gavrin :
> Hi Guray,
>
> There were two announcements: a PTX backend and OpenCL code generation.
> The initial PTX patches can be found in the mailing list, and the OpenCL
> experiments in the openacc_1-0_branch.
>
> Regarding GSoC, it would be nice if you applied with your proposal on code
> generation.
> I think that projects aimed at improving OpenCL generation or implementing
> a SPIR backend would be useful for GCC.
>
> -
> Thanks,
> Evgeny.