On Thu, 18 Jul 2024, Richard Biener wrote:
> >If both b and c are scalars and the type of true?b:c has the same size
> >as the element type of a, then b and c are converted to a vector type
> >whose elements have this type and with the same number of elements as a.
> >
> > (in https:
On Thu, 18 Jul 2024, Richard Biener wrote:
> > (also, in C op2 and op3 of a ternary operator always have integer promotions
> > applied, but for vector selection we should use unpromoted types)
>
> Yes. So a good testcase would use a char-typed variable then. It’s
> unfortunate C and C++ do no
build-time dependency on installed headers.
So maybe we have that as an option too.
Alexander
On Fri, 22 Dec 2023, Alexander Monakov wrote:
> From: Daniil Frolov
>
> PR 66487 is asking to provide sanitizer-like detection for C++ object
> lifetime violations that are worked around with
In PR 114480 we are hitting a case where tree-into-ssa scales
quadratically due to prune_unused_phi_nodes doing O(N log N)
work for N basic blocks, for each variable individually.
Sorting the 'defs' array is especially costly.
It is possible to assist gcc_qsort by laying out dfs_out entries
in the
Hello!
I looked at ternlog a bit last year, so I'd like to offer some drive-by
comments. If you want to tackle them in a follow-up patch, or leave them for
someone else to handle, please let me know.
On Fri, 17 May 2024, Roger Sayle wrote:
> This revised patch has been tested on x86_64-pc-linux-gnu
On Tue, 28 May 2024, Richard Biener wrote:
> On Wed, May 15, 2024 at 12:59 PM Alexander Monakov wrote:
> >
> >
> > Hello,
> >
> > I'd like to ask if anyone has any new thoughts on this patch.
> >
> > Let me also point out that valgrind/memcheck.
On Tue, 28 May 2024, Richard Biener wrote:
> > I am a bit confused what you mean by "cheaper". Could it be that we are not
> > on the same page regarding the machine code behind client requests?
>
> Probably "cheaper" in register usage.
But it doesn't matter considering that execution under Va
295,155 branch-misses:u          # 0.65% of all branches  ( +- 0.11% )
0.033242 +- 0.000418 seconds time elapsed ( +- 1.26% )
Is the revised patch below still ok? I've rolled the configury changes into it,
and dropped the (now unnecessary) AVX2 helper
On Thu, 22 Aug 2024, Marc Poulhiès wrote:
> Single argument static_assert is C++17 only.
>
> libcpp/ChangeLog:
>
> * lex.cc: fix static_assert to use 2 arguments.
When pushing, please fix the entry to mention the function name if that's
not too much trouble:
* lex.cc (search_line
The recently introduced search_line_fast_ssse3 raised padding
requirement from 16 to 64, which was adjusted in read_file_guts,
but the corresponding ' + 16' in _cpp_convert_input was overlooked.
libcpp/ChangeLog:
PR preprocessor/116458
* charset.cc (_cpp_convert_input): Bump paddi
Tie together the two functions that ensure tail padding with
search_line_ssse3 via CPP_BUFFER_PADDING macro.
libcpp/ChangeLog:
* internal.h (CPP_BUFFER_PADDING): New macro; use it ...
* charset.cc (_cpp_convert_input): ...here, and ...
* files.cc (read_file_guts): ...here,
On Fri, 14 Jun 2024, Kong, Lingling wrote:
> The APX CFCMOV[1] feature implements conditional faulting, which means
> that all memory faults are suppressed when the condition code evaluates
> to false for a load or store of a memory operand. Now we could load or
> store a memory operand that may trap or
On Tue, 30 Jul 2024, Richard Biener wrote:
> > Oh, and please add a small comment why we don't use XFmode here.
>
> Will do.
>
> /* Do not enable XFmode, there is padding in it and it suffers
> from normalization upon load like SFmode and DFmode when
> not using S
On Tue, 30 Jul 2024, Jakub Jelinek wrote:
> On Tue, Jul 30, 2024 at 03:43:25PM +0300, Alexander Monakov wrote:
> >
> > On Tue, 30 Jul 2024, Richard Biener wrote:
> >
> > > > Oh, and please add a small comment why we don't use XFmode here.
> > >
On Tue, Jul 30, 2024 at 03:00:49PM +0200, Richard Biener wrote:
> > What mangling fld performs depends on the contents of the FP control
> > word which is awkward.
For float/double loads (FLDS and FLDL) we know format conversion changes
SNaNs to QNaNs, but it's a widening conversion, so e.g. ro
Hi,
On Tue, 30 Jul 2024, Andi Kleen wrote:
> AVX2 is widely available on x86, and it allows doing the scanner line
> check 32 bytes at a time. The code is similar to the SSE2 code path,
> just using AVX and 32 bytes at a time instead of SSE2's 16 bytes.
>
> Also adjust the code to allow inlini
On Tue, 30 Jul 2024, Andi Kleen wrote:
> > I have looked at this code before. When AVX2 is available, so is SSSE3,
> > and then a much more efficient approach is available: instead of comparing
> > against \r \n \\ ? one-by-one, build a vector
> >
> > 0 1 2 3 4 5 6 7 8 9 a b c
Hello!
As discussed, I'm sending patches that reimplement our SSE4.2 search_line_fast
helper with SSSE3, and then add the corresponding AVX2 helper. They are on top
of Andi's "Remove MMX code path in lexer" patch, which was approved, but not
committed yet (Andi, can you push your own patch?).
App
Upcoming patches first drop the Binutils ISA requirement from SSE4.2 to SSSE3,
then bump it to AVX2. Instead of fiddling with detection, just bump
our configure check to AVX2 immediately: if by some accident somebody
builds GCC without AVX2 support in the assembler, they will get the SSE2
vectorized lexer, whi
Since the characters we are searching for (CR, LF, '\', '?') all have
distinct ASCII codes mod 16, PSHUFB can help match them all at once.
libcpp/ChangeLog:
* lex.cc (search_line_sse42): Replace with...
(search_line_ssse3): ... this new function. Adjust the use...
(init_v
Use the same PSHUFB-based matching as in the SSSE3 helper, just 2x
wider.
Directly use the new helper if __AVX2__ is defined. It makes the other
helpers unused, so mark them inline to prevent warnings.
Rewrite and simplify init_vectorized_lexer.
libcpp/ChangeLog:
* files.cc (read_file_g
On Tue, 6 Aug 2024, Alexander Monakov wrote:
> --- a/libcpp/files.cc
> +++ b/libcpp/files.cc
[...]
> + pad = HAVE_AVX2 ? 32 : 16;
This should have been
#ifdef HAVE_AVX2
pad = 32;
#else
pad = 16;
#endif
Alexander
On Wed, 7 Aug 2024, Richard Biener wrote:
> OK with that change.
>
> Did you think about a AVX512 version (possibly with 32 byte vectors)?
> In case there's a more efficient variant of pshufb/pmovmskb available
> there - possibly
> the load on the branch unit could be lessened with using maskin
On Wed, 7 Aug 2024, Richard Biener wrote:
> > > + data = *(const v16qi_u *)s;
> > > + /* Prevent propagation into pshufb and pcmp as memory operand. */
> > > + __asm__ ("" : "+x" (data));
> >
> > It would probably make sense to a file a PR on this separately,
> > to eventually fi
On Wed, 7 Aug 2024, Richard Biener wrote:
> > > This is probably to work around bugs in older compiler versions? If
> > > not I agree.
> >
> > This is deliberate hand-tuning to avoid a subtle issue: pshufb is not
> > macro-fused on Intel, so with propagation it is two uops early in the
> > CPU
On Thu, 22 Aug 2013, Gabriel Dos Reis wrote:
> > - I would like to recall issue if we can make NEW_EXPR annotated with
> >MALLOC attribute. Without it, it is basically impossible to track
> >any dynamically allocated objects in the middle-end
>
> operator new is replaceable by user progr
On Thu, 22 Aug 2013, Jan Hubicka wrote:
> > On Thu, 22 Aug 2013, Gabriel Dos Reis wrote:
> > > > - I would like to recall issue if we can make NEW_EXPR annotated with
> > > >MALLOC attribute. Without it, it is basically impossible to track
> > > >any dynamically allocated objects in th
Hello,
Could you use the existing facilities instead, such as adjust_priority hook,
or making the compare-branch insn sequence a SCHED_GROUP?
Alexander
On Wed, Sep 4, 2013 at 9:53 PM, Steven Bosscher wrote:
>
> On Wed, Sep 4, 2013 at 10:58 AM, Alexander Monakov wrote:
> > Hello,
> >
> > Could you use the existing facilities instead, such as adjust_priority hook,
> > or making the compare-branch insn sequ
On Fri, 6 Sep 2013, Wei Mi wrote:
> SCHED_GROUP works after I add chain_to_prev_insn after
> add_branch_dependences, in order to chain control dependences to prev
> insn for sched group.
chain_to_prev_insn is done at the end of deps_analyze_insn; why is that not
sufficient?
Alexander
On Tue, 10 Sep 2013, Wei Mi wrote:
> Because deps_analyze_insn only analyzes data deps but no control deps.
> Control deps are included by add_branch_dependences. Without the
> chain_to_prev_insn in the end of add_branch_dependences, jmp will be
> control dependent on every previous insn in the
On Wed, 11 Sep 2013, Wei Mi wrote:
> I tried that and it caused some regressions, so I chose to do
> chain_to_prev_insn another time in add_branch_dependences. There could
> be some dependence between those two functions.
(please don't top-post on this list)
In that case you can adjust 'last
On Wed, 11 Sep 2013, Wei Mi wrote:
> I agree with you that explicit handling in sched-deps.c for this
> feature does not look good. So I moved it to sched_init (instead of
> ix86_sched_init_global, because ix86_sched_init_global is used to
> install scheduling hooks), and then it is possible for other
On Thu, 12 Sep 2013, Wei Mi wrote:
> Thanks, fixed. New patch attached.
Thanks. At this point you need feedback from the x86 and scheduler maintainers.
I would recommend resubmitting the patch with a ChangeLog entry, and with
the text of the patch inline in the email (your last mail has the patch
Hello,
You probably want to disable this transformation when the number of iterations
is predicted to be small, right?
Shouldn't dot product transform be predicated on -fassociative-math?
Do you have a vision of a generalized pattern matcher to allow adding other
routines easily?
I'm curious wh
, unconditionally. That looks historical, or an oversight rather than
deliberate, and the following patch makes expansion use ix86_fp_compare_mode
like in other places. Bootstrapped and regtested on x86_64-linux.
2013-10-11 Alexander Monakov
* config/i386/i386.c
Hello,
A very minor nit: in common.opt, entries for the options should be changed to
fregmove
Common Ignore
Does nothing. Preserved for backward compatibility.
instead of removing them altogether, so the compiler does not start rejecting
build commands with such options. There are now a few suc
This patch makes it possible to use __attribute__((shared)) to place non-automatic
variables in shared memory.
* config/nvptx/nvptx.c (nvptx_encode_section_info): Handle "shared"
attribute.
(nvptx_handle_shared_attribute): New. Use it...
(nvptx_attribute_table): ... here (new
* config/nvptx/nvptx.c (nvptx_declare_function_name): Fix warning.
---
gcc/ChangeLog.gomp-nvptx | 4
gcc/config/nvptx/nvptx.c | 2 +-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 45aebdd..f63f840 100644
--- a/g
Link libgfortran for offloaded code as well.
* testsuite/libgomp.fortran/fortran.exp (lang_link_flags): Pass
-foffload=-lgfortran in addition to -lgfortran.
* testsuite/libgomp.oacc-fortran/fortran.exp (lang_link_flags): Ditto.
---
libgomp/ChangeLog.gomp-nvptx
This patch removes two zero-size stubs, there's no need for these overrides.
* config/nvptx/sections.c: Delete.
* config/nvptx/splay-tree.c: Delete.
---
libgomp/ChangeLog.gomp-nvptx | 5 +
libgomp/config/nvptx/sections.c | 0
libgomp/config/nvptx/splay-tree.c | 0
3 file
This patch adds a few insn patterns used for OpenMP SIMD
reduction/lastprivate/ordered lowering for SIMT execution. OpenMP lowering
produces GOMP_SIMT_... internal functions when lowering SIMD constructs that
can be offloaded to a SIMT device. After lto stream-in, those internal
functions are tri
Hello,
I'm pushing this patch series to the gomp-nvptx branch. It adds the
following:
- backend and omp-low.c changes for SIMT-style SIMD region handling
- libgomp changes for running the fortran testsuite
- libgomp changes for spawning multiple OpenMP teams
I'll perform a trunk merge and
This patch removes the nvptx fortran.c stub that provides only
_gfortran_abort. It is possible to link libgfortran on NVPTX with
-foffload=-lgfortran.
* config/nvptx/fortran.c: Delete.
---
libgomp/ChangeLog.gomp-nvptx | 4
libgomp/config/nvptx/fortran.c | 40 -
This patch implements time.c on NVPTX with the %clock64 register. The PTX
documentation describes %globaltimer as explicitly off-limits for us.
* config/nvptx/time.c: New.
---
libgomp/ChangeLog.gomp-nvptx | 4
libgomp/config/nvptx/time.c | 49 +
This is the libgomp plugin side of omp_clock_wtime support on NVPTX. Query
GPU frequency and copy the value into the device image.
At the moment CUDA driver sets GPU to a fixed frequency when a CUDA context is
created (the default is to use the highest non-boost frequency, but it can be
altered w
This patch extends SIMD-via-SIMT lowering in omp-low.c to handle all loops,
lowering reduction/lastprivate/ordered appropriately (but it still chickens
out on collapsed loops, handling them as if safelen=1). New SIMT lowering
snippets use new internal functions that are folded for non-SIMT targets
On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > You mean you already have implemented something along the lines I
> > proposed?
>
> Yes, I was implementing OpenMP teams, and it made sense to add warps per block
> limiting at the same time (i.e. query CU_FUNC_ATTRIBUTE_... a
"../../lock.c"
#ifdef LIBGOMP_GNU_SYMBOL_VERSIONING
/* gomp_mutex_* can be safely locked in one thread and
diff --git a/libgomp/config/nvptx/lock.c b/libgomp/config/nvptx/lock.c
index e69de29..7731704 100644
--- a/libgomp/config/nvptx/lock.c
+++ b/libgomp/config/nvptx/lock.c
@@ -0,0 +1,41
This adds necessary plumbing to spawn multiple teams.
To be reverted on this branch prior to merge.
---
gcc/builtin-types.def| 7 +-
gcc/fortran/types.def| 5 +-
gcc/omp-builtins.def | 2 +-
gcc/omp-low.c
* config/nvptx/icv-device.c (omp_get_num_teams): Update.
(omp_get_team_num): Ditto.
* config/nvptx/target.c (GOMP_teams): Update.
* config/nvptx/team.c (nvptx_thrs): Place in shared memory.
* icv.c (gomp_num_teams_var): Define.
* libgomp.h (gomp_num_t
This complements multiple teams support on the libgomp plugin side.
* plugin/plugin-nvptx.c (struct targ_fn_descriptor): Add new fields.
(struct ptx_device): Ditto. Set them...
(nvptx_open_device): ...here.
(GOMP_OFFLOAD_load_image): Set new targ_fn_descriptor fiel
On Wed, 20 Jan 2016, Ilya Verbin wrote:
> I agree that OpenMP doesn't guarantee that all target regions must be executed
> on the device, but in this case a user can't be sure that some library
> function
> always will offload (because the library might be replaced by fallback
> version),
> and h
On Tue, 26 Jan 2016, Thomas Schwinge wrote:
> A very similar problem also exists for nvptx offloading (Nathan CCed),
> where we emit similar warnings (enabled by default). As nvptx offloading
> happens during link-time (not compile-time, as with hsa offloading),
> these don't affect GCC's compile
On Mon, 11 Jan 2016, Alexander Monakov wrote:
> On Mon, 11 Jan 2016, Thomas Schwinge wrote:
> > Alexander, would you please also submit a fix for that for nvptx-tools'
> > nvptx-run.c? (Or want me to do that?)
>
> I can do that, along with another small change I used f
Hello!
The following patch fixes subtle breakage on NVPTX where unsigned comparisons
would sometimes be mistranslated by the PTX JIT. The new test demonstrates that
by using the %nctaid.x (number of blocks) register, but a comparison against
a value in constant memory can also trigger that (I could ea
Hello,
On Wed, 3 Feb 2016, Nathan Sidwell wrote:
> 1) extend the -fopenacc-dim=X:Y:Z syntax to allow '-' indicating a runtime
> choice. (0 also indicates that, but I thought best to have an explicit syntax
> as well).
Does it work when the user specifies one of the dimensions, so that references
On Wed, 3 Feb 2016, Nathan Sidwell wrote:
> You can only override at runtime those dimensions that you said you'd override
> at runtime when you compiled your program.
Ah, I see. That's not obvious to me, so perhaps added documentation can be
expanded to explain that? (I now see that the plugin
t 3 arguments to untagged entries. Old libgomp
plugins unaware of the change should be able to detect failure to provide
sufficient arguments to entries emitted from new compiler from the failure of
cuLaunchKernel
Alexander Monakov (5):
libgomp plugin: correct types
Revert "nvptx plugin:
.
Revert
2015-12-09 Alexander Monakov
* plugin/plugin-nvptx.c (nvptx_open_device): Adjust heap size.
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 79fd253..cb6a3ac 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin
Handling of the arguments array wrongly assumed that 'long' would match the
size of 'void *'. As that would break on MinGW, use 'intptr_t'. Use 'int' for
'teams' and 'threads', as that's what cuLaunchKernel accepts.
* plugin/plugin-nvptx.c (nvptx_adjust_launch_bounds): Adjust types.
This patch implements the NVPTX backend part of the transition to
host-allocated soft stacks. The compiler-emitted kernel entry code now
accepts a pointer to stack storage and a per-warp stack size, and initializes
__nvptx_stacks based on that (as well as trivially initializing __nvptx_uni).
The re
This patch implements the NVPTX libgomp part of the transition to
host-allocated soft stacks. The wrapper around gomp_nvptx_main previously
responsible for that is no longer needed.
This mostly reverts commit b408f1293e29a009ba70a3fda7b800277e1f310a.
* config/nvptx/team.c: (gomp_nvptx_ma
This patch implements the libgomp plugin part of the transition to
host-allocated soft stacks. For now only a simple scheme with
allocation/deallocation per launch is implemented; a followup change is
planned to cache and reuse allocations when appropriate.
The call to cuLaunchKernel is changed t
On Mon, 22 Feb 2016, Nathan Sidwell wrote:
> On 02/15/16 13:44, Alexander Monakov wrote:
> > This patch implements the NVPTX backend part of the transition to
>
> > + static const char template64[] = ENTRY_TEMPLATE ("64", "8",
> > "ma
On Mon, 22 Feb 2016, Nathan Sidwell wrote:
> On 02/22/16 15:25, Alexander Monakov wrote:
>
> > Template strings have an embedded nul character at the position where ORIG
> > goes, so template_2 is set to point at the position following the embedded
> > nul
> >
Hello Nathan,
On Wed, 9 Sep 2015, Nathan Sidwell wrote:
> I've applied this patch to port some cleanups, mainly formatting and loop
> idioms from the gomp4 branch.
This patch that you committed to trunk in September 2015 forcefully disables
generation of line number information, undoing a part of
> Attachment is the patch which repair -fno-plt support for AArch64.
>
> aarch64_is_noplt_call_p will only be true if:
>
> * gcc is generating position independent code.
> * function symbol has declaration.
> * either -fno-plt or "(no_plt)" attribute specified.
> * it's a external functio
This patch makes one OpenACC-specific path in nvptx_record_offload_symbol
optional.
* config/nvptx/nvptx.c (nvptx_record_offload_symbol): Allow missing
OpenACC attributes.
---
gcc/config/nvptx/nvptx.c | 19 +++
1 file changed, 11 insertions(+), 8 deletions(-)
diff
This is a minimal patch for NVPTX OpenMP offloading, using Jakub's initial
implementation. It makes it possible to run '#pragma omp target' successfully,
without any parallel execution: 1 team of 1 thread is spawned on the device, and
target regions with '#pragma omp parallel' will fail with a link error.
This patch makes it possible to see, with GOMP_DEBUG=1 in the environment,
when target regions are executed on the host.
* target.c (GOMP_target): Use gomp_debug on fallback path.
---
libgomp/target.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/libgomp/target.c b/libgomp/target.c
index 6ca80ad..1c
This patch makes it possible to invoke mkoffload meaningfully with -fopenmp. The check
for -fopenacc flag is specific to gomp4 branch: trunk does not have it.
* config/nvptx/mkoffload.c (main): Do not check for -fopenacc.
---
gcc/config/nvptx/mkoffload.c | 7 ++-
1 file changed, 2 insertions(+)
Hello,
This patch series implements some minimally required changes to have OpenMP
offloading working for NVPTX target on the gomp4 branch. '#pragma omp target'
and data updates should work, but all parallel execution functionality remains
stubbed out (uses of '#pragma omp parallel' in target reg
Although newlib headers define most pthreads types, pthread_attr_t is not
available. Macro-replace it by 'void' to keep the prototype of
gomp_init_thread_affinity unchanged, and do not declare gomp_thread_attr.
* libgomp.h: Define pthread_attr_t to void on NVPTX.
---
libgomp/libgomp.h |
This stub header only provides empty struct gomp_barrier_t. For now I've
punted on providing a minimally-correct implementation.
* config/nvptx/bar.h: New file.
---
libgomp/config/nvptx/bar.h | 38 ++
1 file changed, 38 insertions(+)
create mode 10064
This patch ports env.c to NVPTX. It drops all environment parsing routines
since there's no "environment" on the device. For now, the useful effect of
the patch is providing 'omp_is_initial_device' to distinguish host execution
from target execution in user code.
Several functions use gomp_icv,
@@ -0,0 +1,67 @@
+/* Copyright (C) 2015 Free Software Foundation, Inc.
+ Contributed by Alexander Monakov
+
+ This file is part of the GNU Offloading and Multi Processing Library
+ (libgomp).
+
+ Libgomp is free software; you can redistribute it and/or modify it
+ under the terms of the GNU
On Wed, 23 Sep 2015, Bernd Schmidt wrote:
> I have two major concerns here. Can I ask you how much experience you have
> with GPU programming and ptx?
I'd say I have a good understanding of the programming model and nvidia
hardware architecture, having used CUDA tools and paid attention to
r&d ne
On Thu, 24 Sep 2015, Jakub Jelinek wrote:
> On Wed, Sep 23, 2015 at 08:22:22PM +0300, Alexander Monakov wrote:
> > This patch ports env.c to NVPTX. It drops all environment parsing routines
> > since there's no "environment" on the device. For now, the use
> I'd prefer here the https://gcc.gnu.org/ml/gcc-patches/2015-04/msg01418.html
> changes to libgomp.h and associated configury changes.
OK, like the following?
[gomp4] libgomp: guard pthreads usage by LIBGOMP_USE_PTHREADS
This makes it possible to avoid referencing pthread types and functions on nvptx.
On Thu, 15 Oct 2015, Jakub Jelinek wrote:
> Looking at Cuda, for async target region kernels we'd probably use
> a non-default stream and enqueue the async kernel in there. I see
> we can e.g. cudaEventRecord into the stream and then either cudaEventQuery
> to busy poll the event, or cudaEventSync
On Thu, 15 Oct 2015, Jakub Jelinek wrote:
> > - this functionality doesn't currently work through CUDA MPS
> > ("multi-process
> > server", for funneling CUDA calls from different processes through a
> > single "server" process, avoiding context-switch overhead on the device,
> > som
The NVPTX backend emits each function either as .func (callable only from the
device code) or as .kernel (entry point for a parallel region). OpenMP
lowering adds "omp target entrypoint" attribute to functions outlined from
target regions. Unlike OpenACC offloading, OpenMP offloading does not in
Hello,
This patch series moves libgomp/nvptx porting further along to get initial
bits of parallel execution working, mostly unbreaking the testsuite. Please
have a look! I'm interested in feedback, and would like to know if it's
suitable to become a part of a branch.
This patch series ports en
Due to special treatment of types, emitting variables of type _Bool at
global scope is impossible: extern references are emitted with .u8, but
definitions use .u64. This patch fixes the issue by treating the boolean
type as an integer type.
* config/nvptx/nvptx.c (init_output_initializer): Also
This makes it possible to emit decls in 'shared' memory from the middle-end.
* config/nvptx/nvptx.c (nvptx_legitimate_address_p): Adjust prototype.
(nvptx_section_for_decl): If type of decl has a specific address
space, return it.
(nvptx_addr_space_from_address): Ditto.
NVPTX does not support alloca or variable-length stack allocations, thus
heap allocation needs to be used instead. I've opted to make this a generic
change instead of guarding it with an #ifdef: libgomp usually leaves thread
stack size up to libc, so avoiding unbounded stack allocation makes sense
This patch removes 0-size libgomp stubs where generic implementations can be
compiled for the NVPTX target.
It also removes non-stub critical.c, which contains assembly implementations
for GOMP_atomic_{start,end}, but does not contain implementations for
GOMP_critical_*. My understanding is that
This provides minimal implementations of gomp_dynamic_max_threads and
omp_get_num_procs.
* config/nvptx/proc.c: New.
---
libgomp/config/nvptx/proc.c | 40
1 file changed, 40 insertions(+)
diff --git a/libgomp/config/nvptx/proc.c b/libgomp/config/n
(note to reviewers: I'm not sure what we're after here, on the high level;
will be happy to rework the patch in a saner manner based on feedback, or even
drop it for now)
At the moment the attribute setting logic in omp-low.c is such that if a
function that should be present in target code does no
NVPTX provides vprintf, but there's no stream separation: everything is
printed as if to stdout. This is the minimal change to get error.c working.
* error.c [__nvptx__]: Replace vfprintf, fputs, fputc with [v]printf.
---
libgomp/error.c | 5 +
1 file changed, 5 insertions(+)
diff
The approach I've taken in libgomp/nvptx is to have a single entry point,
gomp_nvptx_main, that can take care of initial allocation, transferring
control to target region function, and finalization.
At the moment it has the prototype:
void gomp_nvptx_main(void (*fn)(void*), void *fndata);
but it'
(This patch serves as a straw man proposal to have something concrete for
discussion and further patches)
On PTX, stack memory is private to each thread. When master thread constructs
'omp_data_o' on its own stack and passes it to other threads via
GOMP_parallel by reference, other threads cannot
This patch ports team.c to nvptx by arranging an initialization/cleanup
routine, gomp_nvptx_main, that all (pre-started) threads can run. It
initializes a thread pool and proceeds to run gomp_thread_start in all threads
except thread zero, which runs original target region function.
Thread-privat
Note: this patch will have to be more complex if we go with 'approach 2'
described in a later patch, 07/14 "launch target functions via gomp_nvptx_main".
For OpenMP offloading, libgomp invokes 'gomp_nvptx_main' as the accelerator
kernel, passing it a pointer to outlined target region function. Th
On NVPTX, we don't need most of target.c functionality, except for GOMP_teams.
Provide it as a copy of the generic implementation for now (it most likely
will need to change down the line: on NVPTX we do need to spawn several
thread blocks for #pragma omp teams).
Alternatively, it might make sense
On NVPTX, there are 16 hardware barriers for each thread team, and each barrier
has a variable waiter count. The instruction 'bar.sync N, M;' waits on barrier
number N until M threads have arrived. M should be pre-multiplied by the warp
width. It's also possible to 'post' the barrier without susp
On Tue, 20 Oct 2015, Bernd Schmidt wrote:
> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > Due to special treatment of types, emitting variables of type _Bool in
> > global scope is impossible: extern references are emitted with .u8, but
> > definitions use .u64.
On Tue, 20 Oct 2015, Bernd Schmidt wrote:
> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > This allows to emit decls in 'shared' memory from the middle-end.
> >
> > * config/nvptx/nvptx.c (nvptx_legitimate_address_p): Adjust prototype.
> >
On Tue, 20 Oct 2015, Bernd Schmidt wrote:
> On 10/20/2015 08:34 PM, Alexander Monakov wrote:
> > 2. Make gomp_nvptx_main a device (.func) function. To have that work, we'd
> > need to additionally emit a "trampoline" of sorts in the NVPTX backend. For
> &g