Hi Cesar!

On Mon, 13 Feb 2017 08:58:39 -0800, Cesar Philippidis <ce...@codesourcery.com> wrote:
> This patch does the following:
>
> * Adjusts the default num_gangs to utilize more of the GPU hardware.
> * Teaches libgomp to emit a diagnostic when num_workers isn't supported.
>
> [...]
Thanks!

> This patch has been applied to gomp-4_0-branch.

For easier review, I'm quoting here your r245393 commit, with whitespace
changes ignored:

> --- libgomp/plugin/plugin-nvptx.c
> +++ libgomp/plugin/plugin-nvptx.c
> @@ -917,10 +918,15 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>         seen_zero = 1;
>       }
>
> -  if (seen_zero)
> -    {
> -      /* See if the user provided GOMP_OPENACC_DIM environment
> -	 variable to specify runtime defaults.  */
> +  /* Both reg_granularity and warp_granularity were extracted from
> +     the "Register Allocation Granularity" in Nvidia's CUDA Occupancy
> +     Calculator spreadsheet.  Specifically, this required SM_30+
> +     targets.  */
> +  const int reg_granularity = 256;

That is per warp, so a granularity of 256 / 32 = 8 registers per
thread.  (It would be strange otherwise.)

> +  const int warp_granularity = 4;
> +
> +  /* See if the user provided GOMP_OPENACC_DIM environment variable to
> +     specify runtime defaults.  */
>    static int default_dims[GOMP_DIM_MAX];
>
>    pthread_mutex_lock (&ptx_dev_lock);
> @@ -952,25 +958,30 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
> 	  CUdevice dev = nvptx_thread()->ptx_dev->dev;
> 	  /* 32 is the default for known hardware.
> 	     */
> 	  int gang = 0, worker = 32, vector = 32;
> -	  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
> +	  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm, cu_rf, cu_sm;
>
> 	  cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
> 	  cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
> 	  cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
> 	  cu_tpm = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
> +	  cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
> +	  cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
>
> 	  if (cuDeviceGetAttribute (&block_size, cu_tpb, dev) == CUDA_SUCCESS
> 	      && cuDeviceGetAttribute (&warp_size, cu_ws, dev) == CUDA_SUCCESS
> 	      && cuDeviceGetAttribute (&dev_size, cu_mpc, dev) == CUDA_SUCCESS
> -	      && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS)
> +	      && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS
> +	      && cuDeviceGetAttribute (&rf_size, cu_rf, dev) == CUDA_SUCCESS
> +	      && cuDeviceGetAttribute (&sm_size, cu_sm, dev) == CUDA_SUCCESS)

Trying to compile this with CUDA 5.5/331.113, I run into:

    [...]/source-gcc/libgomp/plugin/plugin-nvptx.c: In function 'nvptx_exec':
    [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: error: 'CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR' undeclared (first use in this function)
           cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: note: each undeclared identifier is reported only once for each function it appears in
    [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:971:16: error: 'CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR' undeclared (first use in this function)
           cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For reference, please see the code handling
CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR in the trunk
version of the nvptx_open_device function.
And then: I don't specifically have a problem with discontinuing CUDA
5.5 support and requiring 6.5, for example, but that should be a
conscious decision.

> @@ -980,8 +991,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
> 	     matches the hardware configuration.  Logical gangs are
> 	     scheduled onto physical hardware.  To maximize usage, we
> 	     should guess a large number.  */
> -	  if (default_dims[GOMP_DIM_GANG] < 1)
> -	    default_dims[GOMP_DIM_GANG] = gang ? gang : 1024;

That's "bad", because a non-zero "default_dims[GOMP_DIM_GANG]" (also
known as "default_dims[0]") is used to decide whether to enter this
whole code block, and with that assignment removed, every call of the
nvptx_exec function will now re-do all this GOMP_OPENACC_DIM parsing,
the cuDeviceGetAttribute calls, the computations, and so on.  (See
"GOMP_DEBUG=1" output.)

I think this whole code block should be moved into the
nvptx_open_device function, so that it is executed once, when the
device is opened -- after all, these are all per-device attributes.
(So, it's actually conceptually incorrect to have this done only once
in the nvptx_exec function, given that this data is then used in the
same process by/for potentially different hardware devices.)

And one could argue that the GOMP_OPENACC_DIM parsing conceptually
belongs in generic libgomp code instead of the nvptx plugin.  (But
that aspect can be cleaned up later: currently, the nvptx plugin is
the only one supporting/using it.)

> 	  /* The worker size must not exceed the hardware.  */
> 	  if (default_dims[GOMP_DIM_WORKER] < 1
> 	      || (default_dims[GOMP_DIM_WORKER] > worker && gang))
> @@ -998,9 +1007,56 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
> 	}
>       pthread_mutex_unlock (&ptx_dev_lock);
>
> +  int reg_used = -1;  /* Dummy value.  */
> +  cuFuncGetAttribute (&reg_used, CU_FUNC_ATTRIBUTE_NUM_REGS, function);

Why is there a need to assign this "dummy value"?
Now, per my understanding, this "function attribute" must be constant
across all calls of that function.  So, shouldn't this be queried just
once, after "linking" the code (link_ptx function)?  And indeed,
that's what the trunk version of the nvptx plugin's
GOMP_OFFLOAD_load_image function is doing.

> +
> +  int reg_per_warp = ((reg_used * warp_size + reg_granularity - 1)
> +		      / reg_granularity) * reg_granularity;
> +
> +  int threads_per_sm = (rf_size / reg_per_warp / warp_granularity)
> +    * warp_granularity * warp_size;
> +
> +  if (threads_per_sm > cpu_size)
> +    threads_per_sm = cpu_size;

(It's too late right now for me to follow these computations.)

> +
> +  if (seen_zero)
> +    {
>        for (i = 0; i != GOMP_DIM_MAX; i++)
> 	 if (!dims[i])
> +	   {
> +	     if (default_dims[i] > 0)
> 	       dims[i] = default_dims[i];
> +	     else
> +	       switch (i)
> +		 {
> +		 case GOMP_DIM_GANG:
> +		   dims[i] = 2 * threads_per_sm / warp_size * dev_size;
> +		   break;

Please document that constant factor "2"; in fact, a comment on that
whole computation certainly wouldn't hurt.  And, as I suggested
internally, wouldn't a (moderately?) higher constant factor instead of
just "2" actually be better still, to ensure higher occupancy of the
hardware?

> +		 case GOMP_DIM_WORKER:
> +		 case GOMP_DIM_VECTOR:
> +		   dims[i] = warp_size;
> +		   break;

Using "warp_size" for "dims[GOMP_DIM_WORKER]" is conceptually wrong.

> +		 default:
> +		   abort ();
> +		 }
> +	   }
> +    }
> +
> +  /* Check if the accelerator has sufficient hardware resources to
> +     launch the offloaded kernel.  */
> +  if (dims[GOMP_DIM_WORKER] > 1)
> +    {

Per the previous computation, doesn't it always hold that
"dims[GOMP_DIM_WORKER]" is bigger than one?

> +      int threads_per_block = threads_per_sm > block_size
> +	? block_size : threads_per_sm;
> +
> +      threads_per_block /= warp_size;

(Again, given the late hour, I'm not trying to follow these
computations.)
> +
> +      if (dims[GOMP_DIM_WORKER] > threads_per_block)
> +	GOMP_PLUGIN_fatal ("The Nvidia accelerator has insufficient resources "
> +			   "to launch '%s'; recompile the program with "
> +			   "'num_workers = %d' on that offloaded region or "
> +			   "'-fopenacc-dim=-:%d'.\n",
> +			   targ_fn->launch->fn, threads_per_block,
> +			   threads_per_block);
>     }

ACK -- until we come up with a better solution.


Regards,
 Thomas