Hi Xiuli, Apologies if I were not clear enough with my questions.
1. By saying that there is only 64K local memory, I assume you meant to hint that this is a scarce resource. We use only ~2KB local memory per work group required mostly for prefetching and reduction across work-group. By disabling pre-fetching we could get this down to a minimum of about 800 bytes. However, do you expect that to help in any way? Is the hardware capable of keeping in flight >30-32 waves of 64 threads? Also, I think I'm lacking some detailed knowledge as I do not see how is this related to the drm_intel_gem_bo_context_exec() issue. 2. As mentioned above, I have local work size = 64 and rely on splitting the work over the global grid (so small workloads will have 100s, large ones 10000s larger global work size). 3. Do you mean that I should *not* expect concurrency between CPU and GPU to be possible with beignet and clEnqueueWriteBuffer, clEnqueuereadBuffer, and clEnqueueNDRangeKernel will exhibit blocking behavior? I have not had time to file a bugzilla yet with reproduction details, sorry about that. The GROMACS source you'll need is in the master branch, plus the https://gerrit.gromacs.org/#/c/5752/2 change under review fixes some execution width assumptions. Other than that you'll need a small patch to enable Intel iGPUs (e.g found here https://bugs.freedesktop.org/show_bug.cgi?id=94265 which is BTW another bug on IVB). You'll also run into the include path issue I mentioned before for which you'll need to activate the Apple workaround, here's a patch: diff --git a/src/gromacs/gpu_utils/gpu_utils_ocl.cpp b/src/gromacs/gpu_utils/gpu_utils_ocl.cpp index 2084d8c..8928582 100644 --- a/src/gromacs/gpu_utils/gpu_utils_ocl.cpp +++ b/src/gromacs/gpu_utils/gpu_utils_ocl.cpp @@ -131,6 +131,8 @@ static int is_gmx_supported_gpu_id(struct gmx_device_info_t *ocl_gpu_device) return egpuCompatible; case OCL_VENDOR_AMD: return runningOnCompatibleOSForAmd() ? egpuCompatible : egpuIncompatible; + case OCL_VENDOR_INTEL: + return egpuCompatible; default: return egpuIncompatible; } diff --git a/src/gromacs/gpu_utils/ocl_compiler.cpp b/src/gromacs/gpu_utils/ocl_compiler.cpp index 6a4772a..9aa3c1e 100644 --- a/src/gromacs/gpu_utils/ocl_compiler.cpp +++ b/src/gromacs/gpu_utils/ocl_compiler.cpp @@ -747,7 +747,8 @@ ocl_get_build_options_string(cl_context context, * OpenCL implementations are happy with. Since the standard still says * it should be quoted, we handle Apple as a special case. */ -#ifdef __APPLE__ +//#ifdef __APPLE__ +#if 1 std::string unescaped_ocl_root_path = get_ocl_root_path(); std::string ocl_root_path; Additionally, here's an input file you'll need to be able to start the program: https://www.dropbox.com/s/hm5t90iwo3xw5ws/water-48k-frozen.tpr?dl=0 which you can do with the following command: /PATH/gmx mdrun -s water-48k-frozen Let me know if something is unclear. Thanks for the help! Cheers, -- Szilárd On Thu, Mar 31, 2016 at 5:34 AM, Pan, Xiuli <[email protected]> wrote: > Hi Szilárd, > > > > Since you have some questions and I could not reproduce them here I just > make some response that I think may related to this problem: > > 1. We only have 64K Share local memory for all of the work groups > > 2. The drm_intel_gem_bo_context_exec() failed have a lot of reasons, > could give us the detail about your test about the execution wide? > > 3. As far as I know most enqueue in beignet default to be > blocking(some related to GPU is not blocking) , you can see that api > clFlush is actually an empty function. > > > > Also I am trying to reproduce your bug here and I am setting up GROMACS. > Is there anything I should know to run it with beignet? > > > > Thanks > > Xiuli > > > > *From:* Beignet [mailto:[email protected]] *On Behalf > Of *Szilárd Páll > *Sent:* Thursday, March 31, 2016 3:14 AM > *To:* [email protected] > *Subject:* Re: [Beignet] GROMACS on beignet > > > > Hello again, > > > > I have been trying to verify whether there may be assumptions >=32-wide > execution hiding in the kernels (in particular in code that's using local > memory for prefetching or reduction) and tried dropping in mem fences to > test a few things, but at several points I managed to trigger the > aforementioned error: > > drm_intel_gem_bo_context_exec() failed: Input/output error > > > > Is this a known issues? There have been reports of it, but perhaps it is > just the manifestation of multiple possible issues? > > > > Secondly, I do not see the reason why I get blocking behavior of all > enqueue operations (and I don't get this on NVIDIA or AMD). Are there any > peculiarities I should be aware of? > > > > Cheers, > > > -- > Szilárd > > > > On Mon, Mar 28, 2016 at 1:49 AM, Szilárd Páll <[email protected]> > wrote: > > Hi Xiuli, > > > > Thanks for the quick reply! > > > > On Fri, Mar 25, 2016 at 4:06 AM, Pan, Xiuli <[email protected]> wrote: > > Hi Szilárd, > > > > What do you mean about quoted includes? > > > > I mean -I"/path/to/headers" does not work, but -I/path/to/headers does. > > > > If you mean the include in kernels, I think we may have some problem with > that. The *.cl we used for clang actually was a copied tmp version stored > not in where is used to be. So I think if you just put what need to be > included in the old place, clang could not find it. You could try a > workaround to pass “-I where/your/header/is” as a build option to > clBuildProgram. > > > > Then if you have some double types used on Haswell it may have some > problem. The hardware for HSW does not support double very well as we have > refined our double support to hardware then, so HSW may have some issues > with double type. If it is not the problem with double float, you can send > your kernel as an attachment or report a bug on our Bugzilla( > https://bugs.freedesktop.org) and we will tried to fix it. > > > > No double precision in the kernels. > > > > For now I'll post here, I feel like a bug report may be an overkill - > especially as I can't provide a full repro case that does not involve > building the entire application. > > > > I've attached a minimum set of source files that's needed to compile. We > have pretty heavy preprocessor use that generates kernels for the different > inputs / outputs / computation combinations, so one particular flavor > that's known to produce incorrect results is generated compiling > nbnxn_ocl_kernels.cl with the following flags: > > > > -D_WARPLESS_SOURCE_ -DGMX_OCL_FASTGEN -DEL_RF -DEELNAME=_ElecRF > -DLJ_COMB_GEOM -DVDWNAME=_VdwLJCombGeom -DCENTRAL=22 > -DNBNXN_GPU_NCLUSTER_PER_SUPERCLUSTER=8 -DNBNXN_GPU_CLUSTER_SIZE=8 > -DNBNXN_GPU_JGROUP_SIZE=4 -DNBNXN_AVOID_SING_R2_INC=1.0e-12f > > > > > > Additionally I had a closer look and so far I have observed three issues > (additional to the minor include issue mentione before): > > > > 1. If I do a manual prefetch into local memory followed by a mem fence > (seenbnxn_ocl_kernel_nowarp.clh line 339), I get the following error: > > drm_intel_gem_bo_context_exec() failed: Input/output error > > The next kernel call then fails with CL_OUT_OF_RESOURCES. > > Without the manual prefetch it works better, but... > > > > 2. The results produced by the kernel are still somewhat off. It could be > that I missed a subtle detail and the kernels still do not conform to the > hardware's execution model. I'm very familar with Intel's hardware and > these kernels were originally designed for 32/64 wide execution. > > > > 3. All task enqueue calls seem to be blocking. > > > > > > Thanks & Cheers, > > -- > > Szilárd > > > > > > Thanks > > Xiuli > > > > *From:* Beignet [mailto:[email protected]] *On Behalf > Of *Szilárd Páll > *Sent:* Friday, March 25, 2016 7:16 AM > *To:* [email protected] > *Subject:* [Beignet] GROMACS on beignet > > > > Hi, > > > > I am a developer of the GROMACS (www.gromacs.org) molecular dynamics > simulation package. We have OpenCL offload for some of the > compute-intensive kernels which that works very well on AMD. I wanted to > assess how feasible is to use an Intel iGPU in GROMACS and after jumping > through some hoops I got a 4.2 kernel and beignet master installed. > > > > Then I ran into the first minor issue: it seems that beignet does not > accept quoted includes although AFAIK the double-quoted include paths > should be accepted, but that did not work. No big deal, it doesn't work > with Apple's OpenCL either, but I thought I'd ask. > > > > However, the bigger issue is that running on Haswell (HD 4600, I think) > the kernel produces results that are very off (while the very same source > gives correct results on other platforms). I've not much time to dig > deeper, but I thought I'd drop a mail maybe somebody is interested in > helping out with tips or even tracking down where the issue is. > > > > Suggestions would be welcome! > > > > Cheers, > > -- > Szilárd > > > > >
_______________________________________________ Beignet mailing list [email protected] https://lists.freedesktop.org/mailman/listinfo/beignet
