Hi Peter,

Unfortunately it's a GeForce 9400M, and running with smaller dimensions works 
fine, so the watchdog timeout does seem to explain the problem. Is there a 
standard GPU pattern for deconstructing nested loops like this to get around 
the time limit?
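
The only thing I've come up with so far is hoisting the outermost loop out of 
the kernel and launching once per iteration, roughly along the lines below. 
The chunked kernel and the extra `i` argument are just my guess at how to 
split the work (they're not in my original code, and `i` is unused in this 
toy body, as in the original):

from pycuda import compiler, gpuarray, driver
import pycuda.autoinit
import numpy as np

NV = 5632
DIM = 2031
BLOCK_SIZE = 16
GRID_SIZE = 352   # BLOCK_SIZE * GRID_SIZE == NV

kernel_code = '''
 __global__ void NearestPrototypeChunk(float *gj, float *nearest, int i)
 {
   // Same body as before, minus the outer i loop; i now comes from the host.
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   float value = nearest[idx];   // accumulate across launches

   for (int j = 0; j < 8; ++j) {
     for (int l = 0; l < 2031; ++l) {
       float wp_l = gj[idx * 2031 + l];
       if (wp_l > 0.f) {
         for (int k = 0; k < 2031; ++k) {
           value += 0.01f;
         }
       }
     }
   }

   nearest[idx] = value;
 }
'''

gj = gpuarray.to_gpu(np.random.rand(NV, DIM).astype(np.float32))
nearest = gpuarray.zeros(NV, np.float32)

chunk = compiler.SourceModule(kernel_code).get_function("NearestPrototypeChunk")
for i in range(8):
    # one short launch per outer iteration instead of one huge kernel
    chunk(gj, nearest, np.int32(i),
          block=(BLOCK_SIZE, 1, 1), grid=(GRID_SIZE, 1))
    driver.Context.synchronize()

a = nearest.get()

Each launch would then only do an eighth of the work, which should stay under 
the watchdog, but I don't know whether that's the idiomatic way to do it.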

        Cheers,

        Matthew



On Jun 20, 2011, at 11:46 PM, Peter Schmidtke wrote:

> Dear Matthew,
> 
> first of all, what GPU are you using? A GeForce, a Tesla, or something
> like that?
> 
> I see a quadruply nested loop in your kernel. If it executes quickly
> that's no problem in itself, but be careful: on a GeForce that also
> drives the display there is a watchdog limit on how long a kernel may
> run (about 5 s, I think), so the kernel cannot run indefinitely. I
> believe that limit does not apply to dedicated cards like the Teslas.
> Try executing something shorter and see if it works, then add the
> sub-loops back one at a time and you'll see where it becomes too slow.
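> 
> By the way, you can ask the driver whether the watchdog applies to your
> card; I think the attribute is called KERNEL_EXEC_TIMEOUT, something like:
> 
> import pycuda.driver as driver
> import pycuda.autoinit
> 
> dev = pycuda.autoinit.device
> # 1 means the display watchdog can kill long-running kernels on this device
> print dev.get_attribute(driver.device_attribute.KERNEL_EXEC_TIMEOUT)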
> 
> ++
> 
> Peter
> 
> 
> On 06/21/2011 08:05 AM, Matthew Graham wrote:
>> Hi,
>> 
>> I'm having problems with big arrays; this crashes horribly:
>> 
>> from pycuda import driver, compiler, gpuarray, tools
>> from pycuda.curandom import rand as curand
>> import numpy as np
>> import pycuda.autoinit
>> 
>> NV = 5632 # No of vectors
>> DIM = 2031 # No of dimensions per vector
>> BLOCK_SIZE = 16 # 512 max threads per mp
>> GRID_SIZE = 352
>> HEIGHT = 8
>> WIDTH = 8
>> 
>> gj = curand((NV, DIM))
>> nodes = curand((HEIGHT, WIDTH, DIM)) # Each row is a vector
>> dissim = curand((DIM,DIM))
>> nearest = gpuarray.zeros((NV), np.float32)
>> 
>> kernel_code = '''
>>  __global__ void NearestPrototypeKernel(float *gj, float *nodes, float 
>> *dissim, float *nearest)
>>  {
>>    // Element
>>    int idx = blockIdx.x * blockDim.x + threadIdx.x;
>>    float value = 0.;
>> 
>>    for (int i = 0; i < 8; ++i) {
>>      for (int j = 0; j < 8; ++j) {
>>        for (int l = 0; l < 2031; ++l) {
>>          float wp_l = gj[idx * 2031 + l];
>>          if (wp_l > 0) {
>>            for (int k = 0; k < 2031; ++k) {
>>              value += 0.01;
>>            }
>>          }
>>        }
>>      }
>>    }
>> 
>>    nearest[idx] = value;
>>  }
>> '''
>> mod = compiler.SourceModule(kernel_code)
>> npker = mod.get_function("NearestPrototypeKernel")
>> npker(
>>    # inputs
>>    gj, nodes, dissim,
>>    # output
>>    nearest,
>>    # block of multiple threads
>>    block = (BLOCK_SIZE, 1, 1),
>>    # grid of blocks
>>    grid = (GRID_SIZE, 1)
>> )
>> a = nearest.get()
>> 
>> ---
>> 
>> with this:
>> 
>> /Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/compiler.py:122:
>>  UserWarning: The CUDA compiler suceeded, but said the following:
>> ptxas /tmp/tmpxft_000003e5_00000000-2_kernel.ptx, line 93; warning : Double 
>> is not supported. Demoting to float
>> 
>>  +stdout+stderr)
>> Traceback (most recent call last):
>>  File "kernel_test.py", line 54, in <module>
>>    a = nearest.get()
>>  File 
>> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/gpuarray.py",
>>  line 177, in get
>>    drv.memcpy_dtoh(ary, self.gpudata)
>> pycuda._driver.LaunchError: cuMemcpyDtoH failed: launch timeout
>> Error in atexit._run_exitfuncs:
>> Traceback (most recent call last):
>>  File 
>> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/atexit.py",
>>  line 24, in _run_exitfuncs
>>    func(*targs, **kargs)
>>  File 
>> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/autoinit.py",
>>  line 13, in _finish_up
>>    context.pop()
>> LaunchError: cuCtxPopCurrent failed: launch timeout
>> Error in sys.exitfunc:
>> Traceback (most recent call last):
>>  File 
>> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/atexit.py",
>>  line 24, in _run_exitfuncs
>>    func(*targs, **kargs)
>>  File 
>> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/autoinit.py",
>>  line 13, in _finish_up
>>    context.pop()
>> pycuda._driver.LaunchError: cuCtxPopCurrent failed: launch timeout
>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>> cuMemFree failed: invalid context
>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>> cuMemFree failed: invalid context
>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>> cuMemFree failed: invalid context
>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>> cuModuleUnload failed: invalid context
>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>> cuMemFree failed: invalid context
>> -------------------------------------------------------------------
>> PyCUDA ERROR: The context stack was not empty upon module cleanup.
>> -------------------------------------------------------------------
>> A context was still active when the context stack was being
>> cleaned up. At this point in our execution, CUDA may already
>> have been deinitialized, so there is no way we can finish
>> cleanly. The program will be aborted now.
>> Use Context.pop() to avoid this problem.
>> -------------------------------------------------------------------
>> Abort
>> 
>> ----
>> 
>> Does anybody have any idea how to get around this?
>> 
>>      Cheers,
>> 
>>      Matthew
>> 
> 
> 
> -- 
> 
> Peter Schmidtke
> PhD Student
> Dept. Physical Chemistry
> Faculty of Pharmacy
> University of Barcelona
> 


_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
