Craig [et al.], this is also how I understand it. One could realistically wrap the standard MPI calls to do this for you:
    MPI_GPU_Bcast(...) {
        malloc_some_stuff
        pull_mem_from_gpu
        MPI_Bcast(...)
        free_some_stuff
    }

...just a random thought, though. (A slightly fleshed-out, untested sketch is at the bottom of this message, below the quoted thread.)

On Fri, 2008-06-20 at 08:13 -0600, Craig Tierney wrote:
> Kilian CAVALOTTI wrote:
> > On Thursday 19 June 2008 04:32:11 pm Chris Samuel wrote:
> >> ----- "Kilian CAVALOTTI" <[EMAIL PROTECTED]> wrote:
> >>> AFAIK, the multi-GPU Tesla boxes contain up to 4 Tesla processors,
> >>> but are hooked to the controlling server with only 1 PCIe link,
> >>> right? Does this spell "bottleneck" to anyone?
> >>
> >> The nVidia website says:
> >>
> >> http://www.nvidia.com/object/tesla_tech_specs.html
> >>
> >> # 6 GB of system memory (1.5 GB dedicated memory per GPU)
> >
> > The latest S1070 has even more than that: 4 GB per GPU, it seems,
> > according to [1].
> >
> > But I think this refers to the "global memory", as described in [2]
> > (slide 12, "Kernel Memory Access"). It's the graphics card's main
> > memory, the kind that is used to store textures in games, for instance.
> > Each GPU core also has what they call "shared memory", which is
> > really only shared between threads on the same core (it's more like an
> > L2 cache, actually).
> >
> >> So my guess is that you'd be using local RAM, not the
> >> host system's RAM, whilst computing.
> >
> > Right, but at some point you do need to transfer data from the host
> > memory to the GPU memory, and back. That's where there's probably a
> > bottleneck if all 4 GPUs want to read/dump data from/to the host at the
> > same time.
> >
> > Moreover, I don't think that the different GPUs can work together, i.e.
> > exchange data and participate in the same parallel computation. Unless
> > they release something along the lines of a CUDA-MPI, those 4 GPUs
> > sitting in the box would have to be considered as independent
> > processing units. So as I understand it, the scaling benefits from your
> > application's parallelization would be limited to one GPU, no matter
> > how many you have hooked to your machine.
>
> You can integrate MPI with CUDA and create parallel applications. CUDA is
> just a preprocessor that uses the local C compiler (gcc for Linux by default).
> I have seen some messages on the MVAPICH mailing list talking about users
> doing this.
>
> Since the memory is on the card, you have to transfer back to the host
> before you can send it via an MPI call. However, if your entire model
> can fit in the GPU's memory (which is why the 4 GB S1070 Tesla card is
> useful), then you should be able to pull down the portion of memory
> from the GPU you want to send out, then send it.
>
> Or at least that's how I understand it. When I get my systems I will
> get to figure out the "real" details.
>
> Craig
>
> > I don't even know how you choose (or even if you can choose) which
> > GPU your code gets executed on. It has to be handled by the
> > driver on the host machine somehow.
> >
> >> There's a lot of fans there..
> >
> > They probably get hot. At least the G80s do. They say "Typical Power
> > Consumption: 700W" for the 4-GPU box. Given that a modern gaming rig
> > featuring a pair of 8800GTXs in SLI already requires a 1 kW PSU, I would
> > put this on the optimistic side.
> >
> > [1] http://www.nvidia.com/object/tesla_s1070.html
> > [2] http://www.mathematik.uni-dortmund.de/~goeddeke/arcs2008/C1_CUDA.pdf
> >
> > Cheers,
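To make the random thought at the top a bit more concrete, something like the following is roughly what I had in mind. It's only a sketch I haven't compiled or run: the wrapper name comes from my pseudocode above, the float payload is arbitrary, and it assumes nothing beyond the plain CUDA runtime API (cudaMemcpy) plus a standard MPI_Bcast.

    #include <stdlib.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Untested sketch: broadcast 'count' floats that live in GPU (device)
     * memory on the root rank.  MPI only knows about host memory, so each
     * rank stages the data through a temporary host buffer. */
    int MPI_GPU_Bcast(float *d_buf, int count, int root, MPI_Comm comm)
    {
        int rank;
        float *h_buf;

        MPI_Comm_rank(comm, &rank);

        /* malloc_some_stuff */
        h_buf = (float *) malloc(count * sizeof(float));
        if (h_buf == NULL)
            return MPI_ERR_OTHER;

        /* pull_mem_from_gpu: only the root actually has data to send */
        if (rank == root)
            cudaMemcpy(h_buf, d_buf, count * sizeof(float),
                       cudaMemcpyDeviceToHost);

        /* ordinary MPI broadcast on the host copies */
        MPI_Bcast(h_buf, count, MPI_FLOAT, root, comm);

        /* push the received data up to every other rank's GPU */
        if (rank != root)
            cudaMemcpy(d_buf, h_buf, count * sizeof(float),
                       cudaMemcpyHostToDevice);

        /* free_some_stuff */
        free(h_buf);
        return MPI_SUCCESS;
    }

As for Kilian's question about choosing which GPU the code runs on: the CUDA runtime does expose cudaGetDeviceCount() and cudaSetDevice(), so each MPI rank could in principle pick its own card (say, by its rank on the node) before touching device memory. I haven't tried that on a multi-GPU box myself, so take it with a grain of salt.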