Thanks for your reply.

Now, I really want to know whether clflush instruction evicts all memory line 
including the Last Level Cache(LLC), if so, hurt performance defenitly.

Here is paper I searched which said clflush indeed did like that.
http://eprint.iacr.org/2013/448.pdf







At 2014-12-03 13:57:36, "Zhigang Gong" <[email protected]> wrote:
>On Wed, Dec 03, 2014 at 01:46:51PM +0800, spring_wind wrote:
>> 
>> 
>> I noticed that in kernel's source(3.12.13) drm_clflush_page used clflush 
>> instruction.
>> I wonder whether the clfush instruction only push data to LLC or RAM? If it 
>> is RAM, is it needless since CPU and GPU share LLC.
>
>Right, they share the LLC, but doesn't share other level of caches,
>so when swtich from CPU domain to GPU domain, CPU needs to flush CPU's
>L1/L2 cache to LLC cache. When switch from GPU domain to CPU domain,
>GPU needs to flush GPU's L3 cache to LLC cache.
>
>Please be noted, the L3 cache I mentioned is in GPU domain.
>> 
>> We are planning to use beignet to accelerate video analysis and image 
>> processing. The data flow will involve input and output between GPU and CPU, 
>> we expect it will be not a bottleneck.
>Although cache flush between L3 <--> LLC <--> L2/L1 is not as heavy
>as the flush between LLC and memory, it's still not free. Especially
>if your application need to do that domain switch freqently.
>
>> 
>> 
>> 
>> At 2014-12-03 10:36:28, "Zhigang Gong" <[email protected]> wrote:
>> >On Wed, Dec 03, 2014 at 10:44:26AM +0800, spring_wind wrote:
>> >> 1. clCreateBuffer(CL_MEM_USE_HOST_PTR)
>> >> 2. clMapxx-->change the data pointed by host ptr--->clUnmapxxx
>> >> 
>> >> Is that right?
>> >The usual use scenario is as below:
>> >
>> >1. clCreateBuffer(CL_MEM_USE_HOST_PTR)
>> >2. clEnqueueNDRangeKernel() to enqueue some kernel to do some
>> >   computation on GPU side and write to the above buffer.
>> >3. clEnqueueMapBuffer() the buffer, and do some computation
>> >   on CPU side and modify the buffer content pointed by host_ptr.
>> >4. clEnqueueUnmapBuffer() to give the buffer back to GPU.
>> >
>> >The step 3 and 4 are synchronization points defined by OpenCL
>> >spec which are mandatory on all platforms to access a buffer
>> >pointed by host_ptr shared by GPU and CPU. At step3, not only
>> >a GPU cache flush (flush from GPU's L3 cache to CPU's LLC cache),
>> >but also need to wait for all the kernel enqueued in step 2
>> >to be finished.
>> >
>> >
>> >> 
>> >> I think clUnmapxx must contain cache flush operation although it may not 
>> >> involve extra copy, correct me if am wrong.
>> >
>> >You are right.
>> >The cache flush is always a must when we switch between GPU
>> >domain and CPU domain. But you can see, the above flush is
>> >not so heavy, as it's not a flush between cache and slow
>> >memory.
>> >
>> >> I am very concerned about the beignet performance on our product.
>> >
>> >You are welcome to share your detail concerns with us if possible.
>> >And there are also some performance hint in the 
>> >'docs/optimization-guide.mdwn'.
>> >We will continue to update that document.
>> >
>> >Thanks,
>> >Zhigang Gong.
>> >
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 在 2014-12-03 10:14:10,"Guo, Yejun" <[email protected]> 写道:
>> >> 
>> >> 
>> >> Just need to invoke clMapxx function before you access the host ptr from 
>> >> CPU side, and invoke clUnmapxx function after the CPU access. There is no 
>> >> extra copy inside clMap/Unmapxx if all the conditions are satisfied. You 
>> >> can refer to the code in runtime_use_host_ptr_buffer.cpp under 
>> >> beignet/utests.
>> >> 
>> >>  
>> >> 
>> >>  
>> >> 
>> >> From: spring_wind [mailto:[email protected]]
>> >> Sent: Wednesday, December 03, 2014 10:02 AM
>> >> To: Guo, Yejun
>> >> Cc:[email protected]
>> >> Subject: Re:RE: [Beignet] CL_MEM_USE_HOST_PTR involve extra copy?
>> >> 
>> >>  
>> >> 
>> >> You mean if I use CL_MEM_USE_HOST_PTR and host ptr is page aligned, I 
>> >> change the data pointed by host ptr from CPU side, I don't have to do any 
>> >> flush opertaion or I should still call clMapxx function?
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 在 2014-12-03 08:37:15,"Guo, Yejun" <[email protected]> 写道:
>> >> 
>> >> 
>> >> 
>> >> Hi,
>> >> 
>> >>  
>> >> 
>> >> please check the latest code of beignet, there is no copy needed between 
>> >> CPU and GPU if the host_ptr provided by application is page aligned, and 
>> >> the page align limitation is expected to be removed some days later. You 
>> >> can also try CL_MEM_ALLOC_HOST_PTR to avoid the extra copy without align 
>> >> limitation.
>> >> 
>> >>  
>> >> 
>> >> Btw, this zero-copy is supported starting from linux kernel 3.16 and 
>> >> libdrm 2.4.58.
>> >> 
>> >>  
>> >> 
>> >>  
>> >> 
>> >> From: Beignet [mailto:[email protected]] On Behalf Of 
>> >> spring_wind
>> >> Sent: Tuesday, December 02, 2014 8:38 PM
>> >> To:[email protected]
>> >> Subject: [Beignet] CL_MEM_USE_HOST_PTR involve extra copy?
>> >> 
>> >>  
>> >> 
>> >> Hi:
>> >> 
>> >> Intel opencl optimization guide said using CL_MEM_USE_HOST_PTR can avoid 
>> >> extra copy between CPU and GPU, but I noticed that in beignet's 
>> >> implementation source it was not like that:
>> >> 
>> >> /* Copy the data if required */
>> >> 
>> >>   if (flags & CL_MEM_COPY_HOST_PTR || flags & CL_MEM_USE_HOST_PTR)
>> >> 
>> >>     cl_buffer_subdata(mem->bo, 0, sz, data);
>> >> 
>> >>  
>> >> 
>> >> Could someone give me an answer?
>> >> 
>> >>  
>> >> 
>> >>  
>> >
>> >> _______________________________________________
>> >> Beignet mailing list
>> >> [email protected]
>> >> http://lists.freedesktop.org/mailman/listinfo/beignet
>> >
>> >_______________________________________________
>> >Beignet mailing list
>> >[email protected]
>> >http://lists.freedesktop.org/mailman/listinfo/beignet
>_______________________________________________
>Beignet mailing list
>[email protected]
>http://lists.freedesktop.org/mailman/listinfo/beignet
_______________________________________________
Beignet mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/beignet

Reply via email to