RE: [casper] PL data to PS DDR4 (AXI) {External}

Matthew Schiller Thu, 05 Oct 2023 08:06:00 -0700

Yeah, it depends on what you want to do with the data.  If the ARM is further 
processing the data, then DMA usually makes sense, because the ARM can access 
its own memory more quickly and use the L1/L2 cache for data in the PS memory. 
It also avoids the ARM spending clock cycles reading data from the PL 
and copying it into the PS memory (which is likely what will happen in some 
processing, say if you were executing an FFT in software or trying to generate 
an ethernet packet in software to send the data to a user display).

One copy isn’t too bad, but it can get ridiculous if you make more than one 
copy.  But there’s also how the copy is done… What you almost never want 
is a memcpy done via a software for loop or similar, because if you 
aren’t extremely careful coding that up in software it won’t even use the ARM 
DMA blocks buried in the processor to do it.  So if your software application 
ends up reading from the PL memory and writing to another block of memory (e.g. a 
software-defined array variable) in the PS, that’s probably a sign that DMA is 
the right thing to do, so the ARM isn’t spending processor cycles just copying 
data.  But whenever you do something with DMA… your complexity just shot 
through the roof, even if it can technically net performance gains.

But there are certainly cases where there is no appreciable performance gain, 
and the complexity is not warranted, e.g. if you aren’t heavily using the ARM 
processor for processing and just desperately want to read some I/Q data into a 
file on disk.  You might not care that the processor is 
spinning reading data in that case, especially since using DMA effectively in 
Linux for “direct storage” to an NVMe drive or SD card is, well, really 
only something the industry is just starting to do on x86 machines (e.g. Direct 
Storage with GPUs). So you are probably going to be forced to let software 
handle that anyway on ARM rather than writing very complicated driver-level 
code to do it….

Gah, complicated software stuff galore in that discussion…  Because properly 
handling data movement in embedded (software) systems for highest performance 
is, well, complicated….  And I probably shouldn’t talk much about it… I’m an FPGA 
engineer, not a kernel-level embedded software engineer.

Matthew Schiller
ngVLA Digital Backend Lead
NRAO

[email protected]
315-316-2032


From: 'Ross Martin' via [email protected] <[email protected]>
Sent: Thursday, October 5, 2023 10:46 AM
To: casper <[email protected]>
Subject: Re: [casper] PL data to PS DDR4 (AXI) {External}

DMA isn't always the best answer.

It's sometimes best to just leave the data in the PL and have the processor 
access it directly.

If the processor reads the data directly, it's just accessed once, and only the 
data you need is accessed.

If you transfer via DMA, it's read once by the DMA from the PL, written to PS 
memory once by the DMA, and then read again by the processor to do its 
processing.  Also, the DMA may have to transfer more data than the current 
processing actually needs, since it may need to account for contingencies.

So although DMA access *might* be faster access, it's definitely accessing the 
data more times.  It won't always be worth it.
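
To put rough numbers on that counting argument, here is a minimal sketch. The function names and the example sizes are made up for illustration, and it only counts bytes moved; it says nothing about how fast each kind of access is.

```python
def bytes_touched_direct(needed_bytes: int) -> int:
    """Processor reads only what it needs, once, straight from the PL."""
    return needed_bytes


def bytes_touched_dma(needed_bytes: int, transferred_bytes: int) -> int:
    """DMA reads the block from the PL, writes it to PS memory,
    then the processor reads the part it needs from PS memory."""
    if transferred_bytes < needed_bytes:
        raise ValueError("DMA must transfer at least the bytes that are needed")
    return 2 * transferred_bytes + needed_bytes


if __name__ == "__main__":
    needed = 4096        # hypothetical: bytes the processing actually uses
    transferred = 16384  # hypothetical: bytes the DMA moves to cover contingencies
    print("direct:", bytes_touched_direct(needed))
    print("dma:   ", bytes_touched_dma(needed, transferred))
```

With these made-up sizes the DMA path moves nine times as many bytes as the direct path, so whether it wins depends on how much cheaper each PS-memory access is than a PL access.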

DMA also adds an additional layer of software complexity.

As an example of the non-DMA solution, the demo I released for the RFSoC4x2 
pulls the data directly from the PL into the ARM without doing any DMA.

Regards,

Ross

On Thu, Oct 5, 2023, 3:06 PM Jack Hickish <[email protected]> wrote:
This seems like a fun "discuss at the workshop" topic!
I have a couple of applications where I think this functionality would be 
useful, so I'd definitely be interested in helping out.

From a toolflow side I think getting the automated instantiation of the DMA IP 
should be relatively straightforward. Handling what the CPU does to interact 
with the core, and/or how you might interact with the core remotely over a 
network I'm less sure about.

Cheers
Jack

On Thu, 5 Oct 2023 at 12:08, Matthew Schiller <[email protected]> wrote:
The right way to do what you describe is with the AXI DMA block, but as you 
point out that has a software interface to configure the transfer.  The main 
data would flow over an AXI4 “full” interface that supports burst transactions 
(the Xilinx-provided DMA block already does that), and the configuration of 
the DMA block comes from software over AXI4-Lite. There are two approaches 
(which should be supported by either using the correct DMA block or the correct 
settings on the DMA block).  A standard DMA block can be used if fixed 
addresses in memory can be allocated.  This would mean, for example, that the 
Linux kernel is told to only use ½ of the PS memory.  Software can still access 
the upper half, for example through /dev/mem reads, but the upper memory 
disappears from Linux for normal applications.  Alternatively, though more 
complicated, a “scatter-gather” DMA can be implemented.  A scatter-gather DMA 
uses a software driver/server that will “malloc” memory in the normal software 
way, and then provide pointers to that memory to the scatter-gather DMA.  
Because of the way virtual memory works, this is not as trivial as it sounds 
and requires several steps to accomplish, as the FPGA needs the physical, not 
virtual, address, and must respect the fact that memory is allocated in virtual 
memory in “pages” and not necessarily contiguously.
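
To illustrate the “pages, not necessarily contiguous” point, here is a rough sketch of turning the physical pages behind one buffer into a list of (address, length) entries of the kind a scatter-gather engine walks. Everything here is an assumption for illustration: the 4 KiB page size, the page-aligned buffer start, the example addresses, and the function name. It is not the descriptor format of any particular DMA core, and in a real system the physical addresses would come from a kernel driver, not from user code.

```python
from typing import List, Tuple

PAGE_SIZE = 4096  # assumed page size


def build_descriptors(page_addrs: List[int], total_len: int,
                      page_size: int = PAGE_SIZE) -> List[Tuple[int, int]]:
    """Turn the physical pages backing one virtually contiguous buffer
    into (physical_address, length) entries, merging neighbouring pages.

    Assumes the buffer starts on a page boundary.
    """
    descriptors: List[Tuple[int, int]] = []
    remaining = total_len
    for addr in page_addrs:
        if remaining <= 0:
            break
        chunk = min(page_size, remaining)
        if descriptors and descriptors[-1][0] + descriptors[-1][1] == addr:
            # This page follows the previous one in physical memory: extend it.
            prev_addr, prev_len = descriptors[-1]
            descriptors[-1] = (prev_addr, prev_len + chunk)
        else:
            # Physical gap: the engine needs a new entry.
            descriptors.append((addr, chunk))
        remaining -= chunk
    if remaining > 0:
        raise ValueError("not enough pages to cover the buffer")
    return descriptors


if __name__ == "__main__":
    # Hypothetical physical page addresses for a buffer just under 16 KiB.
    pages = [0x10000, 0x11000, 0x40000, 0x23000]
    for addr, length in build_descriptors(pages, 4 * PAGE_SIZE - 100):
        print(hex(addr), length)
```

In this made-up case a buffer that looks like one contiguous block to software becomes three separate physical regions for the hardware, which is the bookkeeping a scatter-gather setup has to do.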

sgDMA is better in many systems though, because Linux can still access all the 
memory, so when you aren’t recording data, for example, more complicated 
software applications can run.

I don’t believe this has been done yet in CASPER, but it is possible since 
these are standard Xilinx-provided blocks.  We just need to get the block 
instantiated properly in sysgen to accept an AXI streaming data stream from 
your DSP algorithm or the ADCs, and then on the ARM processor we need 
appropriate software/drivers to allocate memory and configure the DMA.

I think I heard a rumor that it was planned, but hasn’t been tackled yet.

With AXI DMA, you can probably get to around 20Gbit/sec or better (in theory 
probably as high as 40, depending on what speed the DDR4 trains to) transfer 
performance between the PL and PS memory.  Not that the little ARM on these 
FPGAs can do much with that speed of data, but for recording a snippet of data 
or something like that, it can allow some fairly significant sample rates of 
I/Q data, for example (at 8-bit I/Q that’s >1GSPS!).  If instead you did the 
register approach you mentioned, I would expect rates around 100Mbit/sec to be 
possible, and to achieve that the ARM processor will be going nuts, because 
AXI4-Lite tends to require the processor to spin (DMA frees the processor for 
other stuff, while polling registers takes time to accomplish).
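
As a quick check of the arithmetic behind those figures, here is a small sketch. It assumes “8-bit I/Q” means 8 bits of I plus 8 bits of Q, so 16 bits per complex sample, and it ignores any protocol or memory overhead. The function name is just for illustration.

```python
def sample_rate_sps(link_bits_per_sec: float,
                    bits_per_component: int = 8,
                    components: int = 2) -> float:
    """Complex samples per second that fit in a given link rate,
    ignoring all overhead."""
    return link_bits_per_sec / (bits_per_component * components)


if __name__ == "__main__":
    for label, rate in [("DMA, ~20 Gbit/s", 20_000_000_000),
                        ("DMA, ~40 Gbit/s", 40_000_000_000),
                        ("registers, ~100 Mbit/s", 100_000_000)]:
        print(f"{label}: {sample_rate_sps(rate) / 1e6:.2f} MSPS")
```

Under that assumption 20 Gbit/s works out to 1.25 GSPS, and the ~100 Mbit/s register path to about 6.25 MSPS.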

FWIW: ngVLA plans to create functionality like this in “pure” HDL, and given the 
current effort to use more VHDL/Verilog blocks in CASPER, ngVLA’s work may be 
useful in the future.   I hope to make progress on ngVLA’s approach later this 
calendar year. But ngVLA is on Intel FPGAs, so a porting process would still be 
required to get that into CASPER.

From: [email protected] <[email protected]> On Behalf Of Ken Semanov
Sent: Thursday, October 5, 2023 4:11 AM
To: [email protected]
Subject: [casper] PL data to PS DDR4 (AXI) {External}

Is there an obvious way to migrate data from the PL into memory that is mapped 
into the address space of the PS?   Ideally I would use axi_interconnect as 
shown at 
https://casper-toolflow.readthedocs.io/en/latest/axi4lite_documentation.html

A possible approach is to instantiate axi_dma within the PL, and the PL acts 
as the master during transfers. But the axi_dma exposes an AXI4-Lite slave port 
to the PS so that the PS configures and starts the transfers.   The receiving 
raw device would be the memory controller of the PS DDR4.     (Presumably the 
data is accessed later by software via the DMA engine).
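
For the “accessed later by software” step, one possibility (a sketch only, assuming the DMA has written into a fixed physical region that the kernel was told not to use) is to map that region from user space and copy the bytes out. The function name is hypothetical, and on a real system the path would be something like /dev/mem, which needs root and a known base address; the mapping call itself is Python's standard mmap module.

```python
import mmap
import os


def read_region(path: str, phys_offset: int, length: int) -> bytes:
    """Map `length` bytes starting at `phys_offset` of `path` read-only
    and return a copy of them."""
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (phys_offset // gran) * gran  # mmap offsets must be aligned
    delta = phys_offset - aligned           # where our data starts in the map
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, delta + length,
                       access=mmap.ACCESS_READ, offset=aligned) as mm:
            return mm[delta:delta + length]
    finally:
        os.close(fd)


if __name__ == "__main__":
    import tempfile

    # Stand-in for the reserved memory region: an ordinary file, so the
    # sketch can run without root or hardware.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(bytes(range(256)) * 64)
        name = f.name
    print(read_region(name, 16, 8).hex())
    os.remove(name)
```

This copies the data one more time on the processor side, so it suits grabbing a snippet rather than sustained streaming.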

Another approach would be to expose a single register, and perform this slowly 
word-by-word (without streaming or bursting.)

Is this plausible in CASPER,  or are steep changes required?

--
You received this message because you are subscribed to the Google Groups 
"[email protected]<mailto:[email protected]>" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to 
[email protected]<mailto:[email protected]>.
To view this discussion on the web visit 
https://groups.google.com/a/lists.berkeley.edu/d/msgid/casper/0424800a-035f-447f-92ed-07402b9d0239n%40lists.berkeley.edu<https://groups.google.com/a/lists.berkeley.edu/d/msgid/casper/0424800a-035f-447f-92ed-07402b9d0239n%40lists.berkeley.edu?utm_medium=email&utm_source=footer>.

