Re: [Beowulf] Parallella Epiphany performance

atchley tds.net Sat, 07 Jun 2014 13:17:20 -0700

Adapteva's CEO, Andreas Olofsson, gave a well-attended talk Friday at ORNL
on how to program a 16,000-core chip. It was interesting, though it was
more about the architecture and design choices than about actually
programming a 16K-core chip. The work is most impressive given that it was
done by a team of three over a period of three months.


The cores are simple, dual issue RISC with 32 KB of scratch pad and a
network router. There is no cache or coherency protocol. Every core can
read/write every other core's memory so that it can appear as a
distributed, shared memory machine. Non-local accesses are automatically
converted to network calls and sent out over the NoC. Nearest-neighbor
latency is 4 ns for writes and 16 ns for reads. Farthest-neighbor latency
is 16 ns for writes and 30 ns for reads. Routing is east/west, then
north/south. The cores form a 2D mesh. He claims that they can build a
1,024-core chip today if there is demand for it.
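
As a rough illustration of the east/west-then-north/south routing on a 2D
mesh, here is a minimal Python sketch. The (row, col) coordinate convention
and the function name xy_route are my own assumptions for illustration, not
anything from the talk or the Epiphany SDK, and it models only the path, not
the latencies quoted above.

```python
def xy_route(src, dst):
    """Return the list of cores visited from src to dst, each as (row, col).

    Assumed convention (hypothetical): columns run east/west and rows run
    north/south, so the column is corrected first and the row second.
    """
    row, col = src
    dst_row, dst_col = dst
    path = [(row, col)]

    # East/west leg: move along the row until the column matches.
    step = 1 if dst_col > col else -1
    while col != dst_col:
        col += step
        path.append((row, col))

    # North/south leg: move along the column until the row matches.
    step = 1 if dst_row > row else -1
    while row != dst_row:
        row += step
        path.append((row, col))

    return path


if __name__ == "__main__":
    route = xy_route((0, 0), (3, 3))
    print(route)
    print("hops:", len(route) - 1)
```

The hop count is just the Manhattan distance between the two cores, which
is why corner-to-corner traffic is the worst case on a square mesh.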

The initial markets are telecom, military, and medical, and the
applications best suited for it are those that would otherwise need a DSP.
For HPC, they claim 102 GF/s at 2 watts (51 GF/watt), which is almost
exascale class (i.e. 1 EF/s at about 20 MW, ignoring cooling, networks,
etc.). It only has single-precision floating point currently. They can add
double precision given enough demand. Depending on the memory configured
per core, it could provide a double-precision peak performance about
30-40% lower than that of the current board.
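
For anyone who wants to check the efficiency arithmetic, a quick Python
sketch follows. The only inputs are the claimed 102 GF/s and 2 W figures;
the variable names are mine, and like the claim itself it ignores cooling,
networks, and everything else outside the chip.

```python
# Claimed figures from the talk.
peak_gflops = 102.0   # single-precision GF/s
power_watts = 2.0     # chip power in watts

# Efficiency in GF/s per watt.
gflops_per_watt = peak_gflops / power_watts

# Power needed for 1 EF/s (1e9 GF/s) at that efficiency, in megawatts.
exaflop_in_gflops = 1e9
power_for_exaflop_mw = exaflop_in_gflops / gflops_per_watt / 1e6

# Conversely, the rate delivered by a 20 MW budget, in EF/s.
eflops_at_20mw = gflops_per_watt * 20e6 / 1e9

if __name__ == "__main__":
    print(f"{gflops_per_watt:.1f} GF/s per watt")
    print(f"{power_for_exaflop_mw:.1f} MW for 1 EF/s")
    print(f"{eflops_at_20mw:.2f} EF/s at 20 MW")
```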

They support C/C++ and OpenCL. The latter is actually converted to C++,
and C++ itself is constrained by the small amount of memory per core. That
said, if the bulk of your program can fit in under 1,500 lines of C, he
asserts that it will scream.

Lastly, once all the Kickstarter boards go out, they hope to have them
available on Amazon for immediate delivery.

Scott



On Fri, May 23, 2014 at 9:32 AM, Eugen Leitl <eu...@leitl.org> wrote:

>
> After I finally got my Kickstarter backer board and set it
> up to boot (you will need the included heatsink on the Zynq 7020
> as well as a small fan), I ran a few of the included benchmarks.
>
> In no particular order of relevance:
>
> linaro-nano:~/Parallella/epiphany-examples/mesh_bandwidth_all2one> ./run.sh
> 0x0000417e!
> The bandwidth of all-to-one is 4193.00MB/s!
>
>
> linaro-nano:~/Parallella/epiphany-examples/mesh_bandwidth_bisection> ./run.sh
> 0x00000f46!
> The bandwidth of bisection is 9590.00MB/s!
>
> linaro-nano:~/Parallella/epiphany-examples/basic_math> ./run.sh
>
> The clock cycle count for addition is 5.
>
> The clock cycle count for subtraction is 5.
>
> The clock cycle count for multiplication is 6.
>
> The clock cycle count for division is 47.
>
> The clock cycle count for "fmodf()" is 66635.
>
> The clock cycle count for "sinf()" is 23930.
>
> The clock cycle count for "cosf()" is 51115.
>
> The clock cycle count for "sqrtf()" is 93785.
>
> The clock cycle count for "ceilf()" is 18475.
>
> The clock cycle count for "floorf()" is 17690.
>
> The clock cycle count for "log10f()" is 10735.
>
> The clock cycle count for "logf()" is 9976.
>
> The clock cycle count for "powf()" is 348243.
>
> The clock cycle count for "ldexpf()" is 36306.
>
> linaro-nano:~/Parallella/epiphany-examples/matmul-16> ./run.sh
>
> Matrix: C[512][512] = A[512][512] * B[512][512]
>
> Using 4 x 4 cores
>
> Seed = 0.000000
> Loading program on Epiphany chip...
> Writing C[1048576B] to address 00200000...
> Writing A[1048576B] to address 00000000...
> Writing B[1048576B] to address 00100000...
> GO Epiphany! ...   Writing the GO!...
> Done...
> Finished calculating Epiphany result.
> Reading result from address 00200000...
> Calculating result on Host ...   Finished calculating Host result.
> Reading time from address 00300008...
>
> *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
> Verifying result correctness ...   C_epiphany == C_host
> *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
>
> Epiphany -  time:     153.0 msec  (@ 600 MHz)
> Host     -  time:    1867.2 msec  (@ 667 MHz)
>
> * * *   EPIPHANY FTW !!!   * * *
>
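
To put the matmul timings above in flop-rate terms, here is a small Python
sketch. It assumes the usual 2*N^3 operation count for a dense N x N matrix
multiply (one multiply and one add per inner-loop step); that count is my
assumption, not something the benchmark prints.

```python
# Matrix dimension and timings taken from the matmul-16 output above.
n = 512
epiphany_seconds = 153.0e-3   # Epiphany time, 153.0 msec
host_seconds = 1867.2e-3      # Host time, 1867.2 msec

# Assumed operation count: one multiply plus one add per inner-loop step.
flops = 2 * n ** 3

epiphany_gflops = flops / epiphany_seconds / 1e9
host_gflops = flops / host_seconds / 1e9
speedup = host_seconds / epiphany_seconds

if __name__ == "__main__":
    print(f"Epiphany: {epiphany_gflops:.2f} GF/s")
    print(f"Host:     {host_gflops:.2f} GF/s")
    print(f"Speedup:  {speedup:.1f}x")
```

Under that assumption the Epiphany run works out to roughly 1.75 GF/s and
a speedup of about 12x over the host.
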
> I can run the rest of the examples and post numbers if there's
> interest:
>
> linaro-nano:~/Parallella/epiphany-examples> ls -la
> total 152
> drwxrwxr-x 36 linaro linaro 4096 May 22 15:46 ./
> drwxrwxr-x  5 linaro linaro 4096 Mar  7 12:09 ../
> drwxrwxr-x  8 linaro linaro 4096 Mar  6 23:47 .git/
> -rw-rw-r--  1 linaro linaro  227 Mar  6 23:42 .gitignore
> -rw-rw-r--  1 linaro linaro 1464 Mar  6 23:42 README.md
> drwxrwxr-x  4 linaro linaro 4096 May 17 11:47 assembly/
> drwxrwxr-x  4 linaro linaro 4096 Mar  6 23:44 basic_math/
> drwxrwxr-x  4 linaro linaro 4096 Mar  6 23:47 clockgating_mode/
> drwxrwxr-x  4 linaro linaro 4096 May 17 11:48 ctimer/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 dma_2d/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 dma_chain/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 dma_interrupt/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 dma_message_read/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 dma_message_write/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 dma_slave/
> drwxrwxr-x  4 linaro linaro 4096 May 22 15:48 e-dump-mem/
> drwxrwxr-x  4 linaro linaro 4096 May 22 15:46 e-dump-regs/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 e-mem-sync/
> drwxrwxr-x  4 linaro linaro 4096 Mar  6 23:43 e-toggle-led/
> drwxrwxr-x  4 linaro linaro 4096 May 22 12:48 emesh_read_latency/
> drwxrwxr-x  4 linaro linaro 4096 May 22 12:48 emesh_traffic/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 erm/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 erm_example/
> drwxrwxr-x  4 linaro linaro 4096 Mar  6 23:42 fft2d/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 hardware_barrier/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 hardware_loops/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 hello_parallella/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 interrupts/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 link_lowpower_mode/
> drwxrwxr-x  4 linaro linaro 4096 Mar  7 02:04 matmul-16/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 mem_protect/
> drwxrwxr-x  4 linaro linaro 4096 May 23 13:26 mesh_bandwidth_all2one/
> drwxrwxr-x  4 linaro linaro 4096 May 22 12:42 mesh_bandwidth_bisection/
> drwxrwxr-x  4 linaro linaro 4096 May 22 12:41 mesh_bandwidth_neighbour/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 mutex/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 nested_interrupts/
> drwxrwxr-x  3 linaro linaro 4096 Mar  6 23:42 register_test/
> drwxrwxr-x  4 linaro linaro 4096 May 22 12:07 remote_call/
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

