On 06/21/2016 05:14 AM, Remy Dernat wrote:
Hi,

100 PF is really not far from reality right now:
http://www.top500.org/news/new-chinese-supercomputer-named-worlds-fastest-system-on-latest-top500-list/


I was curious about the CPU/architecture and I found:
  http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf

I do wonder whether they will sell this CPU on the open market, and how hard it would be to port normal Linux+MPI codes to it.

My quick summary, in case it's of interest. This seems like a pretty novel design. It's a bit odd that they treat a socket as a node: each socket has 4 core groups, each with access to its own 8GB of memory. So while Sunway describes a single node with 32GB of RAM, the normal terminology would call it four 8GB nodes in a single socket.

Cluster:
* 40 racks
* 93 PFlop/sec
* 74.16% efficiency (much better than ORNL Titan and NUDT Tianhe-2)
* 1024 nodes per rack
* 40960 nodes total
* 6 Gflops/watt (around 3x anything in the top 6)
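
As a rough cross-check of the efficiency figure, here is a small sketch that rebuilds the theoretical peak from the node count and the 3.06 Tflop/sec per-node peak listed in the node summary of this post. The function and constant names are mine, purely for illustration. It comes out at roughly 125 PFlop/sec peak and about 74% efficiency, close to the quoted 74.16%; the small difference is presumably rounding in the inputs.

```python
# Figures quoted in this post.
RACKS = 40
NODES_PER_RACK = 1024
NODE_PEAK_TFLOPS = 3.06   # per-node peak, from the node summary
LINPACK_PFLOPS = 93.0     # measured Linpack result


def cluster_peak_pflops(racks, nodes_per_rack, node_peak_tflops):
    """Theoretical peak of the whole machine in PFlop/sec."""
    return racks * nodes_per_rack * node_peak_tflops / 1000.0


def linpack_efficiency(linpack_pflops, peak_pflops):
    """Fraction of theoretical peak achieved on Linpack."""
    return linpack_pflops / peak_pflops


peak = cluster_peak_pflops(RACKS, NODES_PER_RACK, NODE_PEAK_TFLOPS)
eff = linpack_efficiency(LINPACK_PFLOPS, peak)
print(f"peak ~ {peak:.1f} PFlop/sec, efficiency ~ {eff:.1%}")
```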

Physical layout:
* 4 core groups per socket
* 1 socket per node
* two nodes per card
* four cards per board
* 32 boards per supernode
* 4 supernodes per rack
* 40 racks in cluster
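
The layout numbers multiply out consistently. A minimal sketch, using only the counts from this post (2 nodes per card up through 40 racks, 260 cores and 32GB per node); the helper name and the level labels are mine:

```python
# Packaging hierarchy as listed above, innermost level first.
LAYOUT = [
    ("card", 2),        # nodes per card
    ("board", 4),       # cards per board
    ("supernode", 32),  # boards per supernode
    ("rack", 4),        # supernodes per rack
    ("cluster", 40),    # racks in cluster
]
CORES_PER_NODE = 260    # 4 MPE + 4x64 CPE
MEM_PER_NODE_GB = 32    # 4 core groups x 8GB


def nodes_per_level(layout):
    """Return {level: number of nodes contained in one unit of that level}."""
    counts = {}
    nodes = 1
    for level, fanout in layout:
        nodes *= fanout
        counts[level] = nodes
    return counts


counts = nodes_per_level(LAYOUT)
total_nodes = counts["cluster"]
total_cores = total_nodes * CORES_PER_NODE
total_mem_gb = total_nodes * MEM_PER_NODE_GB

for level, n in counts.items():
    print(f"nodes per {level}: {n}")
print(f"total cores: {total_cores}, total memory: {total_mem_gb} GB")
```

This gives 256 nodes per supernode, 1024 per rack and 40960 in total, matching the cluster summary, plus a bit over 10.6 million cores and about 1.3 PB of memory.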

Network:
* 70TB/sec bisection bandwidth
* nodes connected using PCIe 3.0 connections
* supernode contains 256 nodes
* network diameter of 7
* node MPI bandwidth of 12GB/sec and a latency of about 1us.

Each rack:
  * 4 supernodes
  * 256 nodes per supernode
  * total 1024 nodes

Each node has:
* 3.06 Tflop/sec
* 1 socket, 260 cores (4 MPE and 4x64 CPE)
* 4 Core groups, each with:
  + 8x8 grid of cores (CPEs)
  + own memory space managed by MPE (management processing element)
  + 1 management CPU (MPE)
  + access to 8GB of DDR3 memory
* 4 128 bit memory controllers (DDR3-2133), each connected to 8GB of DDR3,
  total theoretical peak = 136.51GB/sec per chip.
* Network on chip (NoC) - bidirectional bandwidth of 16GB/sec to network, around
  1us latency.
* 6 Gflops/watt for processor, memory, and interconnect.
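
The 136.51GB/sec memory figure can be rebuilt from the controller count, bus width and DDR3-2133 transfer rate. A quick sketch, assuming 2133 million transfers per second per controller (my reading of "DDR3-2133"); it also works out bytes per flop against the 3.06 Tflop/sec node peak, which is a ratio I'm adding and not something stated in the report summary above:

```python
CONTROLLERS = 4
BUS_WIDTH_BITS = 128
TRANSFERS_PER_SEC = 2133e6   # assumption: DDR3-2133 = 2133 MT/s
NODE_PEAK_TFLOPS = 3.06      # per-node peak from this post


def peak_mem_bandwidth_gbs(controllers, bus_width_bits, transfers_per_sec):
    """Theoretical peak memory bandwidth in GB/sec (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return controllers * bytes_per_transfer * transfers_per_sec / 1e9


bw = peak_mem_bandwidth_gbs(CONTROLLERS, BUS_WIDTH_BITS, TRANSFERS_PER_SEC)
bytes_per_flop = bw / (NODE_PEAK_TFLOPS * 1000.0)
print(f"memory bandwidth ~ {bw:.2f} GB/sec")
print(f"bytes per flop ~ {bytes_per_flop:.3f}")
```

That works out to roughly 0.045 bytes per flop, so codes that are memory-bound on a normal cluster would presumably have a hard time getting near peak here.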

Each management core (4 per chip, one per 64 CPEs) has:
  * 64 bit RISC OoO core
  * 256 bit vector instruction
  * 32KB L1i / 32KB L1d
  * 256KB L2
  * 16 flops/cycle

Each CPE (256 per chip) has:
  * 64 bit RISC OoO core
  * supports only user mode
  * 256 bit vector instruction
  * 16KB L1i
  * 64KB scratch pad memory SPM
  * 8 double-precision flops/cycle per core (6 at Linpack)
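
The per-core flops/cycle numbers and the 3.06 Tflop/sec node peak together imply a clock rate. A small sketch of that back-calculation, using only figures from this post (256 CPEs at 8 flops/cycle, 4 MPEs at 16 flops/cycle); the clock it prints is derived, not quoted, and the names are mine:

```python
CPES_PER_NODE = 256
MPES_PER_NODE = 4
CPE_FLOPS_PER_CYCLE = 8
MPE_FLOPS_PER_CYCLE = 16
CPE_LINPACK_FLOPS_PER_CYCLE = 6
NODE_PEAK_FLOPS = 3.06e12    # 3.06 Tflop/sec per node


def flops_per_cycle_per_node():
    """Peak double-precision flops issued per clock cycle by one node."""
    return (CPES_PER_NODE * CPE_FLOPS_PER_CYCLE
            + MPES_PER_NODE * MPE_FLOPS_PER_CYCLE)


def implied_clock_ghz(node_peak_flops):
    """Clock rate (GHz) that would yield the given node peak."""
    return node_peak_flops / flops_per_cycle_per_node() / 1e9


ghz = implied_clock_ghz(NODE_PEAK_FLOPS)
cpe_linpack_fraction = CPE_LINPACK_FLOPS_PER_CYCLE / CPE_FLOPS_PER_CYCLE
print(f"flops/cycle per node: {flops_per_cycle_per_node()}")
print(f"implied clock ~ {ghz:.2f} GHz")
print(f"CPE Linpack fraction of peak: {cpe_linpack_fraction:.0%}")
```

The implied clock is about 1.45 GHz, and 6 of 8 flops/cycle at Linpack is 75%, which lines up with the roughly 74% whole-machine efficiency.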

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
