Scalable Distributed Memory Machines

Goal: *Parallel machines that can be scaled to hundreds or thousands of processors.*

- **Design Choices:**
  - Custom-designed or commodity nodes?
  - Network scalability.
  - Capability of node-to-network interface (*critical*).
  - Supporting programming models?

- **What does hardware scalability mean?**
  - Avoids inherent design limits on resources.
  - Bandwidth increases with machine size $P$.
  - Latency should not increase with machine size $P$.
  - Cost should increase slowly with $P$. 
MPPs Scalability Issues

• Problems:
  – Memory-access latency.
  – Interprocess communication complexity or synchronization overhead.
  – Multi-cache inconsistency.
  – Message-passing and message processing overheads.

• Possible Solutions:
  – Fast dedicated, proprietary and scalable, networks and protocols.
  – Low-latency fast synchronization techniques possibly hardware-assisted.
  – Hardware-assisted message processing in communication assists (node-to-network interfaces).
  – Weaker memory consistency models.
  – Scalable directory-based cache coherence protocols.
  – Shared virtual memory.
  – Improved software portability; standard parallel and distributed operating system support.
  – Software latency-hiding techniques.
One Extreme:
Limited Scaling of a Bus

<table>
<thead>
<tr>
<th>Characteristic</th>
<th>Bus</th>
</tr>
</thead>
<tbody>
<tr>
<td>Physical Length</td>
<td>~ 1 ft</td>
</tr>
<tr>
<td>Number of Connections</td>
<td>fixed</td>
</tr>
<tr>
<td>Maximum Bandwidth</td>
<td>fixed</td>
</tr>
<tr>
<td>Interface to Comm. medium</td>
<td>memory inf</td>
</tr>
<tr>
<td>Global Order</td>
<td>arbitration</td>
</tr>
<tr>
<td>Protection</td>
<td>Virt -&gt; physical</td>
</tr>
<tr>
<td>Trust</td>
<td>total</td>
</tr>
<tr>
<td>OS</td>
<td>single</td>
</tr>
<tr>
<td>comm. abstraction</td>
<td>HW</td>
</tr>
</tbody>
</table>

• **Bus:** Each level of the system design is grounded in the scaling limits at the layers below and assumptions of close coupling between components.

Poor Scalability
Another Extreme: Scaling of Workstations in a LAN?

<table>
<thead>
<tr>
<th>Characteristic</th>
<th>Bus</th>
<th>LAN</th>
</tr>
</thead>
<tbody>
<tr>
<td>Physical Length</td>
<td>~ 1 ft</td>
<td>KM</td>
</tr>
<tr>
<td>Number of Connections</td>
<td>fixed</td>
<td>many</td>
</tr>
<tr>
<td>Maximum Bandwidth</td>
<td>fixed</td>
<td>???</td>
</tr>
<tr>
<td>Interface to Comm. medium</td>
<td>memory inf</td>
<td>peripheral</td>
</tr>
<tr>
<td>Global Order</td>
<td>arbitration</td>
<td>???</td>
</tr>
<tr>
<td>Protection</td>
<td>Virt -&gt; physical</td>
<td>OS</td>
</tr>
<tr>
<td>Trust</td>
<td>total</td>
<td>none</td>
</tr>
<tr>
<td>OS</td>
<td>single</td>
<td>independent</td>
</tr>
<tr>
<td>comm. abstraction</td>
<td>HW</td>
<td>SW</td>
</tr>
</tbody>
</table>

- No clear limit to physical scaling, no global order, consensus difficult to achieve.
Bandwidth Scalability

- Depends largely on network characteristics:
  - Channel bandwidth.
  - Static: Topology: Node degree, Bisection width etc.
  - Multistage: Switch size and connection pattern properties.
  - Node-to-network interface capabilities.
Dancehall MP Organization

- Network bandwidth?
- Bandwidth demand?
  - Independent processes?
  - Communicating processes?
- Latency?

Extremely high demands on network in terms of bandwidth, latency even for independent processes.
Generic Distributed Memory Organization

OS Supported?
Network protocols?

Communication Assist
Extent of functionality?

Global virtual
Shared address space?

Message transaction
DMA?

Node:
O(10) Bus-based SMP

Custom-designed CPU?
Node/System integration level?
How far? Cray-on-a-Chip?
SMP-on-a-Chip?

Multi-stage interconnection network (MIN)?
Custom-designed?

• Network bandwidth?
• Bandwidth demand?
  – Independent processes?
  – Communicating processes?
• Latency? \( O(\log_2 P) \) increase?
• Cost scalability of system?
Key System Scaling Property

- Large number of independent communication paths between nodes.
  => Allow a large number of concurrent transactions using different channels.
- Transactions are initiated independently.
- No global arbitration.
- Effect of a transaction only visible to the nodes involved
  - Effects propagated through additional transactions.
Network Latency Scaling

- $T(n) = \text{Overhead} + \text{Channel Time} + \text{Routing Delay}$
- Scaling of overhead?
- Channel Time($n$) = $n/B$ --- BW at bottleneck
- RoutingDelay($h,n$)
Network Latency Scaling Example

\( O(\log_2 n) \) Stage MIN using switches:

- **Max distance:** \( \log_2 n \)
- **Number of switches:** \( \alpha n \log n \)
- overhead = 1 us, BW = 64 MB/s, 200 ns per hop
- Using pipelined or cut-through routing:
  - \( T_{64}(128) = 1.0 \text{ us} + 2.0 \text{ us} + 6 \text{ hops} \times 0.2 \text{ us/hop} = 4.2 \text{ us} \)
  - \( T_{1024}(128) = 1.0 \text{ us} + 2.0 \text{ us} + 10 \text{ hops} \times 0.2 \text{ us/hop} = 5.0 \text{ us} \)

**Only 20% increase in latency for 16x size increase**

- **Store and Forward**
  - \( T_{64}^{sf}(128) = 1.0 \text{ us} + 6 \text{ hops} \times (2.0 + 0.2) \text{ us/hop} = 14.2 \text{ us} \)
  - \( T_{1024}^{sf}(128) = 1.0 \text{ us} + 10 \text{ hops} \times (2.0 + 0.2) \text{ us/hop} = 23 \text{ us} \)
Cost Scaling

• cost(p,m) = fixed cost + incremental cost (p,m)
• Bus Based SMP?
• Ratio of processors : memory : network : I/O ?

• Parallel efficiency(p) = Speedup(P) / P

• Similar to speedup, one can define:
  Costup(p) = Cost(p) / Cost(1)

• Cost-effective: speedup(p) > costup(p)
Cost Effective?

2048 processors: 475 fold speedup at 206x cost

\[ \text{Speedup} = \frac{P}{1 + \log P} \]

\[ \text{Costup} = 1 + 0.1 P \]
# Parallel Machine Network Examples

<table>
<thead>
<tr>
<th>Machine</th>
<th>Topology</th>
<th>Cycle Time (ns)</th>
<th>Channel Width (bits)</th>
<th>Routing Delay (cycles)</th>
<th>Flit (data bits)</th>
</tr>
</thead>
<tbody>
<tr>
<td>nCUBE/2</td>
<td>Hypercube</td>
<td>25</td>
<td>1</td>
<td>40</td>
<td>32</td>
</tr>
<tr>
<td>TMC CM-5</td>
<td>Fat-Tree</td>
<td>25</td>
<td>4</td>
<td>10</td>
<td>4</td>
</tr>
<tr>
<td>IBM SP-2</td>
<td>Banyan</td>
<td>25</td>
<td>8</td>
<td>5</td>
<td>16</td>
</tr>
<tr>
<td>Intel Paragon</td>
<td>2D Mesh</td>
<td>11.5</td>
<td>16</td>
<td>2</td>
<td>16</td>
</tr>
<tr>
<td>Meiko CS-2</td>
<td>Fat-Tree</td>
<td>20</td>
<td>8</td>
<td>7</td>
<td>8</td>
</tr>
<tr>
<td>CRAY T3D</td>
<td>3D Torus</td>
<td>6.67</td>
<td>16</td>
<td>2</td>
<td>16</td>
</tr>
<tr>
<td>DASH</td>
<td>Torus</td>
<td>30</td>
<td>16</td>
<td>2</td>
<td>16</td>
</tr>
<tr>
<td>J-Machine</td>
<td>3D Mesh</td>
<td>31</td>
<td>8</td>
<td>2</td>
<td>8</td>
</tr>
<tr>
<td>Monsoon</td>
<td>Butterfly</td>
<td>20</td>
<td>16</td>
<td>2</td>
<td>16</td>
</tr>
<tr>
<td>SGI Origin</td>
<td>Hypercube</td>
<td>2.5</td>
<td>20</td>
<td>16</td>
<td>160</td>
</tr>
<tr>
<td>Myricom</td>
<td>Arbitrary</td>
<td>6.25</td>
<td>16</td>
<td>50</td>
<td>16</td>
</tr>
</tbody>
</table>
Physical Scaling

• Chip-level integration:
  – Integrate network interface, message router I/O links.
    • nCUBE/2, Alpha 21364, IBM Power 4
  – IRAM-style Cray-on-a-Chip: V-IRAM
  – Memory/Bus controller/chip set: Alpha 21364
  – SMP on a chip: Chip Multiprocessor (CMP): IBM Power 4

• Board-level:
  – Replicating using standard microprocessor cores.
    • CM-5 replicated the core of a Sun SparkStation 1 workstation.
    • Cray T3D and T3E replicated the core of a DEC Alpha workstation.

• System level:
  • IBM SP-2 uses 8-16 almost complete RS6000 workstations placed in racks.
Chip-level integration Example:

nCUBE/2 Machine Organization

- Entire machine synchronous at 40 MHz

- 64 nodes socketed on a board

- 13 links up to 8096 nodes possible

- 500,000 transistors (considered large at the time)
Chip-level integration Example:
Vector Intelligent RAM 2 (V-IRAM-2)

Projected 2003.  < 0.1 µm,  > 2 GHz
16 GFLOPS(64b)/64 GOPS(16b)/128MB

2-way Superscalar Processor

Memory Crossbar Switch

Vector Registers

8 K I cache
8 K D cache

Load/Store

I/O

Vector Instruction Queue

8 x 64
8 x 64
8 x 64
8 x 64
8 x 64

8 x 64
or
16 x 32
or
32 x 16
Chip-level integration Example:

- Alpha 21264 core with enhancements
- Integrated Direct RAMbus memory controller:
  - 800 MHz operation, 30ns CAS latency pin to pin, 6 GB/sec read or write bandwidth
  - Directory based cache coherence
- Integrated network interface:
  - Direct processor-to-processor interconnect, 10 GB/second per processor
  - 15ns processor-to-processor latency, Out-of-order network with adaptive routing
  - Asynchronous clocking between processors, 3 GB/second I/O interface per processor
Chip-level integration Example:
A Possible Alpha 21364 System
Chip-level integration Example:

IBM Power 4  CMP

- Two tightly-integrated 1GHz CPU cores per 170 Million Transistor chip.
- 128KB L₁ Cache per processor
- 1.5 MB On-Chip Shared L₂ Cache
- External 32MB L₃ Cache: Tags kept on chip.
- 35 Gbytes/s Chip-to-Chip interconnects.
Chip-level integration Example:

IBM Power 4

- CPU 1: >1 GHz
- CPU 2: >1 GHz
- L2 Cache
- L3 Cache
- Main Memory
- Chip-to-chip Interconnect
- >333 MHz, >10 GBytes/s
- >500 MHz, >35 GBytes/s
- >500 MHz, Wave-pipelined Expansion Buses (>10 GBytes/s)
Chip-level integration Example:

IBM Power 4 MCM

[Diagram of IBM Power 4 MCM with labeled components such as Memory, L3, Expansion Buses, Power4, CPU1, CPU2, L2, L3 Dir, and MCM.]
Board-level integration Example: CM-5 Machine Organization

Fat Tree

Design replicated the core of a Sun SparkStation 1 workstation
System Level Integration Example:

IBM SP-2

8-16 almost complete RS6000 workstations placed in racks.
Realizing Programming Models: Realized by Protocols

CAD | Database | Scientific modeling | Parallel applications
---|---|---|---
Multiprogramming | Shared address | Message passing | Data parallel
Compilation or library | Operating systems support | Programming models
Communication abstraction | User/system boundary
Hardware/software boundary
Communication hardware
Physical communication medium
Network Transactions
Challenges in Realizing Prog. Models in Large-Scale Machines

• No global knowledge, nor global control.
  – Barriers, scans, reduce, global-OR give fuzzy global state.
• Very large number of concurrent transactions.
• Management of input buffer resources:
  – Many sources can issue a request and over-commit destination before any see the effect.
• Latency is large enough that one is tempted to “take risks”:
  – Optimistic protocols.
  – Large transfers.
  – Dynamic allocation.
• Many more degrees of freedom in design and engineering of these system.
Key Design Issue:

• How much interpretation of the message by CA without involving the CPU?

• How much dedicated processing in the CA?
Spectrum of Designs

None: Physical bit stream
  – blind, physical DMA nCUBE, iPSC, . . .

User/System
  – User-level port CM-5, *T

Remote virtual address
  – Processing, translation Paragon, Meiko CS-2

Global physical address
  – Proc + Memory controller RP3, BBN, T3D

Cache-to-cache
  – Cache controller Dash, KSR, Flash
No CA Net Transactions Interpretation: Physical DMA

- DMA controlled by regs, generates interrupts.
- Physical => OS initiates transfers.
- Send-side:
  - Construct system “envelope” around user data in kernel area.
- Receive:
  - Must receive into system buffer, since no message interpretation in CA.
nCUBE/2 Network Interface

- Independent DMA channel per link direction
  - Leave input buffers always open.
  - Segmented messages.
- Routing determines if message is intended for local or remote node
  - Dimension-order routing on hypercube.
  - Bit-serial with 36 bit cut-through.
DMA In Conventional LAN

Network Interfaces

Host Memory

NIC

Data

Addr  Len
Status  Next

Addr  Len
Status  Next

Addr  Len
Status  Next

Addr  Len
Status  Next

Addr  Len
Status  Next

Addr  Len
Status  Next

nic

TX

RX

addr

len

DMA

mem bus

IO Bus

Proc

EECC756 - Shaaban

EECC756 - Shaaban

#30 lec # 13 Spring2002 5-2-2002
User-Level Ports

- Initiate transaction at user level.
- CA interprets delivers message to user without OS intervention.
- Network port in user space.
- User/system flag in envelope.
  - Protection check, translation, routing, media access in source CA
  - User/sys check in destination CA, interrupt on system.
User-Level Network Example: **CM-5**

- Two data networks and one control network.
- Input and output FIFO for each network.
- Tag per message:
  - Index Network Interface (NI) mapping table.
- *T integrated NI on chip.
- Also used in iWARP.

<table>
<thead>
<tr>
<th>Os</th>
<th>50 cy</th>
<th>1.5 us</th>
</tr>
</thead>
<tbody>
<tr>
<td>Or</td>
<td>53 cy</td>
<td>1.6 us</td>
</tr>
<tr>
<td>interrupt</td>
<td>10us</td>
<td></td>
</tr>
</tbody>
</table>
User-Level Handlers

• Tighter integration of user-level network port with the processor at the register level.
• Hardware support to vector to address specified in message
  – message ports in registers.
• Nodes integrate communication with computation on systolic basis.
• Message data direct to register.
• Stream into memory.
Dedicated Message Processing Without Specialized Hardware Design

General Purpose processor performs arbitrary output processing (at system level)
General Purpose processor interprets incoming network transactions (at system level)
User Processor <-> Message Processor share memory
Message Processor <-> Message Processor via system network transaction

Node: Bus-based SMP

Message Processor

Network

Mem

P

User

System

MP

Mem

P

User

System

NI

rest

MP

Message Processor

EECC756 - Shaaban
• User Processor stores cmd / msg / data into shared output queue.
  – Must still check for output queue full (or make elastic).
• Communication assists make transaction happen.
  – Checking, translation, scheduling, transport, interpretation.
• Effect observed on destination address space and/or events.
• Protocol divided between two layers.
Example: Intel Paragon

- **Network**
  - **I/O Nodes**
  - **Service**
  - **I/O Nodes**
  - **Devices**
  - **Devices**

- **Mem**
  - **NI**
  - **2048 B**
  - **16**
  - **175 MB/s Duplex**
  - **400 MB/s**

- **EOP**
  - **MP handler**
  - **Var data**

- **i860xp**
  - **50 MHz**
  - **16 KB**
  - **4-way**
  - **32B Block**
  - **MESI**

- **sDMA**
  - **tDMA**

---

EECC756 - Shaaban
Message Processor Events

- User Output
- Queues
- Compute Processor
- Kernel
- System Event
- Dispatcher
- DMA done
- Send DMA
- Rcv DMA
- Rcv FIFO
  ~Full
- Send FIFO
  ~Empty
• Concurrency Intensive
  – Need to keep inbound flows moving while outbound flows stalled.
  – Large transfers segmented.
• Reduces overhead but adds latency.