## CSE 4351 Notes 5: Interconnection Networks and Routing

Two Major Approaches to Processor/Memory Interconnects:


Indirect (Dynamic)


Direct (Static)

Graph-theoretic Properties:
Bisection width = minimum number of "wires" that must be removed to get two "halves"
Diameter $=$ maximum distance between any pair of processors
Bounded degree $=$ all networks in the class have vertex-degree bounded by a constant
Symmetry = number of automorphisms, number of vertex equivalence classes (vecs)
Rules of Thumb: High bisection width and low diameter are good. Bounded degree networks are easier to build. High symmetry - simpler code, complicated to build

Linear array


Bisection width $=1$
Diameter $=\mathrm{N}-1$
Automorphisms $=2$, vecs $=\operatorname{ceil}(\mathrm{N} / 2)$
Convenient for VLSI signal processing and bit-level integer multiplication. The convolution of two polynomials $h(x)=$ $f(x) g(x)$ of $f(x)=a_{0}+a_{1} x+a_{2} x^{2}+\ldots+a_{n-1} x^{n-1}$ and $g(x)=b_{0}+b_{1} x+b_{2} x^{2}+\ldots+b_{n-1} x^{n-1}$ may be computed on a $2 n-1$ node linear array by inputting $\mathrm{a}_{\mathrm{n}-1}, \mathrm{a}_{\mathrm{n}-2}, \mathrm{a}_{\mathrm{n}-3} \ldots \mathrm{a}_{0}$ into the left end at the odd numbered steps and $\mathrm{b} 0, \mathrm{~b} 1, \mathrm{~b} 2 \ldots \mathrm{~b}_{\mathrm{n}-1}$ at the odd numbered steps. When an a and a b value arrive at a node, the values are multiplied and added to the value which will be output as a coefficient for $\mathrm{h}(\mathrm{x})$

Trace for $\mathrm{n}=4$ :


## INCREDIBLY EASY TO PERFORM ROUTING!!!

Ring - symmetric version of linear array. To avoid long wraparound connection, the following arrangement may be used:


Bisection width $=2$, Diameter $=\mathrm{N} / 2$, Automorphisms $=2 \mathrm{~N}$, vecs $=1$
Mesh


2-d most common, can have higher dimension
Bisection width $=\min (\mathrm{k}, \mathrm{N})$ if $\max (\mathrm{k}, \mathrm{N})$ is even and $\min (\mathrm{k}, \mathrm{N})+1$, otherwise. Diameter $=\mathrm{k}+\mathrm{N}-2$
Bisection width for higher dimension array:
Suppose dimensions are $2 \leq \mathrm{N}_{1} \leq \mathrm{N}_{2} \leq \ldots \leq \mathrm{N}_{\mathrm{r}}$.
If $\mathrm{N}_{\mathrm{r}}$ is even, then bisection width is $\mathrm{N}_{1} \bullet \mathrm{~N}_{2} \bullet \ldots \bullet \mathrm{~N}_{\mathrm{r}-1}$
If $\mathrm{N}_{\mathrm{r}}$ is odd, then bisection width is $\mathrm{N}_{1} \bullet \mathrm{~N}_{2} \bullet \ldots \bullet \mathrm{~N}_{\mathrm{r}-1}+$ bisection width of array $\mathrm{N}_{1} \times \mathrm{N}_{2} \times \ldots \times \mathrm{N}_{\mathrm{r}-1}$ Automorphisms $=8$ and vecs $\approx \mathrm{N}^{2} / 8$ when $\mathrm{N}=\mathrm{k}$


ROUTING IS MORE DIFFICULT THAN LINEAR ARRAY. BOTTLENECKS AROUND CENTER OF MESH.
Sorting can be used to handle routing (along with "perfect matching" in graphs). Can also use randomized approaches such as Valiant-Brebner routing.

2-d Torus - adds wrap-around connections between $\mathrm{P}(\mathrm{i}, \mathrm{N}-1)$ and $\mathrm{P}(\mathrm{i}, 0)$, also $\mathrm{P}(\mathrm{k}-1, \mathrm{j})$ and $\mathrm{P}(0, \mathrm{j})$.
N -by- N torus $(\mathrm{N}>4)$ has bisection width $=$ if N even then 2 N else $2(\mathrm{~N}+1)$,
diameter $=$ if N even then N else $\mathrm{N}-1,8 \mathrm{~N}^{2}$ automorphisms, 1 vec
Complete Binary Tree


Bisection Width $=1$ Diameter $=2 * \lg (\mathrm{~N}+1)-2$ Automorphisms $=2^{2^{\mathrm{h}}-1}$, vecs $=\mathrm{h}+1$ (h=height, which is 3 in example)
Often the internal nodes are used just for communication
Leaf selection - number leaves from 0 to $2^{\mathrm{k}}-1$, can switch path to leaf by sending MSB first followed by decreasing significance bits.

## Butterfly

$2^{\mathrm{k}}$ rows and $\mathrm{k}+1$ columns
Processors in each row are connected as a linear array
Bisection width $=2^{\mathrm{k}}$ (just remove highest dimension edges)
Automorphisms $=2^{2^{\mathrm{k}+1}-1} \quad$ vecs $=(\mathrm{k}+1) / 2$
Processor $P[i, j]$ is also connected to processor in next column via complementing the $(j+1)$-st most significant bit


Fat tree - generalization of the butterfly that has been used in several commercial systems.
Each non-root node has two parents, but each non-leaf node has degree children.
The depth of the fat tree indicates the number of levels past the root level (level 0).
Level $\mathrm{i}(0 \leq \mathrm{i} \leq$ depth $)$ has vertices labeled $(\mathrm{i}, \mathrm{j}, \mathrm{k})$ where $0 \leq \mathrm{j}<$ degree $^{\mathrm{i}}$ and $0 \leq \mathrm{k}<2^{\text {depth- }}$.

Non-leaf vertex ( $\mathrm{i}, \mathrm{j}, \mathrm{k}$ ), $\mathrm{i}<$ depth has children $(\mathrm{i}+1, \mathrm{j} \bullet$ degree $+\mathrm{p}, \mathrm{k} / 2$ ) where $0 \leq \mathrm{p}<$ degree.
Example: depth $=3$ and degree $=2$ :


Example: depth $=4$ and degree $=2$ :


Hypercube
Squashes together each row of butterfly, giving $2{ }^{\mathrm{k}}$ processors
Each processor is connected to k others - by complementing exactly one bit



Note: Cube connection results from replacing each vertex by a ring of $k$ vertices
Bisection width $=2^{\mathrm{k}-1}$ Automorphisms $=\mathrm{k}!2^{\mathrm{k}}$ vecs $=1$
Derivation of automorphisms:

1. Any one of the $2^{\mathrm{k}}$ vertices may be mapped to vertex 0 .
2. Any dimension is equivalent to any other, since there are $k$ dimensions there are $k$ ! permutations.

Hamiltonian cycle of processors - use reflected Gray code (address $=\mathrm{i}$ xor ( $\mathrm{i} \gg 1$ ) trick)

| 1-bit: | 2-bits: | 3-bits: | 4-bits: |
| :---: | :---: | :---: | :---: |
| 0 | 00 | 000 | 0000 |
| 1 | 01 | 001 | 0001 |
|  | ---- | 011 | 0011 |
|  | 11 | 010 | 0010 |
|  | 10 | -------- | 0110 |
|  |  | 110 | 0111 |
|  |  | 111 | 0101 |
|  |  | 101 | 0100 |
|  |  | 100 | ---------- |
|  |  |  | 1100 |
|  |  |  | 1101 |
|  |  |  | 1111 |
|  |  |  | 1110 |
|  |  |  | 1010 |
|  |  |  | 1011 |
|  |  |  | 1001 |
|  |  |  | 1000 |

Benes Network: A multi-stage switching network variation of the butterfly/hypercube with $(2 \mathrm{k}+1) 2^{\mathrm{k}}$ binary switches. We look at this network to get an initial understanding of static (permutation) routing and then examine the same concepts for hypercubes.


## Commonly written as



Switches can be set to give an "edge-disjoint" path for any permutation. For permutation $\left(\begin{array}{lllllll}0 & 1 & 2 & 3 & 4 & 5 & 6 \\ 5 & 3 & 4 & 7 & 0 & 1 & 2\end{array}\right)$ we get:


The existence of a switch setting for each permutation can be proven by viewing the Benes network as a recursive structure:


If we can successfully route a n-permutation problem to the upper and lower networks, we then have two $\mathrm{n} / 2$-permutation problems. 2-permutations are trivially routed. n-permutation problems are always solvable based on Hall's Matching Theorem:

A 2 N -node bipartite graph $\mathrm{G}=(\mathrm{U}, \mathrm{V}, \mathrm{E})$ has a perfect matching if and only if for all subsets $\mathrm{S} \subseteq \mathrm{U},|\mathrm{N}(\mathrm{S})| \geq|\mathrm{S}|$, where $\mathrm{N}(\mathrm{S})$ denotes the nodes in V that are adjacent to a node in S .

Corollary: If all vertices in a bipartite graph are incident on $k$ edges, then there are $k$ disjoint perfect matchings. (The set of $k$ matchings may not be unique. The number of perfect matchings is known as the permanent of the binary adjacency matrix for the graph.)

Translation:
bipartite: two-colorable
U : Nodes of color 1
V: Nodes of color 2
Perfect matching: Set of edges that will include all vertices, but no vertex is on two of the edges
Condition will be satisfied by routing graph, since each node is incident to two edges.
All "outside" switches have two packets to be routed to the other side.
Example: $\left(\begin{array}{lllllll}0 & 1 & 2 & 3 & 4 & 5 & 6 \\ 5 & 3 & 4 & 7 & 0 & 1 & 2\end{array}\right)$


This gives (dark lines in graph correspond to using upper network)

and an upper routing problem of $\left(\begin{array}{llll}0 & 4 & 3 & 6 \\ 5 & 0 & 7 & 2\end{array}\right)$ and a lower routing problem of $\left(\begin{array}{llll}1 & 5 & 2 & 7 \\ 3 & 1 & 4 & 6\end{array}\right)$

The matching problems for these are:


The recursive structure views these two problems as separate, but there is no difficulty in solving them as a single bipartite matching problem. We now have:


The remaining four subnetworks (single switches) are trivial.

## FINDING A MAXIMUM BIPARTITE MATCHING

Even though trial-and-error is usually sufficient for small problems, the notion of an alternating path facilitates the task. Hopcroft and Karp developed an extremely efficient depth-first-search algorithm of this concept that runs in $\mathrm{O}(|\mathrm{E}| \sqrt{ }(|\mathrm{V}|))$ time, but it is not suitable for "hand tracing". (Even more remarkably, Micali and Vazirani obtained the same bound for matching in general graphs).

The algorithm is based on incrementally increasing the size of the matching. The algorithm starts by using an initial deficient matching (a single edge is fine or we may greedily attempt to insert each edge in the matching without backtracking by removing a vertex). We then search for a path with the following properties:

1. The starting and terminating vertices are different and are not included in the previous matching.
2. The path alternates between $\mathrm{k}+1$ edges that are not in the matching and k edges that are in the matching, i.e.

3. The new matching is obtained by removing the edges from the previous matching that are on the alternating path and then including the alternating path edges that were not in the previous matching.

NOTE: Often a simple alternating path is just a single edge between two vertices not in the previous matching!
Sequential code and sample input files are in files bipartiteMatch*.
Example:



The same ideas apply to a butterfly/hypercube, but will go both ways on an edge at different times:


Pairs of processors act as switches in each layer. First (leftmost) level: $(0,4),(1,5),(2,6),(3,7)$ Second layer: $(0,2),(1,3),(4,6),(5,7)$ Interpretation:


Permutation routing is again based on perfect matching. In each pair, one processor takes upper, the other takes lower. Gives a vertex-disjoint path.

Example: Route $\left(\begin{array}{lllllll}0 & 1 & 2 & 3 & 4 & 5 & 6 \\ 3 & 7 & 6 & 0 & 2 & 1 & 4\end{array}\right)$

## Routing Graph:



Paths:

Upper routing problem is: $\left(\begin{array}{llll}0 & 5 & 2 & 3 \\ 3 & 1 & 6 & 0\end{array}\right)$. Lower routing problem is: $\left(\begin{array}{lll}4 & 1 & 6 \\ 2 & 7 \\ 2 & 7 & 4\end{array} 5\right)$.

## Routing graphs:



Corresponding paths:


Final paths are trivial:


Static Routing for Meshes
Given an arbitrary permutation for packets on a 2-d mesh, perfect matching (as in hypercubic networks) may be used to achieve a static routing. The algorithm will give a $3 \mathrm{~N}-3$ step strategy for routing a given permutation on a $\mathrm{N} x \mathrm{~N}$ mesh. The result can be generalized to multidimensional meshes. The algorithm has three phases, only the first phase must be precomputed:

Phase 1: Permute the packets within each column so that at most one packet in each row is destined for each column via perfect matching

Phase 2: Route each packet within its row to the correct column
Phase 3: Route each packet within its column to its final destination
Example:
Packets are initially located as follows, with destination indicated:

|  | 0 | 1 | 2 |
| :---: | :---: | :---: | :---: |
| 0 | 2,1 | 1,2 | 0,2 |
| 1 | 2,2 | 0,0 | 1,0 |
| 2 | 0,1 | 2,0 | 1,1 |
|  |  |  |  |

The routing graph has one edge per packet, based on the starting column and the destination column for each:


Three perfect matchings are then derived. All packets in the ith matching are routed to row i in the first phase.


Match 1


Match 2


Match 3

Thus, phase 1 will give:

|  | 0 | 1 | 2 |
| :---: | :---: | :---: | :---: |
| 0 | 2,1 | 1,2 | 1,0 |
| 1 | 0,1 | 0,0 | 0,2 |
| 2 | 2,2 | 2,0 | 1,1 |
|  |  |  |  |

Phase 2 routes within rows according to column destinations:

|  | 0 | 1 | 2 |
| :---: | :---: | :---: | :---: |
| 0 | 1,0 | 2,1 | 1,2 |
| 1 | 0,0 | 0,1 | 0,2 |
| 2 | 2,0 | 1,1 | 2,2 |
|  |  |  |  |

Phase 3 routes within columns according to row destinations:

|  | 0 | 1 | 2 |
| :---: | :---: | :---: | :---: |
| 0 | 0,0 | 0,1 | 0,2 |
| 1 | 1,0 | 1,1 | 1,2 |
| 2 | 2,0 | 2,1 | 2,2 |
|  |  |  |  |

## Greedy ("Shortest Path") Routing

Routing decisions are made on-the-fly in a simple way
Linear Array - Completely solved by greedy approach
Special case - each processor is the destination for a single packet, but a processor may be the source for multiple packets

Inverse problem - each processor is the source for a single packet, but a processor may be the destination for multiple packets

Modification - always choose the packet that has the farthest to go, still takes no more than n-1 steps


Mesh (n-by-n)
Algorithm: Linear routing in x -dimension, followed by linear routing in y -dimension:

Queueing: For a given edge, choose the packet which must go the farthest in that dimension
Routing for y -dimension takes $\mathrm{n}-1$ steps using result from linear arrays.
The linear array result also applies when each node starts with one packet, but a node may receive multiple packets - like in the x -dimension

Takes $2 \mathrm{n}-2$ steps, but may queue almost $2 \mathrm{n} / 3$ packets, consider the situation:


Each of the three portions of the first two rows has about $n / 3$ elements destined for column $n / 3$. At mesh node ( $2, \mathrm{n} / 3$ ), for each element that leaves (out the bottom) two additional elements will be queued

## Hypercube

Algorithm: Processor scans low-order to high-order comparing processor address and the destination address until the a mismatch is found, then send out that edge

Example: Suppose processors with addresses of form $0 \ldots 0,\langle x\rangle$ send to processors with addresses of form < $\mathrm{y}>, 0 \ldots 0(0 \ldots 0,<\mathrm{x}>$, and < $\mathrm{y}>$ each have k bits).

PROBLEM: All $\sqrt{ } \mathrm{N}$ packets must traverse processor $0 . . .0,0 \ldots 0$
General solution to congestion: Valiant-Brebner Routing

1. Packet is greedily routed to a random intermediate destination.
2. Packet is greedily routed from intermediate destination to the real destination.

## Switching Techniques

Circuit switching: Physical switches are set to reserve entire path for full bandwidth.
Store-and-forward (packet switching): A packet of a message is forwarded to next processor on path only after the entire packet has been received. Path is not established up front.

Virtual cut-through: If next router along path has input buffer space, then the flits in a packet will be pipelined. Otherwise, there is sufficient space to store an entire packet.

Wormhole routing: Uses pipelining, but has smaller buffers and uses flow-control to stall the flits of a packet, possibly among several routers. Prone to deadlock, so physical channels are divided into virtual channels that allow sufficient progress as long as cycle(s) of virtual channels are avoided.

Broadcasting Models:
One-to-all (ordinary broadcast) /All-to-all ('‘gossiping’’) / Personalized (individualized messages)
Single-port/Multi-port - degree of communication concurrency for a processor
Mono-directional/Bi-directional - meaning of an edge

Example: One-to-all broadcast on k-dimensional hypercube with node 0 as root.
Diameter is lower bound on number of rounds.

```
rootWork=0;
for (receiveDim=0; receiveDim < k; receiveDim++)
{
    if bit receiveDim of rank == 1
        Set bit receiveDim of rootWork
    if rootWork==rank
        break;
}
for (i=0; i < k; i++)
    if i\geq receiveDim
        if bit i of processorRank is 0
            Send data value to node rank +2 }\mp@subsup{2}{}{\textrm{i}
        else
            Receive data value from node rank - 2 }\mp@subsup{}{}{\textrm{i}
```

Easily adapted for arbitrary processor to be the root.
Example: All-to-all broadcast on k-dimensional hypercube using all links simultaneously
Lower bounds:

1. Diameter of hypercube $=\mathrm{k}$
2. In a given round, a processor may receive up to $k$ messages. Each processor needs to receive $2^{k}-1$ messages, so at least ceiling $\left(\left(2^{\mathrm{K}}-1\right) / \mathrm{k}\right)$ rounds are needed.

Bound 2 is more significant
$\mathrm{k} \quad \operatorname{ceiling}\left(\left(2^{\mathrm{k}}-1\right) / \mathrm{k}\right)$
$2 \quad 2$
$3 \quad 3$
$4 \quad 4$
$5 \quad 7$
$6 \quad 11$
$7 \quad 19$
$8 \quad 32$
$9 \quad 57$
$10 \quad 103$
$11 \quad 187$
$12 \quad 342$
Algorithm: Design a restricted one-to-all broadcast tree that can be easily replicated with any processor as the root of the broadcast. The algorithm will use no more than one link in each dimension in each round of communication..

First, we construct a graph that uses the necklace notion to group together processors whose addresses are the same under bitwise rotation. Two necklaces will be connected by an edge if some processor for one necklace is adjacent to some processor for the other necklace. For example, we use $\mathrm{k}=6$ :


A broadcast tree is designed for the value at processor 000000 by using depth-first search (breadth-first is also fine) to navigate among necklaces with k processors:


These 9 edges give the first 9 rounds of communication. The underlined bits emphasize that only one link for each dimension is used in each round.

| Round 1: | $000000 \rightarrow 00000 \underline{1}$ |
| :---: | :---: |
|  | $000000 \rightarrow 0000 \underline{10}$ |
|  | $000000 \rightarrow 000 \underline{100}$ |
|  | $000000 \rightarrow 001000$ |
|  | $000000 \rightarrow 010000$ |
|  | $000000 \rightarrow \underline{100000}$ |
| Round 2: | $000001 \rightarrow 0000 \underline{11}$ |
|  | $000010 \rightarrow 000 \underline{110}$ |
|  | $000100 \rightarrow 001100$ |
|  | $001000 \rightarrow 0 \underline{11000}$ |
|  | $010000 \rightarrow \underline{110000}$ |
|  | $100000 \rightarrow 10000 \underline{1}$ |
| Round 3: | $000011 \rightarrow 000 \underline{111}$ |
|  | $000110 \rightarrow 00 \underline{1110}$ |
|  | $001100 \rightarrow 0 \underline{11100}$ |
|  | $011000 \rightarrow \underline{111000}$ |
|  | $110000 \rightarrow 11000 \underline{1}$ |
|  | $100001 \rightarrow 1000 \underline{1}$ |
| Round 4: | $000111 \rightarrow 0 \underline{10111}$ |
|  | $001110 \rightarrow \underline{101110}$ |
|  | $011100 \rightarrow 01110 \underline{1}$ |
|  | $111000 \rightarrow 111010$ |
|  | $110001 \rightarrow 110101$ |
|  | $100011 \rightarrow 10 \underline{1011}$ |
| Round 5: | $010111 \rightarrow 01 \underline{1111}$ |
|  | $101110 \rightarrow 1 \underline{11110}$ |
|  | $011101 \rightarrow \underline{11101}$ |
|  | $111010 \rightarrow 11101 \underline{1}$ |
|  | $110101 \rightarrow 1101 \underline{1}$ |
|  | $101011 \rightarrow 101 \underline{11}$ |
| Round 6: | $011111 \rightarrow 0 \underline{0} 1111$ |
|  | $111110 \rightarrow \underline{0} 11110$ |
|  | $111101 \rightarrow 11110 \underline{0}$ |
|  | $111011 \rightarrow 1110 \underline{0} 1$ |
|  | $110111 \rightarrow 110 \underline{0} 11$ |
|  | $101111 \rightarrow 10 \underline{0} 111$ |
| Round 7: | $001111 \rightarrow 0011 \underline{1}$ |
|  | $011110 \rightarrow 011 \underline{10}$ |
|  | $111100 \rightarrow 11 \underline{100}$ |
|  | $111001 \rightarrow 1 \underline{1001}$ |
|  | $110011 \rightarrow \underline{0} 10011$ |
|  | $100111 \rightarrow 10011 \underline{0}$ |
| Round 8: | $001101 \rightarrow 00 \underline{0} 101$ |
|  | $011010 \rightarrow 0 \underline{0} 1010$ |
|  | $110100 \rightarrow \underline{0} 10100$ |
|  | $101001 \rightarrow 10100 \underline{0}$ |

$$
\begin{aligned}
& 010011 \rightarrow 0100 \underline{0} 1 \\
& 100110 \rightarrow 100 \underline{0} 10 \\
& \text { Round } 9: 000101 \rightarrow \underline{100101} \\
& 001010 \rightarrow 00101 \underline{1} \\
& 010100 \rightarrow 0101 \underline{1} 0 \\
& 101000 \rightarrow 101 \underline{1} 00 \\
& 010001 \rightarrow 01 \underline{1} 001 \\
& 100010 \rightarrow 1 \underline{1} 0010
\end{aligned}
$$

In rounds 10 and 11, the algorithm 'cleans up' for the necklaces with < k processors:

| Round 10: | $001011 \rightarrow 0010 \underline{0} 1$ | Necklace 001011 to necklace 001001 |
| :---: | :---: | :---: |
|  | $010110 \rightarrow 010 \underline{0} 10$ |  |
|  | $101100 \rightarrow 10 \underline{0} 100$ |  |
|  | $001011 \rightarrow 0 \underline{1011}$ | Necklace 001011 to necklace 011011 |
|  | $010110 \rightarrow \underline{1} 10110$ |  |
|  | $101100 \rightarrow 10110 \underline{1}$ |  |
| Round 11: | $000101 \rightarrow 0 \underline{10101}$ | Necklace 000101 to necklace 010101 |
|  | $001010 \rightarrow \underline{101010}$ |  |
|  | $111110 \rightarrow 111111$ | Necklace 011111 to necklace 111111 |

Several details that guarantee the success of the clean-up rounds have been omitted. These rounds are only needed when k is not prime. There is much flexibility in generating a solution, i.e. there are many alternate schemes.

The last detail is to replicate for broadcasts for other processors besides 000000 . This is easily done by taking the bits in the address of the broadcasting processor and exclusive or'ing this address onto the sending and receiving address for each of the transmissions in all rounds.

The end result is that all links will be busy in every round except perhaps the last one.

## Sorting Techniques:

## $\underline{\text { Odd-Even Transposition Sort }}$

Uses $\mathrm{n} / 2$ rounds (each with two transposition steps) to sort n values on linear array.

$$
\begin{aligned}
& 1 \leftrightarrow 67 \leftrightarrow 53 \leftrightarrow 42 \leftrightarrow 0 \\
& 16 \leftrightarrow 57 \leftrightarrow 34 \leftrightarrow 02 \\
& 1 \leftrightarrow 56 \leftrightarrow 37 \leftrightarrow 04 \leftrightarrow 2 \\
& 15 \leftrightarrow 36 \leftrightarrow 07 \leftrightarrow 24 \\
& 1 \leftrightarrow 35 \leftrightarrow 06 \leftrightarrow 27 \leftrightarrow 4 \\
& 13 \leftrightarrow 05 \leftrightarrow 26 \leftrightarrow 47 \\
& 1 \leftrightarrow 03 \leftrightarrow 25 \leftrightarrow 46 \leftrightarrow 7 \\
& 01 \leftrightarrow 23 \leftrightarrow 45 \leftrightarrow 67 \\
& 01234567
\end{aligned}
$$

## Mesh Sorting

Unlike other topologies, there is no single "obvious" desirable way to place an ordered sequence onto a mesh for 2 dimensions, much less higher dimensions. It is usually taken that given a good way to obtain a particular arrangement, it is not a problem to permute the arrangement (based on static routing for meshes, discussed earlier). The following "shearsort" algorithm for 2-d meshes is practical (uses odd-even transposition in rows or columns), but not optimal (misses by a logarithmic factor). The algorithm takes $\mathrm{N}(2 \log \mathrm{~N}+1$ ) steps where the mesh is Nx N . The output is in "snakelike" ordering for rows.

Phases $1,3,5, \ldots, 2 \log (N)+1$ sort all rows (in $O(N)$ parallel time for each phase)
Odd rows are sorted so that smaller numbers are at the left. Even rows are sorted so that smaller numbers are at the right.
Phases $2,4,6, \ldots, 2 \log (N)$ sort all columns (in $O(N)$ parallel time for each phase)
Columns are sorted with smaller numbers at the top
Example:

| 10 | 2 | 12 | 8 |
| :---: | :---: | :---: | :---: |
| 16 | 5 | 1 | 14 |
| 3 | 9 | 7 | 13 |
| 6 | 15 | 4 | 11 |


| 2 | 8 | 10 | 12 |
| :---: | :---: | :---: | :---: |
| 16 | 14 | 5 | 1 |
| 3 | 7 | 9 | 13 |
| 15 | 11 | 6 | 4 |


| 2 | 7 | 5 | 1 |
| :---: | :---: | :---: | :---: |
| 3 | 8 | 6 | 4 |
| 15 | 11 | 9 | 12 |
| 16 | 14 | 10 | 13 |


| 1 | 2 | 5 | 7 |
| :---: | :---: | :---: | :---: |
| 8 | 6 | 4 | 3 |
| 9 | 11 | 12 | 15 |
| 16 | 14 | 13 | 10 |


| 1 | 2 | 4 | 3 |
| :---: | :---: | :---: | :---: |
| 8 | 6 | 5 | 7 |
| 9 | 11 | 12 | 10 |
| 16 | 14 | 13 | 15 |


| 1 | 2 | 3 | 4 |
| :---: | :---: | :---: | :---: |
| 8 | 7 | 6 | 5 |
| 9 | 10 | 11 | 12 |
| 16 | 15 | 14 | 13 |

Why does it work? The algorithm is oblivious (since odd-even transpose is also oblivious.). By the $0-1$ sorting lemma, if an oblivious algorithm will correctly sort any input sequence with 0 s and 1 s , then the algorithm will sort any sequence correctly. (The proof of the $0-1$ sorting lemma shows that if an oblivious sort fails on some sequence, then there exists a sequence of 0 s and 1 s that will also make the algorithm fail.)

## Problems:

1. Route the following permutation on a Benes network and a hypercube (e.g. butterfly):

$$
\left(\begin{array}{llllllll}
0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\
4 & 5 & 6 & 7 & 0 & 1 & 2 & 3
\end{array}\right)
$$

2. Use Gray codes to show that a 64-node hypercube is isomorphic to a $4 \times 4 \times 4$ torus.
3. Show how to achieve the following mesh permutation using perfect matching and linear array sorting.

4. Derive an all-to-all broadcast scheme for a 4-d hypercube similar to the one used for 6-d hypercubes.
5. How many automorphisms are there for a complete binary tree with $\mathrm{h}=4$ ?
6. How many vertex equivalence classes does a $4 \times 5$ torus have? A $4 \times 5$ mesh?
7. How many automorphisms does a 5-node ring have?
8. For purposes of this exercise only, let us define a class of networks to be scalable if a larger network in that class may be constructed from smaller network(s) of that class by only including additional vertices and edges. In particular, the drastic measures of removing edges or vertices are prohibited. Indicate which network classes are scalable and which are not.
9. Consider an r-dimension array with $\mathrm{N}=\mathrm{N}_{1}=\mathrm{N}_{\mathrm{r}}$ and N is odd. What is the bisection width?
10. Draw the fat tree with depth $=2$ and degree $=3$.
11. Route the following permutation on a Benes network and a hypercube (e.g. butterfly):

$$
\left(\begin{array}{llllllll}
0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\
4 & 5 & 6 & 7 & 0 & 1 & 2 & 3
\end{array}\right)
$$



(ल)
(2)
(1) $0^{2}$


2. Use Gray codes to show that a 64 -node hypercube is isomorphic to a $4 \times 4 \times 4$ torus.

Each torus node has a three component address ( $\mathrm{x}, \mathrm{y}, \mathrm{z}$ ) where each component value is $0,1,2$, or 3 . To map to a hypercube, each of the three components is mapped to two bits in the hypercube address using the 2-bit Gray code:

| torus address <br> component | hypercube |
| :---: | :---: |
| 0 | 00 |
| 1 | 01 |
| 2 | 11 |
| 3 | 10 |

To see that adjacencies are preserved, consider torus node $(1,2,3)$ and its neighbors:
torus address hypercube address

| $(1,2,3)$ | 011110 |
| :--- | :--- |
| $(0,2,3)$ | $0 \underline{0} 1110$ |
| $(2,2,3)$ | $\underline{111110}$ |
| $(1,3,3)$ | $011 \underline{0} 10$ |
| $(1,1,3)$ | $01 \underline{0} 110$ |
| $(1,2,2)$ | $01111 \underline{1}$ |
| $(1,2,0)$ | $0111 \underline{0}$ |

3. Show how to achieve the following mesh permutation using perfect matching and linear array sorting.


Start
Column


Matching 3


Matching 4

| 3,1 | 4,2 | 2,3 | 1,4 | 3,1 | 4,2 | 2,3 | 1,4 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 3,2 | 1,3 | 3,4 | 2,1 | 2,1 | 3,2 | 1,3 | 3,4 |
| 4,3 | 2,4 | 1,1 | 1,2 | 1,1 | 1,2 | 4,3 | 2,4 |
| 4,4 | 3,3 | 2,2 | 4,1 | 4,1 | 2,2 | 3,3 | 4,4 |

Sort within rows based on column destination


Sort within columns based on column destination
4. Derive an all-to-all broadcast scheme for a 4-d hypercube similar to the one used for 6-d hypercubes.

Lower bounds indicate that 4 rounds are sufficient.

5. How many automorphisms are there for a complete binary tree with $\mathrm{h}=4$ ?


Since each parent node has two orientations for its children, there are $2^{15}=32768$ automorphisms
6. How many vertex equivalence classes does a $4 \times 5$ torus have? A $4 \times 5$ mesh?

Torus: 1
Mesh: 6
7. How many automorphisms does a 5-node ring have?

10
8. ... Indicate which network classes are scalable and which are not.

Scalable: linear array, mesh, binary tree, butterfly, fat tree, hypercube, benes
Not scalable: ring, torus
9. Consider an r-dimension array with $\mathrm{N}=\mathrm{N}_{1}=\mathrm{N}_{\mathrm{r}}$ and N is odd. What is the bisection width? $\left(\mathrm{N}^{\mathrm{r}}-1\right) /(\mathrm{N}-1)$ by using a geometric $\operatorname{sum}\left(1+\mathrm{N}^{1}+\mathrm{N}^{2}+\ldots+\mathrm{N}^{\mathrm{r}-1}\right)$.
10. Draw the fat tree with depth $=2$ and degree $=3$.


