# IBM Research Report 

# Locally Connected Processor Arrays for Matrix Multiplication and Linear Transforms 

Chai Wah Wu<br>IBM Research Division<br>Thomas J. Watson Research Center<br>P.O. Box 704<br>Yorktown Heights, NY 10598

[^0]
# Locally Connected Processor Arrays for Matrix Multiplication and Linear Transforms 

Chai Wah Wu<br>IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA<br>contact email: chaiwahwu@ieee.org


#### Abstract

Cellular Neural Networks is a multiprocessor computing architecture where the processors are only directly connected to nearby processors. This results in a trade off between the number of connections between processors and the number of steps needed to perform global computation. We consider such a locally connected computing architecture and present some preliminary analysis on this trade off and study this architecture's applicability to the specific problem of matrix multiplication including linear transforms applications such as 1-D and 2D DCT and DWT. We illustrate that in general there is a trade-off between the following 3 parameters: the number of iterations needed to perform the global computation, the amount of memory in each processor and the connectedness of the graph. This latter parameter is expressed as the relative diameter of the computer architecture graph with respect to the problem graph.


## I. Introduction

Cellular Neural Networks (CNN) [1] is a multiprocessor computing architecture where the processors are only directly connected to nearby processors. Each processor contains compute units and memory. At the outset, each processor's contains data pertinent to the problem at hand. This is a particularly common arrangement in CNN vision-related applications where each processor is located near a sensor and processes the sensor data [2]. How the processors are connected plays an important role in the capabilities of the system. The locality of the processors' communication means that it cannot perform global computations that require knowledge of the data in all the processors, at least not in one step. However, one can trade off space connectivity with time. Over several iterations, data dissemination from processors to neighboring processors will be able to propagate information to all the processors allowing global computation to occur.

In this paper, we look at some aspects of this trade-off by studying a general locally connected computing architecture. In particular, the feasibility of using such an architecture for performing matrix multiplication will be investigated. As an example, we illustrate the possibility of using such an architecture for performing DCT and DWT.

The question of locality in distributed algorithms was studied in [3], [4] where the focus is on graph theoretical algorithms such as graph coloring and finding the maximal independent set. Our focus here is on signal and image processing applications that CNN is especially suitable for.

## II. Notations and definitions

Definition 1 (Relative diameter): Given two graphs with the same vertex set $V$, the diameter of graph $G_{A}=$ $\left(V, E_{A}\right)$ relative to graph $G_{B}=\left(V, E_{B}\right)$ is defined as $\max _{(v, w) \in E_{B}} \operatorname{dist}_{G_{A}}(v, w)$, where $\operatorname{dist}_{G_{A}}(v, w)$ is the distance between vertices $v$ and $w$ in graph $G_{A}$. We denote this as $\operatorname{diam}_{G_{B}}\left(G_{A}\right)$.

Note that the normal definition of diameter is the same as the diameter relative to the complete graph. Also note that the diameter of a connected graph relative to any graph is finite.

Definition 2: The $r$-neighborhood of a vertex $v$ in graph $G$ (denoted as $B_{r}(v, G)$ ) is defined as the set of vertices that is at most a distance $r$ from $v$. We adopt the convention that $v \in B_{r}(v)$.

Definition 3: The matrix of a graph $(V, E)$ of order $n$ is an $n$ by $n$ matrix $A$ such that $A_{i j}=1$ if and only if $\left(v_{i}, v_{j}\right) \in E$ and 0 otherwise.
Note that the matrix of a graph depends on a specific labeling of the vertices to the integers $\{1, \cdots, n\}$.

Definition 4: The graph of a $n$ by $n$ matrix $A$ is the graph on $n$ vertices such as there is an edge between vertex $i$ and vertex $j$ if and only if $A_{i j} \neq 0$.

We consider the following locally connected computing architecture. A computer $\left(G, M,\left\{f_{i}\right\},\left\{u_{i}\right\}\right)$ consists of processors as vertices located on a graph $G=(V, E)$, where for each $i \in V$, the processor $v_{i}$ contains $M$ (local) memory locations $m_{i}^{j}, j=1,2, \ldots, M$. At each iteration $t$, each processor $v_{i}$ computes a function using the data in its memory locations and the memory locations of its neighbors and stores them in the memory locations, overwriting the previous content, i.e. it computes $m_{i}^{j}(t)=f_{i}^{j}\left(\cup_{k \in B_{1}\left(v_{i}, G\right), 1 \leq l \leq m} m_{k}^{l}(t-1), t\right)$ for each $1 \leq j \leq M$. The output of the computer is defined as $y_{i}(t)=u_{i}\left(\cup_{j} m_{i}^{j}(t)\right)$ for each $i \in V$.

## III. Trade-off between connectivity, time and MEMORY

In a CNN architecture the processors are locally connected, i.e. the graph $G$ is a locally connected graph. We would like to know what the trade-offs are in terms of connectivity of the graph, the time it takes to compute the result and the memory needed. The following simple result shows that it is possible to compute anything a more densely connected architecture can do, but at a cost of increased time and memory:

Theorem 1: Assume that $G_{A}$ and $G_{B}$ have the same vertex set $V$. Consider a computer $\mathcal{C}_{B}=\left(G_{B}, M_{B},\left\{f_{i}^{B}\right\},\left\{u_{i}^{B}\right\}\right)$
computing $\left\{y_{i}^{j}\right\}$ in one step. Then for any graph $G_{A}$ there exist a computer $\mathcal{C}_{A}=\left(G_{A}, M_{A},\left\{f_{i}^{A}\right\},\left\{u_{i}^{A}\right\}\right)$ computing $\left\{y_{i}\right\}$ in $d$ steps where $d=\operatorname{diam}_{G_{B}}\left(G_{A}\right)$ and $M_{A}=|V| M_{B} .{ }^{1}$
Sketch of Proof: Since $M_{A}=|V| M_{B}$, each processor $v$ in $\mathcal{C}_{A}$ has enough storage to store the data in all the processors of $\mathcal{C}_{B}$. For $1 \leq t<d$, let $f_{i}^{A}$ be defined as the function that propagate the data in the memory to the memory of its neighbors in the appropriate slots. At iteration $d-1$, all the data in each processor or its neighbors in $G_{B}$ is either in local memory or the memory of its neighbors in $G_{A}$. The $d$-th iteration will serve to compute the final result.

This result shows that the number of iterations needed to mimic the other machine is $d$. The discussion about the propagation of data in the proof also shows that this is the minimal number of iterations needed; there exists computations for which at least $d-1$ steps are needed to propagate the data to all the necessary nodes.

However, the requirement on the memory of $\mathcal{C}_{A}$ in Theorem 1 is quite conservative and we like to know what the minimal amount of needed "extra" memory is. In the following section, we look at this issue for the problem of matrix multiplication which include linear transforms such as DCT and DWT.

## IV. Matrix multiplication

Consider the case where each $f_{i}$ computes vector weighted sums of data in the memory locations of the processor and its neighbors and store the result into the local memory. Thus each iteration corresponds to a matrix multiplication operation with the vector of data in memory.

For the simplest case where each processor has a scalar memory location (i.e. $M=1$ ), the corresponding problem is the following: Given a matrix $A$ and a graph $G$, find a set of $d$ matrices $A_{i}$ such that $\left\|A-\pi_{i} A_{i}\right\| \leq \delta$ for some $\delta>0$ under the constraint is that the graph of each $A_{i}$ is a subgraph of $G .^{2}$ What can we say about $d$, the number of matrices in the set $\left\{A_{i}\right\}$ ? It is easy to show that in general the number $d$ needs to be at least $\operatorname{dist}_{G_{A}}(G)$ where $G_{A}$ is the graph of the matrix $A$. Furthermore, by counting the degree of freedom in $A$ and $A_{i}$ as the number of nonzero elements, we see that in general we also require $d$ to be at least $\left\lceil\frac{n(A)}{n(G)}\right\rceil$ where $n(A)$ is the number of nonzero elements in $A$ and $n(G)$ is the number of nonzero elements of the adjacency matrix of $G$. For a specific matrix $A$, this latter constraint on $d$ might not be needed.

## A. Discrete Cosine Transform (DCT)

Consider the $n$ by $n$ orthogonal matrix $C$ corresponding to the 1-D DCT of size $n: C_{1 j}=\frac{1}{\sqrt{n}} \cos \left(\frac{\pi}{n}\left(j-\frac{1}{2}\right)(i-1)\right)$ for $j=1, \cdots, n$ and $C_{i j}=\sqrt{\frac{2}{n}} \cos \left(\frac{\pi}{n}\left(j-\frac{1}{2}\right)(i-1)\right)$ for $i=2, \cdots, n, j=1, \cdots, n$.

Consider a CNN topology where the processors are arranged in a line and each processor only communicates with its

[^1]nearest neighbors. This means the the underlying graph is the path graph $P_{n}$. The graph of the matrix $C$ is the complete graph $K_{n}$. To implement this on a path graph $P_{n}$ topology would require at least $n-1$ iterations since the diameter of $P_{n}$ is $n-1 .{ }^{3}$ Let us consider the case $n=8$. For various values $k$, we use a line-search algorithm to find a set of $k$ matrices $A_{i}$ such that $\left\|C-\pi A_{i}\right\|$ is minimized with each of the graph of $A_{i}$ a subgraph of $P_{8}$. The matrix norm used will be the Frobenius norm. Figure 2 shows the minimal value of this norm found for various values of $k$ and various types of graphs. As can be seen, a transition occurs at $k=7$ for the path graph, in accordance with the above conclusion that at least 7 matrices are needed to approximate $C$.

As an illustration, the following set of 7 matrices $A_{i}$ when multiplied together generate a matrix $A_{1} A_{2} \cdots A_{7}$ that is close to the 1-D DCT matrix $C$ of order 8 . Each $A_{i}$ is tridiagonal and thus has $P_{8}$ as its graph.


[^2]Consider now processors arranged on a 2-D grid graph (Fig. 1). Since the 2-D DCT is separable into 1-D DCTs applied to the rows and columns sequentially, the result for the path graph can be applied here first row-wise and then column-wise and this shows that the 2-D DCT of order 64 can be computed on the grid graph topology in $7+7=14$ iterations. Thus we can implement an order 64 2-D DCT on a CNN architecture where each processor performs 3 multiply-and-add's in parallel and repeat for 14 iterations.


Fig. 1. 2-D 8 by 8 grid graph.
Figure 2 also shows that the star graph topology and the balanced binary tree ${ }^{4}$ topology (Fig. 3) can approximate $C$ using fewer matrices at $k=6$ and $k=5$ respectively.


Fig. 2. Minimum Frobenius norm of $C-\pi_{i} A_{i}$ for various values of $k$ and graph topologies.

This indicates that among the 3 types of spanning trees (path, star and balanced binary tree), the binary tree is a better

[^3]

Fig. 3. Balanced binary tree of 8 vertices.
topology to compute the DCT. From the degree of freedom discussion earlier, a matrix whose graph is a spanning tree will have $8+7+7=22$ nonzero entries, and thus we expect to require at least $\left\lceil\frac{64}{22}\right\rceil=3$ matrices $A_{i}$ to form an adequate approximation of $C$. Numerical experiments indicate that among all spanning trees (which necessarily have 7 edges) at least 5 matrices are needed. On the other hand, by adding a single edge to the balanced binary tree in Fig. 3, we obtain the following graph of 8 edges (Fig. 4), for which only 4 matrices are needed.


Fig. 4. A graph of 8 edges for which only 4 matrices are needed to approximate $C$.

How many edges does the graph need to have in order to require only 3 matrices to approximate $C$ ? It turns out a graph of 10 edges is sufficient. For instance, the graph in Fig. 5 obtained by adding 3 edges to Fig. 3 is such that only 3 matrices with this graph as the matrix graph are sufficient to approximate $C$. For instance the following 3 matrices when multiplied result in a matrix that is close to $C$.




Fig. 5. A graph of 10 edges for which only 3 matrices are needed to approximate $C$. The diameter of this graph is 3 .

## B. Discrete Wavelet Transform (DWT)

The number of steps needed to compute a full matrix $A$ using sparse matrices $A_{i}$ depends on the entries of $A$. Let us study another widely use linear transform and compare it with the DCT. Consider the 1-D 3-stage Cohen-DaubechiesFeauveau 9/7 discrete wavelet transform of order 8 and apply the same procedure to construct a sequence of matrices whose product approximates the DWT matrix (denoted as $W$ ). In Table I we show the number of steps needed to approximate the DCT matrix $C$ and the DWT matrix $W$ for various graph topologies:

|  | diameter | DCT matrix $C$ | DWT matrix $W$ |
| :--- | :--- | :--- | :--- |
| Path graph $P_{8}$ | 7 | 7 | 7 |
| Star graph | 2 | 6 | 7 |
| Balanced tree (Fig. 3) | 4 | 5 | 5 |
| Fig. 4 | 3 | 4 | 4 |
| Fig. 5 | 3 | 3 | 3 |

TABLE I
Number of steps to approximate the matrices $C$ and $W$ for VARIOUS GRAPH TOPOLOGIES.

We see that approximating the DWT matrix is similar to the DCT case (with the exception of the star graph), although we noticed that the optimization algorithm seems to have more difficulty finding a solution in the DWT case, suggesting that the DWT is a "harder" transform than the DCT.

## V. Trade-off between memory and number of ITERATIONS

Recall that from Theorem 1 the global computation can be performed in $d$ steps on a locally connected architecture with sufficient auxiliary memory storage. On the other hand, the numerical results above show that for the 2 graphs in Figs. $3-4$, the number of steps required is 1 more than the diameter $d$. This is because the number of memory locations in each processor is 1 , the same as the global problem of computing $C$ and presumably there is not enough degrees of freedom
for the sparse matrices to approximate $C$. If we increase the number of memory locations in each processor to 2 , and have each processor compute a weighted sum of its own data and those of its neighbors, then numerical experiments show that we could approximate the matrix $C$ in $d$ steps, the minimal required.

Consider next the star graph. This graph has diameter 2. For $M=1$, the number of steps required is 6 . Figure 6 shows that as we add memory to each processor, the number of steps needed decreases until at $M=5$ the number of steps needed is 2 , which is the minimum dictated by the diameter.


Fig. 6. Number of iterations needed to approximate $C$ versus the number of memory locations in each processor for the star graph topology.

## VI. CONCLUSIONS

We consider a locally connected computer architecture such as the CNN and study its feasibility in implementing global computations such as DCT and DWT. In particular, we show that for the 2-D DCT of order 64, processors located on a locally connected 2-D grid can approximate it in 14 iterations. In general, there is a trade-off between the following 3 parameters: the number of iterations, the amount of memory in each processor and the connectedness of the graph. This latter parameter is expressed as the relative diameter of the computer architecture graph with respect to the problem graph. More research is needed to determine the exact nature of this tradeoff and perhaps the insights gained can be used to choose the optimal topology in a CNN architecture design.

## References

[1] L. O. Chua and T. Roska, "The CNN paradigm," IEEE Transactions on Circuits and Systems-I, vol. 40, pp. 147-156, mar 1993.
[2] B. J. Sheu, K.-B. Cho, and W. C. Young, "Integration of sensor/processor under cellular neural networks paradigm for multimedia applications," in Proceedings of Fifth IEEE International Workshop on Cellular Neural Networks and Their Applications (CNNA), 1998, pp. 45-49.
[3] B. Awerbuch, M. Luby, A. Goldberg, and S. Plotkin, "Network decomposition and locality in distributed computation," in Proceedings of 30th Annual Symposium on Foundations of Computer Science (FOCS), 1989, pp. 364-369.
[4] N. Linial, "Locality in distributed graph algorithms," SIAM J. Computing, vol. 21, no. 1, pp. 193-201, 1992


[^0]:    $\overline{\overline{\underline{E}} \overline{\overline{\underline{E}}} \overline{\bar{E}}}$
    $\underline{\underline{\underline{E}}}$
    Research Division
    Almaden - Austin - Beijing - Cambridge - Haifa - India - T. J. Watson - Tokyo - Zurich

[^1]:    ${ }^{1}$ Assuming that $d$ is finite.
    ${ }^{2}$ Since each processor can access its own local memory, we assume that all the graphs $G$ we consider has self-loops added, i.e. there is an edge from each vertex to itself.

[^2]:    ${ }^{3}$ One can see that the FFT butterfly network is equivalent to the case where the coupling graph can change at each iteration. At each iteration, each graph is a disconnected set of $n / 2$ edges. Because of the time-varying nature, each node can connect to every other node in $\log _{2}(n)$ iterations.

[^3]:    ${ }^{4}$ defined as a binary tree with the smallest diameter.

