Evaluation of a Multithreaded Architecture for Cellular Computing

Cyclops is a new architecture for high performance parallel computers being developed at the IBM T. J. Watson Research Center. The basic cell of this architecture is a single-chip SMP system with multiple threads of execution, embedded memory, and integrated communications
hardware. Massive intra-chip parallelism is used to tolerate memory and functional unit latencies. Large systems with thousands of chips can be built by replicating this basic cell in a regular pattern. In this paper we describe the Cyclops architecture and evaluate two of its new hardware features: memory hierarchy with flexible cache organization and fast barrier hardware. Our experiments with the STREAM benchmark show that a particular design can achieve a sustainable memory bandwidth of 40 GB/s, equal to the peak hardware bandwidth and similar to the performance of a 128-processor SGI
Origin 3800. For small vectors, we have observed in-cache bandwidth above 80 GB/s. We also show that the fast barrier hardware can improve the performance of the Splash-2 FFT kernel by up to 10%. Our results demonstrate that the Cyclops approach of integrating a large number of simple processing elements and multiple memory banks in the same chip is an effective alternative for designing high performance systems.

By: Gheorghe C. Cascaval, Jose G. Castanos, Luis H. Ceze, Monty M. Denneau, Manish Gupta, Derek Lieber, Jose Moreira, Karin Strauss, Henry S. Warren Jr

Published in: Proceedings of the Eighth International Symposium on High Performance Computer Architecture. Los Alamitos, CA, , IEEE Computer Society Press, p.311-21 in 2002

Please obtain a copy of this paper from your local library. IBM cannot distribute this paper externally.

Questions about this service can be mailed to reports@us.ibm.com .