Hardware and Operating System Design For a Cluster-Based Scalable Shared Memory System

        This paper describes PRISM, a hardware cache-coherent distributed shared memory (DSM) architecture based on a cluster of SMP nodes, where each node runs an independent operating system kernel. The architecture relies on tight coordination between the hardware and OS to solve the problems of scalability and fault containment in such systems. For scalability, the hardware and OS interactions are highly localized to a node, even for potentially global operations such as page faults and page migration. Inherently global operations, such as the mapping of global objects, amortize their cost over a large number of pages. For fault containment, the system avoids using physical addresses as global names to prevent wild writes on a malfunctioning node from affecting other nodes. For performance, the system allows efficient and flexible software control over memory and caching behavior (e.g., CC-NUMA vs. Simple-COMA) of each page of memory. Moreover, the control is performed independently at each node.
        All of these features are provided within a single unified system design. We are currently implementing the system and find its complexity to be equivalent to that of existing hardware DSM systems. We discuss the features of SMP nodes that are vital for building efficient DSM systems out of SMP clusters. We also discuss the features of the cluster interconnect that are necessary for performance and for avoiding deadlock. We present simulation results that demonstrate the performance benefits of the architecture over previous designs.

By: Kattamuri Ekanadham, Mark Giampapa, Joefon Jann, Beng-Hong Lim, Pratap Pattnaik, Marc Snir, Alan Benner, Dean Liberty, David Sadler, Gautam Shah

Published in: RC21318 in 1998

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

