Toward a General Parallel Operating System Using Active Storage Fabrics on Blue Gene/P

We propose an architecture for a General Parallel Operating System (GPOS) that makes it easier to exploit Massively Parallel Processing (MPP) machines efficiently. Technology trends, particularly the apparent end of semiconductor frequency scaling and the limits to symmetric multiprocessor scaling, are driving a general interest in extending the reach of MPP machines. GPOS aims to enable the reuse of parallel utilities by end users in much the same way that Unix enables reuse of serial utilities via files and pipes. If GPOS is successful, it will be possible to efficiently exploit MPP machines like Blue Gene[5] using skills ranging from programming in scripting languages to MPI. In contrast to some efforts to use distributed computing for commercial applications, GPOS leverages lessons about achieving scalability learned in the HPC world. Large portions of the GPOS effort involve innovative integration and configuration of preexisting, successful parallel software packages and technologies, which somewhat mitigates the risk of the inherently aggressive GPOS goals. Work on GPOS has produced an early prototype, and while significant work and exploration clearly lie ahead, the early results are promising.

Our work has focused on a shared storage model with embedded parallel processing, an approach we call Active Storage Fabrics (ASF). ASF uses a Parallel In-Memory Database (PIMD) to allow the persistence of structured user data between concurrent and/or successive parallel job steps. PIMD is a client/server key/value database with an interface like gdbm or BerkeleyDB. PIMD servers run on each node of the Blue Gene/ASF partition and are accessible from both embedded and external applications through a PIMD client library. An active in-memory file system is created by modifying GPFS to use PIMD as backing store. This enables rapid access to files, including embedded application executables and their operands. Active storage embedded parallel applications run in "virtual partitions", which are mapped to a subset of the physical partition hardware resources, run concurrently or successively, and are able to share data via PIMD and/or GPFS. By allowing persistent (in-memory, between jobs) structured storage shared by embedded parallel modules, reducing overheads, and meeting standard interfaces, we will enable the integration of Blue Gene with standard enterprise and scientific IT infrastructure.
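The report does not publish the PIMD client API itself. As a rough illustration of the gdbm/BerkeleyDB-style key/value usage pattern described above, the following sketch uses Python's standard dbm module as a stand-in for a PIMD client; the key names and stored values are hypothetical, chosen only to show one job step persisting structured data that a later (or concurrent) step retrieves by key.

```python
# Illustrative only: mimics the gdbm/BerkeleyDB-like key/value interface
# attributed to PIMD, using Python's stdlib dbm module as a local stand-in.
# In ASF, the store would be served by PIMD servers on the partition nodes,
# not by an on-disk file as here.
import dbm.dumb as dbm  # portable pure-Python dbm backend
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "pimd_demo")

# One parallel job step stores structured results under string keys...
with dbm.open(path, "c") as db:
    db[b"atom:42:coords"] = b"1.0,2.0,3.0"
    db[b"atom:42:charge"] = b"-0.8"

# ...and a successive job step fetches them by key, with no intervening
# serialization to a conventional file system.
with dbm.open(path, "r") as db:
    coords = db[b"atom:42:coords"].decode()
```

The point of the pattern is that the key/value store, rather than files and pipes, becomes the shared medium between parallel modules; in ASF this store is held in memory across job steps.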

The GPOS objective is an integrated system that, to the end user on a Front End Node (FEN), looks and feels like a standard general-purpose server rather than an HPC supercomputer or cluster. Many users will experience GPOS as an accelerator attached to a platform they already use, for example a Unix/Linux machine running commands and libraries they already know. At the same time, it provides an environment in which the skilled parallel programmer can see their work reused as easily as a Unix utility. This generality will carry some overhead compared to writing a single, monolithic HPC MPP solution, just as Unix was less efficient than MS/DOS for some programs; our challenge is to drive those overheads down using what we have learned from scaling HPC programs. The lower the overheads, the more fine-grained the parallel modularization may be and the larger the number of MPP nodes that may be used in a single work stream. Eventually, modularity and efforts to reduce overhead will allow GPOS to be used routinely to create solutions more efficiently than today's HPC or cluster programming approaches.

The remainder of this paper discusses the approach we are taking to realizing GPOS and highlights some of the areas we are exploring.

By: Blake G. Fitch, Aleksandr Rayshubskiy, T. J. C. Ward, Robert S. Germain

Published as: IBM Research Report RC24586, 2008

LIMITED DISTRIBUTION NOTICE:

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

