In this paper, we present a new model for resource management in a distributed-memory parallel system that combines time-sharing, space-sharing and load-sharing scheduling policies in a controllable manner and integrates them with fault tolerance. The model supports the differing scheduling requirements of the diverse applications that general-purpose parallel systems must run. Based on our model, we have implemented a prototype scheduler that incorporates a new space-sharing strategy and a novel gang-scheduling scheme. The scheduler is extensible, in that new scheduling paradigms and techniques can be easily incorporated. The scheduler is itself fault-tolerant and can recover from node failures. We present performance results for the scheduler running on the IBM SP2 and on a cluster of workstations under a variety of workloads. The results show that our scheduler improves the performance of a variety of parallel applications within a cluster.
By: Nayeem Islam, Andreas Prodromidis, Mark Squillante, Ajei Gopal and Liana Fong
Published in: Proceedings of the 17th International Conference on Distributed Computing Systems, IEEE, pp. 561-569, 1997