
Open Resilient Cluster Manager (ORCM)


The Open Resilient Cluster Manager (ORCM, or OpenRCM) is an open-source project focused on developing an "always on" resource manager for high-performance computing systems of any size. The objectives for the system are:

  • Keep a running application operating despite single or multiple failures of processes within that application.
  • Proactively detect incipient failures (hardware and/or software) and respond appropriately to maintain overall system operation.
  • Support both MPI and non-MPI applications.
  • Provide a research platform for exploring new concepts and methods in resilient systems.

Both open and proprietary elements are expected to be incorporated into the ORCM system. The overall architecture is therefore built upon the ORTE and OPAL layers of the Open MPI project. Development within the ORCM effort frequently touches both communities, contributing to improved Open MPI capability as well as advanced ORCM features.

Several features distinguish OpenRCM from other common resource managers, including (but not limited to):

  • Full utilization of component architecture methods to provide a platform for research and production code to coexist and be tested in actual production environments.
  • A focus on fault prediction, integration with embedded state-of-health sensors, and proactive response to both hardware and software faults.
  • Support for dynamically adding resources to, and removing them from, running multi-node applications, allowing for "on-the-fly" removal and replacement of nodes without stopping applications.
  • A built-in communications library for resilient applications that automatically maintains communications in the presence of failed processes.
  • An architecture designed to support platforms ranging from small embedded multi-processor systems to large-scale high-performance computing clusters.

Current Status

OpenRCM is currently under development, with an initial release expected in early 2010. Interested parties are welcome to get a developer's checkout from our Subversion repository (sorry, no tarballs available yet). Of course, while we do our best to ensure the development trunk will always build and run, we cannot guarantee the stability of that code base. Please feel free to advise us of problems, and to offer suggestions for improvement, on the appropriate mailing list.

Instructions on how to build OpenRCM, and the required Open MPI support, are provided in the HACKERS file at the top of the OpenRCM code base.

Questions and bugs

Questions, comments, and bug reports should be sent to the ORCM mailing lists.

Also be sure to see the ORCM wiki and bug tracking system for information relating to ORCM's design and the project.

History / credits

OpenRCM was originally conceived and developed within Cisco Systems, Inc. as an advanced resource manager for high-performance router control systems. Given the potential cross-over application to the HPC community, and the involvement of several major universities in the project, the decision was made to release this project as open source in the hope that others may benefit from it and contribute to its evolution.

