The official, permanent reference for this site is: "XGrid agent for Unix architectures" available at http://www.novajo.ca/xgridagent/.
In January 2004, Apple released XGrid, a simple system for setting up and using a cluster of OS X machines. It is very simple to use compared to other grid or cluster systems and shortens the learning curve for cluster computation. The appeal of XGrid is that it shields the end user from the details of the cluster. What is missing to make it even more powerful is an agent for architectures other than Mac OS X (in Xgrid terminology, the agent is the computer performing the computation). This matters because the computer infrastructure available to scientists is not always based on Mac OS X, and universities have a significant investment in various Unix platforms that should not be neglected when running computations in clusters.
This article introduces the first working Xgrid agent for Linux and other Unix systems that can be integrated into any XGrid cluster (managed by OS X). The agent compiles and runs on Linux (at least Debian and RedHat), Solaris (minimal testing) and Darwin (tested). You still need an OS X machine for the controller and for actually using XGrid (with XGrid.app). Also, the user currently needs to "be aware" that the cluster is multi-architecture, since the XGrid controller itself does not know. Examples are provided to show you how to deal with this.
There are other articles about Xgrid and POV-Ray here. For comments, here is my contact info at the Ontario Cancer Institute (University of Toronto), Biophotonics group.
Necessary requirements to compile the agent:
I will give instructions to compile everything in your home directory tree (that is: you don't need root privileges). I haven't encountered any problems myself but let me know if you do.
If libxml2 is not installed on your system (check with xml2-config --libs), you can compile and install it the standard way:
or with DarwinPorts: sudo port install libxml2
or with DarwinPorts: sudo port install glib2
Roadrunner is not available through DarwinPorts.
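Before building anything, it can help to check which libraries are already visible on your system. The following is only a minimal sketch and not part of the original instructions: the helper name report_lib is made up for illustration, and the configure line it prints assumes a standard autoconf-style source tree installed under your home directory.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): report whether a *-config
# script is on the PATH and, if so, print the linker flags it advertises.
report_lib() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: $("$1" --libs)"
    else
        echo "$1: not found"
        return 1
    fi
}

# libxml2 ships xml2-config; a non-zero status means it has to be built.
if ! report_lib xml2-config; then
    # Assumption: a standard autoconf source tree, installed without root.
    echo "build it with: ./configure --prefix=\"\$HOME\" && make && make install"
fi
```

If you install under your home directory this way, xml2-config ends up in $HOME/bin, so that directory has to be on your PATH for the check to succeed afterwards.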
Thanks to Chris Baker for a gcc 2.95.x patch and Justin Gullingrud for the RedHat9 libxml2 patch.
You will get one warning: /home/dccote/xgridagent-rr/xgridagent.c:346: warning: the use of `tmpnam' is dangerous, better use `mkstemp'. Don't worry about it for now: that's the least of your problems. Don't run the agent as root: run it as a regular user, because there are a lot of vulnerabilities in the code. Try to run the agent with:
to connect to a controller (you can start a controller by hand in the terminal of an OS X machine with /usr/libexec/xgrid/GridServer). You will get a lengthy, verbose description of what the agent is doing; adjust the message level in xgrid.config.xml. You must not use XGrid passwords on either the controller or the agents (password support is not implemented yet, although I don't think it would be hard). You can then connect to the controller using the XGrid.app application and start testing your cluster with Linux agents (see the limitations below).
Several notes on compilation:
The agent loads most of its configuration parameters from xgrid.config.xml, which you may modify at will. The program writes a file called cookie so that it can reuse the same cookie between runs. The actual tasks run in "/tmp/filexxxx/".
When you open XGrid.app, you should see something along the lines of:
where the cluster in the picture is making use of three Linux machines, in addition to an OS X agent running as dccote. In your case, you will most likely have only one Linux agent.
The Shell Xgrid plug-in will simply call a shell command (regardless of where it is in the execution path on the agent). For instance, on a cluster with a single Linux agent, one obtains the following result with uname -a:
You may try the XFeed plug-in to send a range of arguments to a command, but because of the way Xgrid.app works, the command's path must be the same on the Linux agent and on the computer from which you run Xgrid.app (if it isn't, Xgrid.app will tell you that the command is invalid). Note the following major restriction: large outputs/files will not get sent back properly and the agent will hang (see bugs below).
Finally, you can make a Custom plug-in: that's where it becomes interesting. If you want to execute a Bourne shell script, then everything is fine (such scripts are portable across Unix platforms):
with Test.sh being:
(make sure Test.sh is executable with chmod +x Test.sh). You will get someFile1.txt copied back in the destination directory, as well as "Some text to stdout" in the stdout file.
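For reference, a script producing the results just described could be as small as the sketch below. It is a hypothetical stand-in for Test.sh rather than the original file: only the file name someFile1.txt and the stdout text come from the description above, and the text written to the file is made up.

```shell
#!/bin/sh
# Hypothetical stand-in for Test.sh (not the original script).

# Anything printed here ends up in the job's stdout file.
echo "Some text to stdout"

# A file created in the working directory is copied back with the results.
# The text written to it is an arbitrary placeholder.
echo "Some text in a file" > someFile1.txt
```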
The custom plug-in can also send a binary executable to the agent and execute it, after which it sends the results back. Since you can't know ahead of time which node of your cluster will run what, you must provide a binary for each type of agent you have (or you must compile it each time). Assuming you know that you have both Darwin on PowerPC and Linux on i686 agents, you can do the following:
where cal and ncal are the binaries for each platform and the shell script chooseAndRun.sh is:
The script will figure out what architecture it is running on and call the appropriate binary. This is the starting point for a multi-architecture calculation: one would provide all the binaries for all the possible agents and make a script similar to the one above to carry out the calculation.
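A dispatch script of this kind can be sketched as follows. This is an illustrative reconstruction, not the original chooseAndRun.sh: the function name choose_binary is invented, the mapping of cal to Darwin on PowerPC and ncal to Linux on i686 is my assumption, and the uname -m values matched are the ones these platforms usually report.

```shell
#!/bin/sh
# Illustrative sketch of a dispatch script (not the original chooseAndRun.sh).

# Print the binary to use for a given OS name ($1) and machine type ($2).
# Assumed mapping: cal = Darwin on PowerPC, ncal = Linux on x86.
choose_binary() {
    case "$1:$2" in
        Darwin:Power*|Darwin:ppc*) echo "cal" ;;
        Linux:i?86)                echo "ncal" ;;
        *)                         return 1 ;;
    esac
}

os=$(uname -s)      # e.g. Darwin, Linux
arch=$(uname -m)    # e.g. "Power Macintosh", i686

if binary=$(choose_binary "$os" "$arch") && [ -x "./$binary" ]; then
    # Replace the shell with the matching binary, forwarding all arguments.
    exec "./$binary" "$@"
else
    echo "No usable binary for $os on $arch" >&2
fi
```

Supporting one more type of agent then amounts to shipping one more binary and adding one more line to the case statement.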
Notes on usage (also known as bugs):
If you want to modify the code, then here are a few general warnings and comments:
Here is a graphical overview of the code:
Specific comments and pitfalls:
To do:
There are various layers in the Xgrid agent:
The XGrid protocol is actually quite simple to understand, since there are only three types of messages that can be passed: a request, the reply to that request, or a notification (to which there is no need to reply). Each message is identified by a CorrelationID, a name, a type (request/reply/notification) and a payload (whose content is specific to the message, as identified by its name). The XGrid protocol is also the application protocol (that's what the application understands) and has nothing to do with the actual communication protocol (TCP/IP, BEEP, etc.). Here is a graphical overview of the client registration process as well as the task submission process: View Registration image, View Task Submission image
Each XGrid message is sent as a BEEP MSG, and must be acknowledged with an empty RPY once it has been received completely. A MSG can be sent in smaller chunks (frames). The implementation of BEEP that is used in this xgridagent is Roadrunner, but there is also beepcore-c (which is not as flexible).
It is convenient but not necessary that both XGrid and BEEP rely on XML. Some BEEP information (in the initiation of the connection for instance) is encoded in XML. XGrid uses XML extensively, which makes it trivial to analyze.
Because two computers are talking to each other over the network, it is convenient to use threads for the BEEP library. This means that there is no "single point" in the code where one can follow the execution: it looks like several parts are running in parallel. To make sure that the various threads can talk to each other, one uses a simple locking mechanism (mutex) or a signalling system (semaphores).
There remain a few important bugs in the agent code, but they should be worked out quickly if others look at the code. For now, it can be used for simple examples involving agents of different architectures on the same cluster. Since the Xgrid application protocol is platform agnostic, this agent can be used to bring any Unix machine into an XGrid cluster. Since XGrid can be tunneled through SSH (see the XGrid documentation), it can be integrated into a secure research environment. The official reference for this site is: "XGrid agent for Unix architectures" available at http://www.novajo.ca/xgridagent/. Any question or comment can be sent to Daniel Côté (OCI, U of T) or to dccote@novajo.ca.
This work was done with help and encouragement from the XGrid team and Ernest Prabhakar at Apple.
Posted by dccote at June 21, 2004 11:51 PM