I have spent a couple of days helping someone to start running jobs on some of the large UK grid resources using grid methods. It is worth documenting the experiences, both what we achieved and what we didn't.
Objectives
Our plan was the following:
- Learn to use the Storage Resource Broker for data management;
- Learn to use the eMinerals metadata management tools;
- Learn to use the escience digital certificate for grid access;
- Learn to submit jobs to grid resources (NW-grid, National Grid Service, eMinerals minigrid) using the RMCS tool.
To add to this we had two benchmark targets:
- A simple test run using my ossia code; this is quick enough to be a good test case;
- A production job using the CASTEP electronic structure code.
Starting point
The requirement was to be able to run grid jobs from a Windows laptop.
The grid tools we were planning to use were:
- SRB client tools, including the Scommands, InQ and TobysSRB. InQ is an Explorer-like Windows interface to the SRB that has been in use for a long time. We had never looked at the Windows Scommands client tools.
- Metadata tools, including the STFC Metadata Manager (a web interface) and a web interface to the RCommands. These require the use of an escience digital certificate. This was the safest part of the task.
- RMCS client command-line tools and the RMCS Java GUI. RMCS requires that the user has an escience digital certificate. Here we were really exploring; the client command-line tools had been compiled for Windows but not tested in detail, and the RMCS GUI had not been used at all for production work as far as I could tell.
At the outset, we had a digital certificate exported from the browser that was used to request it, but it had not yet been prepared for use by Globus (the p12 certificate had to be split into the userkey.pem and usercert.pem key files).
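As a rough sketch of what "preparing the certificate for Globus" involves, the Python below builds the two `openssl pkcs12` calls that split a browser-exported bundle into `usercert.pem` and `userkey.pem`. The input name `cert.p12` and the function names are hypothetical, and the `~/.globus` location and file permissions follow the usual Globus convention rather than anything recorded from our session.

```python
import os
import subprocess


def split_commands(p12_file, globus_dir):
    """Build the two openssl calls that split a PKCS#12 bundle into PEM files."""
    cert_out = os.path.join(globus_dir, "usercert.pem")
    key_out = os.path.join(globus_dir, "userkey.pem")
    return [
        # Certificate only, no private key -> usercert.pem
        ["openssl", "pkcs12", "-in", p12_file, "-clcerts", "-nokeys", "-out", cert_out],
        # Private key only (openssl prompts for pass phrases) -> userkey.pem
        ["openssl", "pkcs12", "-in", p12_file, "-nocerts", "-out", key_out],
    ]


def split_certificate(p12_file, globus_dir):
    """Run the split, then set the permissions Globus conventionally expects."""
    os.makedirs(globus_dir, exist_ok=True)
    for cmd in split_commands(p12_file, globus_dir):
        subprocess.run(cmd, check=True)
    os.chmod(os.path.join(globus_dir, "usercert.pem"), 0o644)
    os.chmod(os.path.join(globus_dir, "userkey.pem"), 0o400)  # owner read-only


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # "cert.p12" is a placeholder for the file exported from the browser.
    for cmd in split_commands("cert.p12", os.path.expanduser("~/.globus")):
        print(" ".join(cmd))
```

Run as a script, this only prints the commands; calling `split_certificate` would actually run them and needs OpenSSL on the path.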
We had a working test case for ossia. However, we didn't have a compilation for CASTEP.
Day 1: a dream start
Day 1 started with coffee at 11 am, followed by a walk through the various tools. The harder work started around lunch time. What we achieved, in more or less the order I remember it, was:
- We established that we had an account on the SRB. We walked through the TobysSRB interface. We then downloaded the InQ explorer interface, and used this to upload some files (including setting up all the configuration details). Finally we installed the Scommand command-line binaries and tested them from the DOS shell. This included setting up the MdasEnv and MdasAuth files.
- We then sorted out the escience certificate. We installed OpenSSL on the Windows laptop, but in the end we saved time by preparing the certificate for Globus on my Mac laptop.
- We downloaded the Java myproxy upload tool for the digital certificate. This required us to download and install Java on the Windows laptop. We then created a proxy and uploaded it. At the same time, we downloaded and installed the certificate authority certificates.
- We had accounts created for the RCommands metadata tool and the RMCS grid submission tool. We then looked at the Metadata Manager and the web interface to the RCommands, and created some metadata for a data object, having first created test study and dataset levels. This part was relatively easy.
- Our target grid resource was the NW-Grid, and we compiled a serial version of the XML-ised version of CASTEP. The parallel version was left to Day 2.
- We then installed and launched the RMCS Java GUI, although at this stage we were not ready to use it. I attempted to show it in use on my Mac laptop; although it worked okay for the tasks I had used it for before (namely checking on jobs), we came unstuck using it for job submission.
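For readers who have not met the Scommands, the kind of test we ran from the DOS shell in the first step above can be sketched as a short command sequence. The Python below only assembles and prints that sequence; the file name `input.dat` and collection name `testrun` are placeholders, not what we actually used, and it assumes the MdasEnv and MdasAuth files are already in place so that `Sinit` can start a session.

```python
def srb_upload_commands(local_files, collection):
    """Assemble an Scommand session that uploads files and lists the result."""
    commands = [["Sinit"]]  # start an SRB session using the MdasEnv/MdasAuth settings
    # One Sput per local file, copying it into the target collection
    commands += [["Sput", name, collection] for name in local_files]
    commands.append(["Sls", collection])  # list the collection to confirm the upload
    commands.append(["Sexit"])  # close the session
    return commands


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    for cmd in srb_upload_commands(["input.dat"], "testrun"):
        print(" ".join(cmd))
```

Each printed line can be typed at the shell in turn; keeping the sequence as data makes it easy to reuse the same list when staging input files for a later job.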
Day 2: not quite as good as Day 1!
On the second day some things worked and some didn't. It got off to a bad start when the resource on which we were compiling CASTEP started on a slow death, preventing us from carrying out the compilation of the parallel version of CASTEP and also preventing us from downloading the previous day's compilation of the serial version. Thus we could only aim at getting the ossia job run.
Our first task was to install the RMCS command line tools on the Windows laptop. We had a couple of instances of successful installation, but the particular laptop offered some resistance. It turned out that the problems arose from a prior installation of Cygwin, but after coffee the problem was fixed.
We then set up for submitting jobs to the grid. We got the files into the right place on the SRB to start with, and edited the MCS file. We had a metadata dataset already created for the data object that would be produced. We needed to log into each resource to create the SRB and RCommands user configuration files. This raised an interesting issue. We aim to create tools that spare users from logging in to grid resources and from needing to install Globus, yet users need Globus in order to log in and do this set-up. In the event, one of our team installed these files on one of the NW-Grid resources, and we used the GSI-SSH Java GUI to log into the resource from which we were not debarred by the firewall. The GSI-SSH GUI is quite nice for this sort of thing. It isn't really up to long-term production use because it is noticeably slower than a real terminal, but it is easier than having to install Globus or have someone create an account for you on a computer that does have Globus installed. There is a serious issue in this.
Then we had a couple of set-backs in using NW-Grid. We first found that one grid resource was so busy that our job did nothing but queue. Then the other resource had our account set up without membership of a user group that was needed. Thus in the end we reverted to running our ossia test case on the eMinerals minigrid. All the setting up required was a bit of a faff, emphasising that the start-up time is not insignificant (although not necessarily any more so than that faced by people running on any cluster).
Finally we were able to run our ossia job using RMCS on the eMinerals minigrid. This job created output files on the SRB and a complete set of metadata. Success: by mid-afternoon on Day 2 we had a complete grid job instance from the laptop with everything working. What was pleasing was that this is perhaps our first instance of doing this all from a Windows computer, and some of the components had not been tested in production mode.
The creator of the RMCS Java GUI fixed the problems with job submission for us. However, we ran into a problem getting this working on the Windows laptop, a problem we have left for tomorrow morning.
Conclusions
Barring the fact that events conspired to prevent us compiling CASTEP, two days seems to be enough to go from no grid experience to running proper grid jobs, by which we mean jobs with proper data management and metadata collection, not just submitting a job to a cluster. There is a steep learning curve, but with preparation this can be flattened a bit. The biggest issue in my mind was the set-up required. It only took a couple of hours, I think, but it was a bit of a chore. More to the point, it reveals some usability issues. Surely users should not be left to do all this themselves if our aim is to have tools that do not encourage users to log in to the compute resources.
I have noted that this was the first time we have enabled someone to run grid jobs from a Windows laptop. This was quite an achievement in my view, because most grid users actually work in a Unix/Linux environment (including via Mac OS X). I know that people will submit jobs via portals, which are supposed to be OS-agnostic, but that isn't the point here. What we want are tools that users can run from the command line.