1. Overview

This page describes the various pages available relating to Condor (local manual), including user information, the policies on how the pools are set up, and how to set up and manage a machine.

2. User information

The processor bank is running a condor pool, as per the local documentation.

3. System Policies

Below are the policies for

We may also have ad hoc policies for

Users may have access to

4. Main Processor Bank pool

This section has parts copied from the old 6.6 info, which are in italics like this paragraph. They still need to be checked and corrected. In particular, if we have any standard universe users, we may need to do the linker tweak.

4.1. What works?

4.2. Job submission (limited submit hosts: condor-submit{,0,1})

Because job state is held on the submit hosts, the general pool machines are not used to submit jobs; this way, if a pool machine fails, jobs are not left stranded. Instead a small number of "reliable" submit hosts are used. Connect to condor-submit; if there are problems, try condor-submit0 or condor-submit1.
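For illustration, a minimal vanilla-universe submit description file (all names here are made up) looks something like:

universe   = vanilla
executable = myprog
arguments  = input.dat
output     = myprog.out
error      = myprog.err
log        = myprog.log
queue

It is submitted on the submit host with "condor_submit job.sub" and can be watched with "condor_q".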

5. "Paddling Pool", pb000

This has been set up mainly to test the behaviour of condor with new FC kernel releases; it is not intended for real user use. Condor used to be sensitive to the nature of /proc/meminfo, and various FC kernel releases have caused condor to stop working. Symptom: jobs start OK but immediately go Idle and stay there. The Shadow log will say something about insufficient swap space (this is despite the configuration setting RESERVED_SWAP = 0, which is supposed to stop it trying to calculate the amount of swap available). Two different FC3 kernels have provoked this bug; the only solution is to change kernel. As this can only be tested once a kernel has been installed, the intention is that this machine gets new kernels first, condor is tried with a simple script (a sketch is given below), and if it runs the new kernel is OK to go onto the other pool machines.

To this end pb000 has a non-standard condor_config.local: it has DAEMON_LIST and CONDOR_HOST commented out (and an unequal split).
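A trivial check of the sort used (file names are illustrative): submit a short vanilla job that just reports the kernel version and confirm it actually runs rather than sitting Idle.

cat > kerneltest.sub <<EOF
universe   = vanilla
executable = /bin/uname
arguments  = -a
output     = kerneltest.out
log        = kerneltest.log
queue
EOF
condor_submit kerneltest.sub
condor_q        # the job should run to completion, not go Idle and stay there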

5.1. Machine tuning

The current state of the pool machines can be found using cl-condor-list, and new ones can be started using cl-condor-start, specifying at least the machine to start, probably the number of cores and the amount of memory, and possibly a XenE server in the appropriate pool.

6. History

Recent history of completed jobs can be found by running condor_history on the submit host. Older information can be found in /usr/groups/linux/condor-history/. The command condor-h.pl in that directory can summarize a history file. As of 2009/07 we have used over 27 CPU years.
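For example (the cluster number is illustrative, and condor-h.pl is assumed to take the history file as its argument):

condor_history             # recently completed jobs on this submit host
condor_history -l 12345    # full ClassAds for cluster 12345
/usr/groups/linux/condor-history/condor-h.pl <history file>   # summarize an archived file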

7. System management (Sys Admins only)

Below are notes on how to manage condor machines. They are intended for system administrators rather than condor users, but they may be of use to users wanting to set up their own mini pools. Please contact sys-admin if there are bits that User Admins can do for themselves.

7.1. Daemon Control

Changes to /opt/condor-6.8.3/LOCAL/condor_config.local can be notified to the daemons using general or subsystem-specific commands such as

/opt/condor-6.8.3/sbin/condor_reconfig
/opt/condor-6.8.3/sbin/condor_reconfig -subsystem startd

A daemon can be temporarily drained (put into the Retiring state, until the next restart) using commands such as

/opt/condor-6.8.3/sbin/condor_off -name pb000 -peaceful -subsystem startd

but it appears that this has to be done from the master system, otherwise it logs:

==> /opt/condor-6.8.3/LOCAL/log/MasterLog <==
3/29 08:34:23 DaemonCore: PERMISSION DENIED to unknown user from host <128.232.1.202:9678> for command 483 (DAEMON_OFF_PEACEFUL)
==> /opt/condor-6.8.3/LOCAL/log/StartLog <==
3/29 08:34:23 DaemonCore: PERMISSION DENIED to unknown user from host <128.232.1.202:9635> for command 60016 (DC_SET_PEACEFUL_SHUTDOWN)
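To bring a drained startd back into service, condor_on should undo this (again presumably run from the master system), e.g.

/opt/condor-6.8.3/sbin/condor_on -name pb000 -subsystem startd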

If scheduling appears odd (e.g. Idle machines but waiting jobs and "condor_q -g -an" doesn't show obvious requirements), the handle can be cranked (even by users) using

/opt/condor-6.8.3/sbin/condor_reschedule

7.2. Bugs and problems

7.2.1. machines not used due to no JAVA or shortage of disc space

Sometimes apparently suitable machines are not used when jobs are waiting. If they need the JAVA universe, check that JAVA is correctly configured. Another possible problem is shortage of disc space, which can probably be fixed by clearing out /tmp, expunging cached RPMs using "cl-asuser yum clean all", or by deleting old kernels.
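A rough cleanup sequence on an affected machine (what is safe to remove from /tmp, and which old kernels to delete, remains a judgement call):

df -h                      # see which filesystem is actually short of space
cl-asuser yum clean all    # expunge cached RPMs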

7.2.2. short jobs kill the server

Having (lots of?) small (less than 5 min) jobs causes machines to be Allocated but Idle, and they may get stuck in this state. Encourage users to make jobs last at least 5 mins.

7.2.3. schedd doesn't start for a while on reboot: system lock file

It appears that the SCHEDD.lock lock file tends to be left behind when the machine stops, so on reboot the schedd process does not start until a timeout has occurred. As such it may be necessary to delete /usr/groups/linux/condor-queue/$HOSTNAME/SCHEDD.lock and then run "cl-asuser service condor start".
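Roughly (assuming the lock file really is stale and no schedd is still running):

cl-asuser rm /usr/groups/linux/condor-queue/$HOSTNAME/SCHEDD.lock
cl-asuser service condor start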

7.2.4. schedd doesn't register after crash: user logfile locked

After a "short job" meltdown, schedd was running (as seen by "cl-asuser service condor status"), but was not registered with the negotiator (e.g. not seen in condor_q). The schedd was reported as running as the user who caused the meltdown. strace revealed that it was stuck trying to lock a file. lsof revealed the the file was a logfile for one of the queued jobs. By temporarily renaming the file and re-running "cl-asuser service condor start", it started OK.

8. System setup (Sys Admins only)

Below are notes on how to set up condor machines. They are intended for system administrators rather than condor users, but they may be of use to users wanting to set up their own mini pools. Please contact sys-admin if there are bits that User Admins can do for themselves.

8.1. Port Usage

The one fixed port is that of the condor_collector, which defaults to 9618. All machines contact $COLLECTOR_HOST on port 9618 (unless COLLECTOR_HOST has an explicit ":$port") to locate all other services for the pool. The collector is available over UDP or TCP, but the default is to use only the former.

All other ports are selected by the members of the pool from the range [service.][IN]LOWPORT - [service.][IN]HIGHPORT if set, and reported to the collector. The uses (one port number, on tcp and udp unless otherwise indicated) are:

Some ports may be in CLOSE_WAIT or suchlike states, so to avoid transient problems (typically requiring a retry 300s later) make the range larger than strictly needed. The docs say that an execute service needs "5 + 5*CPUs" ports, so 20 should be enough.
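For example, an execute machine here opens a 20-port range, matching the iptables rule shown further down:

LOWPORT  = 9600
HIGHPORT = 9619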

8.2. iptables hardening

As the daemons run as root (albeit with real userid changed to condor), the machines are hardened using iptables. As well as the additions to allow incoming condor traffic described below, the input rules are amended to allow ssh (note that users cannot ssh in to most pool machines) only from VLAN 100 (128.232.0.0/20). Output is also restricted, allowing DNS, NFS, NTP, LDAP, SMTP, condor and HTTP to the CS RPM repository.
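For illustration, the ssh restriction is a rule of roughly this form (the chain name matches the one used for the condor rules below):

-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 22 -s 128.232.0.0/20 -j ACCEPT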

8.3. Client setup (execute)

As the condor RPM looks for the user condor in /etc/passwd to decide how to configure the system (it looks for "condor:", so it is not actually necessary to add a valid condor entry), it has to be HACKed in before the RPM is installed.

grep condor /etc/passwd > /dev/null ||
  sudo bash -c "echo condor:x:78:78:Condor:/var/condor:/bin/bash>>/etc/passwd"
grep ^@R-condor /etc/user-config/bundles > /dev/null ||
  echo @R-condor >> /etc/user-config/bundles
cl-update-system

To allow other machines to contact the client, ports need to be opened in the iptables configuration. Add to /etc/sysconfig/iptables the ports selected in condor_config.local:

#1 condor execute server
-A RH-Firewall-1-INPUT -p udp -m udp --dport 9600:9619 -s 128.232.0.0/17 -j ACCEPT

and to open them without restarting iptables, use commands such as

cl-asuser iptables -I RH-Firewall-1-INPUT 11 -p udp -m udp --dport 9600:9619 -s 128.232.0.0/17 -j ACCEPT

Check that the ports match using

cl-asuser service condor status
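The listening sockets can also be inspected directly to confirm they fall within the opened range, e.g.

cl-asuser netstat -lnp | grep condor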

Dual CPU machines can be set up to have an asymmetric split by setting NUM_VIRTUAL_MACHINES_TYPE_$N to 1 for $N = 1, 2, and calling a wrapper to set the soft memory limits:

NUM_VIRTUAL_MACHINES_TYPE_1     = 1
NUM_VIRTUAL_MACHINES_TYPE_2     = 1
STARTER                         = /usr/sbin/condor_starter-assym
STARTER_LOCAL                   = /usr/sbin/condor_starter-assym

8.3.1. Server setup (manager and submit)

Other condor systems can be set up as per a client, but need subsequent configuration. The DAEMON_LIST is likely to be changed, and HIGHPORT raised to allow for the extra accesses. condor_config.local should have the following lines appended:

HIGHPORT = 9699
DAEMON_LIST = MASTER, SCHEDD

8.3.2. Server Setup (manager)

The documentation says that a high availability manager setup can be arranged, but current attempts to use HAD fail.

The manager server is a xen domU, condor68-negotiator-0, which has a CNAME condor68-collector-0, as these services are nearly always co-located. Clients actually use the name condor68-negotiator, which is meant to be a multi-A RR, but the HAD service fails (exception), so it currently lists only -0. Thus, to move the service to another machine: shut down the main server; change another box to have "COLLECTOR, NEGOTIATOR" on its DAEMON_LIST line; add a pseudo IP address ("ifconfig eth0:241 128.232.9.241 netmask 255.255.255.255"); add "NETWORK_INTERFACE = 128.232.9.241" so that it does not use just the primary IP address ("BIND_ALL_INTERFACES = TRUE" does not work for pseudo IP addresses, as it sends packets with the primary IP address); then stop and start the service (a command sketch is given below). The clients should notice within 300 seconds.
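A rough command sketch of that procedure (assuming the standby box already has condor installed and that MASTER stays on its DAEMON_LIST):

# on the standby box, as root / via cl-asuser:
ifconfig eth0:241 128.232.9.241 netmask 255.255.255.255

# in /opt/condor-6.8.3/LOCAL/condor_config.local on that box:
#   DAEMON_LIST       = MASTER, COLLECTOR, NEGOTIATOR
#   NETWORK_INTERFACE = 128.232.9.241

cl-asuser service condor stop
cl-asuser service condor start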

8.3.3. Server Setup (submit)

There can be as many submit servers as needed, but they need to be reliable (as they are needed throughout the execution of a job), need extra ports, and may need disc space for files.

As the pool is unlikely to be much use without elmer, putting the submit queue on elmer does not increase dependencies significantly, but it does provide snapshots and allows another machine to explicitly take over a queue if the normal server is not available. (Unfortunately there is no way for the submit servers to be told which is the "primary" server, i.e. the one on which users actually submit jobs.) We may use the high availability submit server at some point; as all it does is take a lock before starting the SCHEDD, it may as well be used by default. However, there seem to be problems with the lock file SCHEDD.lock: condor_preen deletes it by default, and the lock appears to mean "file exists" rather than "file has a lock", so if a server dies without removing it, the schedd will not start on reboot until preen has deleted it. As such, do not include SCHEDD on DAEMON_LIST, but add the lines:

## cl.cam.ac.uk: may at some point move to HA submit server. Until then, this is harmless
MASTER_HA_LIST = SCHEDD
SPOOL = /usr/groups/linux/condor-queue/$(HOSTNAME)
HA_LOCK_URL = file:/usr/groups/linux/condor-queue/$(HOSTNAME)

8.3.4. System cloning

Since the pool moved to XenE, cloning is very simple. Any stopped pool machine can be (fast) cloned and have its MAC address set to an unused DHCP-registered one.

8.4. Bugs and problems

8.4.1. CentOS 5.0 using 6.8.3 fails to connect without NETWORK_INTERFACE

When running 6.8.3 under CentOS 5.0, attempts to connect to the negotiator fail, as the local end is bound to 127.0.0.1. Adding NETWORK_INTERFACE to condor_config.local makes it use the right interface.
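That is, a line of this form in condor_config.local (substitute the machine's own address):

NETWORK_INTERFACE = <this machine's primary IP address>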
