Tuning guide



Grid Engine is a full-featured, general-purpose Distributed Resource Management (DRM) tool. The scheduler component in Grid Engine supports a wide range of compute farm scenarios. To get the maximum performance from your compute environment, it can be worthwhile to review which features are enabled and which are really needed to solve your load management problem. Disabling some of these features can improve the throughput of your cluster:

  • scheduler monitoring

    Scheduler monitoring can help to find out why certain jobs are not dispatched. However, providing this information for all jobs at all times consumes resources and is usually not needed. To disable scheduler monitoring, set 'schedd_job_info' to 'false' in the scheduler configuration sched_conf(5).
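
    As a minimal sketch, the relevant line of the scheduler configuration would then read as shown here. Editing it with 'qconf -msconf' is an assumption not stated above; all other settings stay as they are.

    ```text
    schedd_job_info                   false
    ```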

  • finished jobs

    With array jobs, the finished job list in qmaster can become quite large. Switching it off saves memory and speeds up qstat commands, because qstat also fetches the finished job list. Set 'finished_jobs' to '0' in the global configuration sge_conf(5).
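
    A sketch of the corresponding entry in the global configuration follows. It assumes the configuration is edited with 'qconf -mconf global', which the text above does not prescribe.

    ```text
    finished_jobs                0
    ```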

  • job verification

    Forcing a validation at job submission time can be a valuable tool to prevent non-dispatchable jobs from remaining in pending state. Especially in heterogeneous environments, with a variety of different execution nodes and consumable resources and with every user having their own job profile, handling non-dispatchable jobs can be time consuming. In homogeneous environments with only a few different kinds of jobs, a general and expensive job validation can usually be omitted. Job verification is disabled by adding the qsub(1) option "-w n" to the cluster-wide default requests (see sge_request(5)).
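
    For illustration, the cluster-wide default request file could contain a line like the one below. The file location <sge_root>/<cell>/common/sge_request is assumed here from sge_request(5), and any options already present in the file would remain alongside it.

    ```text
    # cluster-wide default requests: skip job verification at submission time
    -w n
    ```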

  • load thresholds and suspend thresholds

    Load thresholds are needed if you deliberately oversubscribe your machines and need a mechanism to limit the oversubscription. Suspend thresholds are likewise used in connection with oversubscription. Load thresholds are also needed when the execution node is still open for interactive load that is not under the control of Grid Engine and you want to prevent the node from being overloaded. If a compute farm is simple enough that each CPU of a compute node is represented by exactly one queue slot and no interactive load is expected on these nodes, then 'load_thresholds' can be omitted. To disable both thresholds, set 'load_thresholds' to 'none' and 'suspend_thresholds' to 'none' (see queue_conf(5)).
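
    In a queue configuration, the two entries would then look roughly like this. The sketch assumes the queue is edited with 'qconf -mq <queue_name>', where <queue_name> is a placeholder for one of your own queues.

    ```text
    load_thresholds       none
    suspend_thresholds    none
    ```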

  • load adjustments

    Load adjustments are used to virtually increase the measured load after a job has been dispatched. This mechanism is helpful on oversubscribed machines to stay in line with the load thresholds. Load adjustments should be switched off if they are not needed, because they impose additional work on the scheduler for sorting hosts and verifying load thresholds. To disable load adjustments, set 'job_load_adjustments' to 'none' and 'load_adjustment_decay_time' to '0' in the scheduler configuration sched_conf(5).
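
    A sketch of the two scheduler configuration entries with load adjustments switched off, again assuming they are edited with 'qconf -msconf':

    ```text
    job_load_adjustments              none
    load_adjustment_decay_time        0
    ```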

  • scheduling-on-demand

    By default, Grid Engine starts scheduling runs at a fixed schedule interval (see 'schedule_interval' in sched_conf(5)). The advantage of fixed intervals is that they limit the CPU time consumed by the qmaster/scheduler. The disadvantage is that they throttle the scheduler artificially, which limits throughput. Many compute farms have machines dedicated specifically to the qmaster/scheduler, and in such setups there is no reason to throttle the scheduler.

    Scheduling-on-demand can be configured using the FLUSH_SUBMIT_SEC and FLUSH_FINISH_SEC settings in the 'schedd_params' section of the global cluster configuration sge_conf(5). If scheduling-on-demand is activated, the throughput of a compute farm is limited only by the power of the machine hosting the qmaster/scheduler.