> Audience: all HPC users
> How we want them to feel: curious, empowered
> What we want them to do: try the tool, and use it to make more appropriate resource requests!
# [Email draft] seff - a new HPC tool to understand job efficiency
As [King's Climate and Sustainability Month](https://www.kcl.ac.uk/events/series/climate-sustainability-month-2026) draws to a close, we're pleased to announce the launch of a new tool you can use to check the efficiency of your HPC jobs and make your research more sustainable. `seff` (pronounced 'es-eff') uses Slurm's data on the resources a job requested and the resources it actually used to report how efficient the job was.
**Why does efficiency matter?**
Over the last six months of 2025, 3.7 million jobs were run on CREATE HPC. Of the 2.8 million jobs that completed successfully or were cancelled due to using more resources than they requested, the average CPU efficiency was 60% - 1.1 million jobs used less than 50% of the CPU time they requested!
Of the 1.2 million jobs that requested more than one CPU, the average CPU efficiency was only 32%, with 75% of jobs using less than 50% of the requested CPU time. The average memory efficiency over the same period was 19%.
[Plot: CPU efficiency distribution, split by number of CPUs requested](https://github.com/nadinespy/hpc-data-analysis/blob/main/results/plots/2026-02_sustainability/cpu_efficiency_density_split.png)
[Plot: memory efficiency distribution](https://github.com/nadinespy/hpc-data-analysis/blob/main/results/plots/2026-02_sustainability/mem_efficiency_density.png)
It's clear from this data that many jobs are requesting more resources than they actually need. This can lead to resources sitting idle rather than being effectively used, and jobs waiting longer for their requested resources to become available, rather than running quickly!
`seff` can help you identify the correct resource requests for your jobs - which in turn should lead to faster results and fairer sharing of the cluster resources. In addition, efficient use of hardware is one of the key principles of sustainable computing. `seff` will help e-Research build on our [Green DiSC](https://www.software.ac.uk/GreenDiSC) Bronze award to make computational research at King's more sustainable. You can find out more about [green computing](https://docs.er.kcl.ac.uk/green-computing/) and the work we're doing in this area on our docs pages.
**How to use `seff`**
`seff` is installed cluster-wide and can be used by passing your job ID to the `seff` command.
Example:
```
$ seff 1234567
Job ID: 1234567
User: k1234567
State: COMPLETED
Elapsed time: 10:01:19
GPU(s) allocated: none
CPU Efficiency: 95.4% of 6 core(s) used
Memory Efficiency: 73.3% of 3000 MB used
```
This example job requested 6 CPUs, and used 95.4% of the total possible ~60 hours of CPU time. It used 73.3% of the requested 3 GB of memory.
**Next steps**
Measuring job resource efficiency is just the first step - improving it is not always so simple! We've put together some tips on key things to consider, which you can read [here](seff-details-and-tips) on the e-Research forum.
We hope that you'll find `seff` useful! It's been tested on a range of different jobs, but there's always the possibility of bugs and edge cases, so please leave bug reports, feedback, and questions in the forum thread [here](seff-details-and-tips).
Best wishes,
Max Wyatt, Nadine Spychala and Liz Ing-Simmons
on behalf of e-Research
------
# [Forum post draft] Measuring and improving the efficiency of your HPC jobs
We're pleased to announce the launch of a new tool you can use to check the efficiency of your CREATE HPC jobs. `seff` (pronounced 'es-eff') uses Slurm's data on the resources a job requested and the resources it actually used to report how efficient the job was.
We've carried out some data analysis (detailed results to be released soon) which reveals that many HPC jobs are not making the most effective use of the resources allocated to them. This means that resources may sit idle and jobs may wait in the queue for longer than they need to. The aim of releasing `seff` is to help all CREATE HPC users identify the resources their jobs actually need, leading to more efficient cluster utilisation and faster results for everyone.
## What is `seff`?
`seff` on CREATE HPC is a Python reimplementation of a [plugin](https://support.schedmd.com/show_bug.cgi?id=1611) for the Slurm scheduler that was originally written in Perl in 2015. Max Wyatt, in the Research Operations team, has converted the script to Python and added some new features for GPU efficiency measurements.
`seff` can:
* Report CPU, memory, and GPU efficiency for a completed job
* Report GPU efficiency for a running job, if invoked on the node that job is running on
## Usage
`seff` is installed cluster-wide and can be used by passing your job ID to the `seff` command.
Example:
```
$ seff 1234567
Job ID: 1234567
User: k1234567
State: COMPLETED
Elapsed time: 01:01:57
GPU(s) allocated: 2
GPU Utilization: avg 76.5%, max 76.5%
GPU Memory Usage: avg 10834 MB, max 10834 MB
CPU Efficiency: 97.7% of 24 core(s) used
Memory Efficiency: 3.2% of 20000 MB used
```
This example job requested 24 CPUs and used 97.7% of the total possible ~25 hours of CPU time. Average utilisation across the 2 requested GPUs was 76.5%, with ~10 GB of GPU memory in use. The job also used just 3.2% of the requested 20 GB of RAM.
## How does it work?
`seff` internally uses the Slurm `sacct` command to retrieve information about finished jobs. This information is used to calculate efficiency values.
The number of requested CPUs is identified from the `AllocCPUs` field, and the `Elapsed` field gives the length of the job. This is used to calculate the total amount of CPU time that could have been used by the job (*n* CPUs x elapsed time). The actual CPU time used is taken from the `TotalCPU` field. For jobs that contain multiple steps and where an overall `TotalCPU` value is not available, the CPU time used is summed across all steps. The CPU efficiency is calculated as `100 * CPU time used / total possible CPU time`.
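The arithmetic can be sketched in Python (a minimal illustration, not seff's actual code; the parser covers Slurm's `D-HH:MM:SS`-style durations, and the example values are hypothetical):

```python
def parse_slurm_time(s):
    """Parse a Slurm duration ('1-02:03:04', '02:03:04', or '03:04.500') into seconds."""
    days = 0
    if "-" in s:
        d, s = s.split("-")
        days = int(d)
    parts = [float(p) for p in s.split(":")]
    while len(parts) < 3:  # pad missing leading fields (hours, minutes)
        parts.insert(0, 0)
    h, m, sec = parts
    return days * 86400 + h * 3600 + m * 60 + sec

def cpu_efficiency(alloc_cpus, elapsed, total_cpu):
    """100 * CPU time used / (allocated cores * elapsed wall time)."""
    possible = alloc_cpus * parse_slurm_time(elapsed)
    return 100 * parse_slurm_time(total_cpu) / possible

# A hypothetical job: 6 cores for 10:01:19, with ~2.4 days of CPU time actually used
print(round(cpu_efficiency(6, "10:01:19", "2-09:21:56"), 1))  # → 95.4
```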
For memory efficiency calculations, the requested memory is identified from the `ReqMem` field and the maximum memory used by the job is taken from the `MaxRSS` field. For jobs with multiple steps and where an overall value is not available, the maximum of the `MaxRSS` of all steps is taken.
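The memory side can be sketched the same way (illustrative only; the unit suffixes on `ReqMem`/`MaxRSS` vary across Slurm versions, so treat the parsing as an assumption to check against your own `sacct` output):

```python
def parse_mem_mb(s):
    """Convert a Slurm memory string like '20000M' or '4G' into MB.
    Older Slurm versions append 'n' (per node) or 'c' (per core); strip those."""
    s = s.rstrip("nc")
    units = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}
    return float(s[:-1]) * units[s[-1]]

def mem_efficiency(max_rss, req_mem):
    """100 * peak memory used / memory requested."""
    return 100 * parse_mem_mb(max_rss) / parse_mem_mb(req_mem)

# A hypothetical job: 2199 MB peak usage against a 3000 MB request
print(round(mem_efficiency("2199M", "3000M"), 1))  # → 73.3
```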
Retrieving data on GPU utilisation is a bit more complicated, as this is stored in the `TRESUsageInAve` and `TRESUsageInMax` fields, which store multiple pieces of information about different "Trackable Resources" (TRES). The average and maximum GPU utilisation and GPU memory usage are derived from these fields.
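The parsing step itself is simple key=value splitting; a sketch below (the sample string and the exact key names, such as `gres/gpuutil` and `gres/gpumem`, are illustrative assumptions - they depend on your Slurm version and gres configuration):

```python
def parse_tres(tres):
    """Split a TRESUsageIn* string of comma-separated key=value pairs into a dict."""
    return dict(item.split("=", 1) for item in tres.split(",") if "=" in item)

# Illustrative sacct output - real key names and units vary by Slurm version
sample = "cpu=01:01:57,mem=650M,gres/gpumem=10834M,gres/gpuutil=76"
fields = parse_tres(sample)
print(fields["gres/gpuutil"], fields["gres/gpumem"])  # → 76 10834M
```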
For jobs that are currently running, no information is available via `sacct`. However, for GPU jobs, the `nvidia-smi` command can be used to retrieve information about running jobs. `seff` uses this command to find out and report the average GPU and memory utilisation. Since `nvidia-smi` can only access information about the GPUs on the node it's run on, this requires that the `seff` command is run on the node that your job is using. You can use `ssh` to connect to the node your GPU job is running on and run `seff` there.
## How can I improve my job efficiency?
### CPU efficiency
By default, each job is allocated 1 CPU. You can request additional CPUs by increasing `--ntasks` or `--cpus-per-task` (for most jobs, `--cpus-per-task` is most relevant). However, it's important to ensure that your job is making effective use of the additional CPUs. This may require changes to your code.
#### Ensure your code makes use of extra resources
```
$ seff 1234569
Job ID: 1234569
User: k1234567
State: COMPLETED
Elapsed time: 3:22:03
GPU(s) allocated: none
CPU Efficiency: 19.4% of 5 core(s) used
Memory Efficiency: 65.8% of 1000 MB used
```
In the example job above, 5 CPU cores are requested, but only about 20% of the total allocated CPU time is used - as if only one core is actually doing work. If you see a case like this, an important first check is to make sure you've passed any flags your code needs in order to make use of the additional CPUs.
For a command line program, this might look like adding `-p 5` or `-t 5` - check the program docs to find out how to specify this for your program. For Python code, you might need to use the [`multiprocessing` module](https://docs.python.org/3/library/multiprocessing.html). For R scripts, you might need to load the [`future` package](https://future.futureverse.org/) and specify a `plan()`, or [adapt your code](https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html) to use functions like `mclapply`.
#### Scaling up
```
$ seff 1234570
Job ID: 1234570
User: k1234567
State: COMPLETED
Elapsed time: 6:34:03
GPU(s) allocated: none
CPU Efficiency: 95.4% of 1 core(s) used
Memory Efficiency: 65.8% of 1000 MB used
```
This example job requests one CPU core and uses it very efficiently.
Adapting your code to use more CPUs should make the same job faster:
```
$ seff 1234571
Job ID: 1234571
User: k1234567
State: COMPLETED
Elapsed time: 1:14:03
GPU(s) allocated: none
CPU Efficiency: 85.3% of 8 core(s) used
Memory Efficiency: 65.8% of 1000 MB used
```
However, as shown in this example, this scaling is often not linear. This job uses 8x the CPUs, but is not 8x as fast - and has slightly lower CPU efficiency. This is because there are always some parts of a computing task that aren't parallelisable, and don't benefit from the increased number of CPUs. These might include loading packages, initial set up steps, and reading or writing data. If you want to learn more about code scalability and the factors that affect it, [this training material](https://srsg-workshops.github.io/HPC-Skills/dirac-code-scaling-scalability-profiling) is a good place to start.
For this task, it's likely that further increasing the number of CPUs would give a further speed-up - but at lower CPU efficiency. It's important to find a balance between these factors. For this job, 8 CPUs might be the right number - 85% efficiency is still pretty good, and the job takes only a little over an hour to run.
When you're scaling up your own code, experiment with the number of CPUs you request and see if you can find the right balance between speed and efficiency.
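The diminishing returns described above are captured by Amdahl's law: if a fraction p of the work can run in parallel, the best possible speed-up on n cores is 1 / ((1 - p) + p / n). A quick illustration:

```python
def amdahl_speedup(p, n):
    """Best-case speed-up on n cores when a fraction p of the work parallelises."""
    return 1 / ((1 - p) + p / n)

# Even with 90% of the work parallelisable, 8 cores give well under 8x
print(round(amdahl_speedup(0.9, 8), 2))  # → 4.71
```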
### Memory efficiency
The default memory allocation is 1 GB, which is sufficient for many jobs. However, large data analyses and other tasks may require more memory than this.
The best way to estimate how much memory a job will need is to look at how much memory previous similar jobs used. However, memory requirements can vary depending on the input data or parameters specified for the job. Therefore, it's good to add a buffer to allow for this variation.
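One simple rule of thumb (our suggestion, not something `seff` computes) is to take the peak usage from a previous similar run, add ~20%, and round up to a whole number of GB:

```python
import math

def mem_request_mb(peak_mb, buffer=0.2):
    """Suggest a memory request: observed peak plus a safety buffer,
    rounded up to a whole number of GB. The 20% buffer is just a convention."""
    return math.ceil(peak_mb * (1 + buffer) / 1024) * 1024

print(mem_request_mb(850))    # → 1024 (the 1 GB default is fine)
print(mem_request_mb(17140))  # → 21504 (request ~21 GB)
```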
```
$ seff 1234567
Job ID: 1234567
User: k1234567
State: COMPLETED
Elapsed time: 10:43:19
GPU(s) allocated: none
CPU Efficiency: 95.4% of 6 core(s) used
Memory Efficiency: 1.3% of 6000 MB used
```
This example job requested 6 GB (6000 MB) of memory, but only used 1.3% of this. For future jobs running similar analyses, the user could safely leave the memory request at the default 1 GB.
```
$ seff 1234568
Job ID: 1234568
User: k1234567
State: COMPLETED
Elapsed time: 1:22:03
GPU(s) allocated: none
CPU Efficiency: 91.2% of 1 core(s) used
Memory Efficiency: 85.7% of 20000 MB used
```
This example job used 85.7% of the requested 20 GB (20000 MB) of memory. Similar future jobs might use a little more or less, but 20 GB is a sensible value to start with. If a future similar job failed due to an out-of-memory error, we'd suggest retrying it with 25 GB or 30 GB of memory.
It can be frustrating to have a job fail due to using more memory than allocated, especially if your job takes a long time to run and reaches peak memory usage towards the end of its runtime. In those cases it's sensible to build in checkpoints for your job (to save intermediate progress), or if possible rework your code to test peak memory usage early in the job. If that isn't possible, you can request a large amount of extra memory - for example if your job fails with 20 GB, you could request 100 GB. However, be aware that larger memory requests may cause your job to wait longer in the queue. If requesting a large amount of extra memory, it's also important to check the actual memory usage and adjust to an appropriate value for future jobs, in order to maintain good utilisation rates of cluster resources.
### Other considerations
#### Identifying the most CPU or memory intensive steps
As mentioned above, not all computational tasks can benefit from parallelisation. In complex pipelines, it may be that only some steps in the pipeline can effectively use all the requested resources, leading to a low overall efficiency. If this is the case, you might be able to make your work more efficient by breaking the pipeline into chunks that request different amounts of resources.
You can use profiling tools to identify the parts of your code that take the most time or memory, and prioritise optimising those parts or splitting them into independent modules that can be scaled up. The [Reasonable Performance Computing](https://sig-rpc.github.io/) website has resources on profiling and optimisation for different programming languages.
#### Walltime
While not directly related to job efficiency, another key resource to consider is the amount of walltime you request for your job (`-t` or `--time`). The default time limit for jobs on CREATE HPC is 1 day. Shorter jobs are easier for Slurm to schedule, so if you know your job will only take one or two hours, you should reduce the time limit you request. This should help your jobs be scheduled faster and contribute to fairer resource allocation. However, jobs are killed when they reach the end of their allocated time, so it's important to include some buffer time in your request!
## Getting help
There are many approaches and factors to think about when making your code more efficient. These tips cover some of the most common scenarios and easiest things to try. If you're struggling to apply these tips, your situation is more complex, or you want to really optimise your efficiency, here are some resources for further help:
* Ask a question on this forum - maybe someone else uses the same software tool as you and can share tips and tricks
* Come to one of the [e-Research training workshops](https://docs.er.kcl.ac.uk/training/) on Profiling and Optimisation for Python
* Sign up for a [Research Software Code Clinic](https://forum.er.kcl.ac.uk/t/research-software-code-clinics/3255) - 30 minute sessions with one of our Research Software Engineers, providing guidance on optimisation and more
* Contact us via support@er.kcl.ac.uk for more complex queries or to investigate a longer-term collaboration
We hope that you will find `seff` a useful tool that will help you use the cluster more efficiently, get results faster, and make your research more sustainable. Questions, bug reports, and feedback are welcome - please post in this thread!