Job Description
At CERN, the European Organization for Nuclear Research, physicists and engineers are probing the fundamental structure of the universe. Using the world's largest and most complex scientific instruments, they study the basic constituents of matter: fundamental particles that are made to collide at close to the speed of light. The process gives physicists clues about how particles interact and provides insights into the fundamental laws of nature.
IT’s Compute and Devices Group is looking for a computing engineer to take over responsibility for the onsite High-Performance Computing (HPC) SLURM clusters, which have established use cases across the Organisation, including in the ATS sector and the Theory department.
This position will be in the Compute and Configuration section, which is responsible for large-scale compute, from the HPC farm through High Throughput Computing and Volunteer Computing to the Configuration & Secret Management services.
Functions
* Ensure service delivery for the SLURM HPC clusters to the user community as the primary service manager and technical lead of the service.
* Serve as the escalation point for user community support requests, helping to gather requirements and establish usage best practices.
* Configure, upgrade and monitor the clusters, and provide ongoing maintenance.
* Ensure high utilisation of the resources by interfacing with the HTCondor-brokered backfill of resources.
* Define procedures and best practices for the wider team in order to promote operational support coverage.
* Look for synergies with other team members and teams for management of compute resources, or access to high‑performance compute resources for the community.
Qualifications
Master’s degree or equivalent relevant experience in the field of Computer Science or a related field.
Experience
* Experience supporting HPC or batch systems, ideally SLURM; knowledge of HTCondor or similar would be an advantage.
* Demonstrated knowledge of configuration management systems such as Puppet, Chef, Ansible or Terraform, and monitoring of distributed systems.
* Knowledge of system administration, in particular in Linux environments.
* Dealing with user relations, user support and user requirements definition.
* Programming techniques and languages, in particular Python or Go.
Technical competencies
* Knowledge of operating systems (Linux).
* Knowledge of system configuration tools (Puppet, Ansible, Terraform).
* Architecture and design of ICT systems.
* Identification and selection of relevant emerging ICT technologies.
* Knowledge and application of software life‑cycle tools and procedures.
Behavioural competencies