Our client, a leading proprietary trading firm specialising in both systematic and discretionary strategies, is seeking a Site Reliability Engineer to join their Zurich office. This is a unique opportunity to evolve and enhance a highly sophisticated production trading environment, ensuring exceptional uptime and performance. The role focuses on delivering code-driven solutions while partnering closely with developers and traders to strengthen reliability, observability, and overall operational maturity within a low-latency, high-performance ecosystem.
The ideal candidate will bring deep experience supporting highly available, performance-critical, latency-sensitive systems, alongside a strong understanding of Linux internals and networking. A solid background in reliability engineering is essential, with a clear automation-first mindset and hands-on experience with containerisation technologies.
Key Responsibilities:
* Reliability & Production Ownership: Own availability, stability, and performance of Linux-based trading systems (Red Hat, Rocky, Ubuntu).
* Incident Response: Lead incident management, on-call response, and blameless post-mortems, driving automation to prevent recurrence.
* Operational Processes: Maintain runbooks, documentation, and standards for consistent production support.
* Production Readiness: Partner with developers and traders to ensure reliable, high-performance system design and deployment.
* Linux Systems & Performance: Perform low-level tuning (CPU, IRQ, memory, networking) for latency-sensitive workloads.
* Performance Diagnostics: Troubleshoot using perf, ftrace, tcpdump, and eBPF.
* Automation & Infrastructure: Deliver infrastructure as code with Ansible, Terraform, Python, and shell scripting.
Required Qualifications:
* Experience in Site Reliability Engineering, Linux engineering, DevOps, or infrastructure-focused roles.
* Production Systems: Proven experience supporting highly available, performance-sensitive production environments.
* Linux Expertise: Deep knowledge of Linux internals, including scheduling, memory management, interrupts, filesystems, and storage.
* Networking: Strong understanding of TCP/IP, UDP, multicast, and distributed systems networking.
* Automation & Tooling: Proficiency with Ansible, Terraform, Python, shell scripting, YAML/JSON, and Git-based workflows.
* Containers & Observability: Experience with Docker (or similar) and familiarity with observability tools such as Prometheus, Grafana, ELK, or equivalent.