We are seeking a highly skilled and experienced Senior Platform Observability Engineer to join our team. In this role, you will be responsible for ensuring the reliability, scalability, and efficiency of our core observability infrastructure that supports our engineering teams and customer-facing portal. Your work will include evolving these systems and helping to foster the adoption of observability best practices in the organization. You are excited by the prospect of managing more than 20 TB of telemetry data per day, originating from a fleet of 10,000+ nodes (including Linux hosts, Kubernetes clusters, and VMs).
Overview
The Senior Platform Observability Engineer is responsible for the reliability, scalability, and efficiency of the observability infrastructure that supports engineering teams and the customer-facing portal. The role focuses on evolving these systems and promoting observability best practices across the organization.
Key Responsibilities
Observability Platform Operations: Configure, operate, and enhance observability platforms and frameworks (ClickHouse, Thanos, Loki, Tempo, and the OpenTelemetry Collector with custom processors).
Drive adoption of observability practices with comprehensive monitoring, logging, and tracing across the organization.
Develop and maintain automated solutions for monitoring, alerting, and incident response.
System Optimization
Collaborate with engineering teams to understand needs and provide robust, scalable solutions utilizing the observability platform.
Optimize system performance and ensure high availability through proactive monitoring and maintenance.
Develop and implement strategies for cost optimization, capacity planning, and performance tuning.
Innovation and Improvement
Stay up-to-date with industry trends, tools, and technologies to drive continuous improvement.
Experiment with and implement new tools, especially around observability and telemetry, to enhance platform capabilities.
Evaluate and integrate OpenTelemetry Collector where beneficial to enhance telemetry data collection and analysis.
Essential/Required Skills
Observability Platforms: Proven track record in managing at least one of the following stacks: Thanos, Mimir, Cortex, Tempo, Loki, or ClickHouse, with the ability to configure, operate, and improve these systems.
Kubernetes: Deep understanding of Kubernetes architecture and hands-on experience in managing resources on clusters.
Helm: Experience in writing and maintaining Helm charts, and understanding third-party charts to deploy and manage Kubernetes resources efficiently.
GitOps: Experience in continuous delivery and GitOps practices (version control, CI/CD pipelines).
Docker: Expertise in containerization, orchestration, and optimization of Docker workloads.
Linux: Proficiency in Linux system administration, including scripting and automation.
Desirable Skills
Coding Experience: Golang or similar language.
Open Source: Contributor to open source projects written in Golang or similar language.
OpenTelemetry Collector: Knowledge of the OpenTelemetry Collector or contribution to the project.
Soft Skills
Quick Learner: Ability to quickly grasp new concepts and technologies, adapting to evolving needs.
Communication: Excellent communication skills for technical and non-technical stakeholders.
Customer Focus: Awareness of customer needs and the impact of platform operations.
Collaborative Mindset: Ability to work collaboratively in cross-functional teams and drive continuous improvement.
Education and Experience
Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent experience).
5+ years of experience in platform engineering, site reliability engineering, or a related role.
Demonstrated experience in managing large-scale infrastructures and observability platforms (such as Thanos, Mimir, Cortex, Tempo, Loki, ClickHouse).
What we offer
You’ll be among people who care passionately about keeping our customers safe. We’re dedicated to solving problems, whatever it takes. We think unconventionally to stay ahead, do the hard work to make things simple, and work collaboratively to build success. Open Systems has been recognized as an outstanding place to work. You’ll be surrounded by smart teams who enrich your experience and provide opportunities to develop your skills and advance your career.
We look forward to receiving your online application (please note that you have to compress your application into two attachments).
Come as you are! We search for amazing people of diverse backgrounds, experiences, abilities, and perspectives. Open Systems welcomes and encourages diversity in the workplace regardless of race, gender, religion, age, sexual orientation, disability, or veteran status.
Only direct applications will be considered.
About Open Systems
Backed by the Service Experience Promise, Open Systems connects and secures hybrid environments to meet business objectives in a cost-effective manner. We focus on a superior user experience to help organizations reduce risk, improve efficiency, and accelerate innovation. The Open Systems SASE Experience delivers ZTNA with a unified SASE platform, combining SD-WAN and Security Service Edge as a Service. We provide 24x7 operational management and engineering support with affordable and predictable costs.
Note: This description reflects the current responsibilities and requirements and is subject to change.