About the company
At Jobtome, we are building a modern, cloud-native recruitment and marketing platform used at scale across multiple countries and brands.
Our systems power high-traffic job distribution, integrations with external partners, and real-time data pipelines, with a strong focus on reliability, observability, and automation.
Engineering is a core function of the company: we value ownership, pragmatic decision-making, and long-term technical excellence over short-term fixes.
The role
As a Senior Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our production systems.
You will work closely with Backend, Frontend, and Product teams to:
* design resilient architectures
* define reliability standards
* improve observability and incident response
* reduce operational toil through automation
This is not a pure ops role: you will contribute to codebases, collaborate on system design, and help evolve our engineering culture toward SRE best practices.
What you will do
* Design, implement, and maintain reliable and scalable cloud infrastructure
* Define and evolve SLIs, SLOs, and error budgets
* Improve monitoring, alerting, and observability across services
* Lead and participate in incident response, post-mortems, and root-cause analysis
* Automate repetitive operational tasks to reduce toil
* Collaborate with Backend engineers on service design, scalability, and failure modes
* Improve CI/CD pipelines, deployment strategies, and release safety
* Contribute to infrastructure as code and platform tooling
* Act as a reliability advocate across the engineering organization
Tech stack
* Cloud: Google Cloud Platform (preferred), AWS
* Containers & orchestration: Docker, Kubernetes (GKE)
* Infrastructure as Code: Terraform
* CI/CD: GitLab CI/CD
* Observability: Cloud Monitoring, Cloud Logging, Prometheus, Grafana
* Languages: Go, Python, Bash
* Networking & security: IAM, VPCs, service accounts, secrets management
What we expect from a senior SRE
* Strong experience running production systems at scale
* Solid understanding of distributed systems and failure modes
* Proven experience with SLO-driven reliability
* Strong coding skills
* Cloud infrastructure automation experience
* Ability to debug complex cross-system issues
* Ownership mindset and strong communication skills
* Pragmatic approach to trade-offs between reliability, speed, and cost
Working model
* Flexible working hours
* Remote-friendly setup
* Small autonomous teams
* Direct collaboration with product and leadership