Job Description
As a Site Reliability Engineer / SRE you will be a senior member of a small team that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems deployed to AWS.
Observability is key - you'll use a range of tools to monitor systems, detect failure modes and proactively fix things on production systems before they go wrong. You'll have oversight of how systems relate to each other, limit time spent on operational tasks, automate wherever possible, carry out blameless post-mortems and proactively identify potential outages, continually iterating to make improvements.
As a senior member of the team you'll have a great deal of input into technical discussions and decisions and help to mentor more junior team members.
There's a fully remote interview and onboarding process, as well as the ability to work from home full time for the foreseeable future; when possible you'll join colleagues in the London office for 1-2 days a week.
Requirements:
*You have experience in a similar Site Reliability Engineer / SRE role
*You have experience with monitoring and tracing - e.g. Prometheus, Honeycomb, Grafana, ELK
*You have a good appreciation of IaC (Infrastructure as Code), CI/CD and modern tooling such as Terraform, Concourse, Jenkins
*You've got a good knowledge of AWS
*You're able to script (or, ideally, code) in Python, Go, Perl, Ruby, C, C++ or Java
*You're familiar with DevOps environments / Containerisation (Docker, Kubernetes)
*You have excellent communication skills and are collaborative and personable - happy to help take a lead on projects and provide mentoring
As a Site Reliability Engineer / SRE you will earn a competitive salary (up to £100k) plus benefits.
Apply now or call to find out more about this Site Reliability Engineer / SRE opportunity.
