Site Reliability Engineer - Monitoring

Epidemic Sound

Note: The application period for this posting has passed.

Job description

We believe that bringing people together from different backgrounds, experiences and perspectives makes for a healthy workplace, a more successful business and a better world. We value diversity and encourage everyone to come and soundtrack the world with us.

At Epidemic Sound we are reinventing the music industry. Our carefully curated catalog, with over 40 000 tracks and 90 000 sound effects, is tailored for storytellers, streaming services, and in-store soundtracks. Countless clients around the world, from broadcasters and production companies to DSPs and YouTubers, rely on our tracks to help them tell their stories. Epidemic Sound’s music is heard in hundreds of millions of online videos daily, across millions of playlist streams, and in thousands of in-store locations. Headquartered in Stockholm, we’re spread across offices in New York City, Los Angeles, Seoul, Hamburg, and Amsterdam. We’re growing fast, have lots of fun, and are taking the music industry with us.

We are now looking for a Site Reliability Engineer with a strong focus on monitoring to join our dynamic SRE team. In this role, you will help drive best practices in monitoring and observability, help implement SLIs, SLOs, and error budgets, and help product teams measure reliability.

How you will make an impact

- Enhance our monitoring capabilities using tools like Thanos, Prometheus and OpenTelemetry.

- Implement SLIs, SLOs, and tracing to optimize system performance and reliability.

- Collaborate closely with product development teams to ensure observability, resilience, and performance needs are met when building new features and services.

- Coach engineering teams in improving their monitoring strategy and best practices.

- Embrace teamwork through practices like code reviews, pair programming, and mob programming.

- Engage in continuous learning through hack-days, courses, conferences, and tech-talks, and share your knowledge with your colleagues.

We believe that to succeed in this role, you need:

- Strong understanding of SRE as an engineering practice.

- Experience with monitoring tools such as Prometheus (and a Prometheus HA layer) and tracing, along with a deep understanding of monitoring best practices.

- Solid understanding of modern web architectures, system design, and software engineering principles, with the ability to apply them in designing scalable and robust solutions.

- Proficiency in implementing SLIs and SLOs in a production environment.

- Strong programming skills in at least one language (we use Go and Python).

- Experience mentoring and supporting colleagues and engineering teams.

- Demonstrated ability to troubleshoot distributed systems and drive operational excellence, including producing architectural diagrams and writing best practices, standards, and operating procedures.

- Experience working with Kubernetes.

It would also be music to our ears if you have experience with:

- Google Cloud Platform.

- Progressive delivery.

- Service mesh, ideally eBPF-based Cilium.

Curious to learn more about who we are and what we do? Check out our brand new "About us" page → https://www.epidemicsound.com/about-us/

We have lots of fun soundtracking the world and our annual Spring Bash (https://www.youtube.com/watch?v=NgnVp17IvAg) is an event that captures this perfectly. Take a look at our most recent one, a virtual celebration!

Application

Do you want to be a part of our fantastic team? Please apply, in English, by clicking the link below.

Summary

Workplace: Epidemic Sound
1 position
Permanent
Full-time
Fixed monthly, weekly, or hourly salary
Published: 12 April 2023
Apply by: 29 September 2023
