Site Reliability Engineer - Observability

Midasplayer AB

STOCKHOLM

Note: The application period for this posting has passed.

Job Description
At King, millions of players connect to our games every day and expect to continue playing from where they left off. All this user and game progression data is stored in our infrastructure. We are looking for someone eager to help us engineer and manage the monitoring and observability environments at the heart of this ecosystem.
We believe that you share our passion for learning new things, coding (primarily in Python), quality, automation, continuous improvement, and actively building and upholding a great culture. Above all, we would like to see that you have a genuine interest in high-performance observability.
Because of the vacation break, we will review incoming applications at the beginning of July.
Your role within our Kingdom
Our job is to build effective, stable and reliable large-scale infrastructure tools and services for our platform, games, and product teams. We strive to empower developers to be autonomous and flexible. We continuously work to create self-service models for our tech in close collaboration with development teams.
We care deeply about our culture and believe in:
- An inclusive and diverse workplace
- Continuous improvement of everything we do
- Automation and coding as much as possible
- Collaboration and blame-free, respectful problem solving
- Asking for help and sharing ideas openly

We engineer and provide the shared infrastructure platform serving all of our games, as well as environments for developers and supporting tech like observability, log management, and event transport. This includes everything from working in our data centers and writing code for full-stack orchestration and automation to troubleshooting distributed systems and resolving production incidents.
For this role, you will join our Applications engineering team to work on observability. You will engineer and provide the systems that are the eyes and ears of the infrastructure that processes over 100 billion events per day.
Examples of what you will work on:
- Our monitoring pipeline for platform and infrastructure meters (Kafka, collectors, adapters, InfluxDB, Stackdriver, OpenTSDB)
- Alerting and notification (Nagios, PagerDuty, Monitoring in the )
- Troubleshooting and instrumentation solutions (Netdata, atop, or other such solutions)
- Log management (ELK, Graylog, Stackdriver)
Skills to create thrills
- Monitoring pipelines built with Kafka, collectors, and adapters
- Alerting and notifications (Nagios, PagerDuty, incident response)
- Troubleshooting and instrumentation solutions (Netdata, Pixie)
- Knowledge of monitoring systems like OpenTSDB, InfluxDB, and Graphite, and of log management systems like Graylog or ELK
- Orchestration frameworks like SaltStack, Ansible, Puppet, Terraform, etc.
- Comfortable with the Linux command line and its performance tools
- Able to communicate proficiently in written and spoken English

We think that you are a curious, humble, driven, collaborative, and responsible person who loves to work with infrastructure as code.