Site Reliability Engineer – AI, Analytics & Data

H&M Group

STOCKHOLM

Note: The application period for this ad has passed.

Job description

Company Description
H&M Group is on a journey to meet and exceed our customers' expectations today and tomorrow. Through collaboration, innovation, and technology we challenge ourselves and the industry. To cater to the individual needs and desires of our millions of customers, our tech organisation delivers solutions for the entire value chain for all our brands.
We are accelerating digitalisation, and to stay relevant we need strong leaders in place to bring our best capabilities, innovative ideas and talented technologists to support the transformation of H&M Group.
We take pride in our history of making fashion accessible to everyone and our ambition for tomorrow is to make fashion even more sustainable, inclusive, and welcoming.
Job Description
Site Reliability Engineers create a bridge between development and operations by applying a software engineering mindset to system administration topics. They split their time between operations and developing systems and software that help increase system reliability and performance. SREs automate redundancy and turn manual tasks into programmatic ones to keep the stack up and running. Site Reliability Engineers are able to oversee the software and performance of the full technology stack.
At H&M Group, SRE is an area within the AI, Analytics and Data (AIAD) domain. Within our area we have two teams: SRE Core and SRE Operations (Ops).
SRE Core works in close collaboration with the Platform and Data team, focusing mainly on best practices, frameworks, and automation on a multi-cloud tech stack to enable teams to run stable data products.
SRE Ops works closely with the data team, and its main focus is maintaining the stability and reliability of products in the cloud and on-premises.

Responsibilities:
Infrastructure Automation and Configuration Management:
Develop and maintain automation tools, scripts, and configuration management systems to streamline deployment, provisioning, and monitoring processes.
Implement Infrastructure as Code (IaC) practices using tools like Ansible, Terraform, or Kubernetes to manage infrastructure effectively.
Collaborate with development and operations teams to automate build, test, and deployment processes for efficient software releases.

Reliability Engineering and Resilience:
Design and implement systems and processes to enhance the reliability and resilience of the infrastructure.
Continuously improve system reliability by analyzing incident trends, identifying areas for improvement, and implementing preventative measures.

System Monitoring and Incident Response:
Develop and manage monitoring tools and systems to track the health, performance, security, and availability of software applications, infrastructure components, and services.
Set up alerts, dashboards, and metrics to proactively detect and respond to system outages, service disruptions, and performance incidents.
Investigate and diagnose the root cause of incidents and work towards their resolution in a timely manner.

Continuous Improvement and Collaboration:
Drive a culture of continuous improvement by identifying areas for automation, efficiency, and operational excellence.
Document procedures, incidents, and best practices to facilitate knowledge sharing and improve team efficiency.
Stay abreast of industry trends, emerging technologies, and best practices to propose innovative solutions that enhance system reliability and performance.
Collaborate closely with cross-functional teams, including developers, system administrators, and network engineers, to ensure smooth operation of systems.

Qualifications
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience) with 3+ years of IT experience
Proficient in scripting/programming languages such as Python and Bash
Experience with cloud platforms (Google Cloud Platform & Azure preferred)
Experience with DevOps practices, CI/CD, and monitoring tools
Experience with automation tools and configuration management frameworks such as Terraform, Puppet or Ansible
Strong troubleshooting and problem-solving skills with a keen attention to detail
Excellent communication and collaboration skills to work effectively in a cross-functional team environment
Strong experience in system administration, infrastructure management, or site reliability engineering
Ability to thrive in a fast-paced, agile environment and handle multiple priorities

Tech Stack (in a flash):
GCP, Azure, Python, Terraform, Git, SQL, Bash, Power BI, Grafana, Zabbix, Prometheus, Docker, Kubernetes, Linux, PowerShell, ServiceNow, dbt, Atlassian
Additional Information
Working with tech at H&M Group
Shaping the future of fashion with people, data, and tech. The fashion and retail industries are going through a transformation, driven by customers' technology and sustainability expectations. At H&M Group, we want to shape the future of fashion and lifestyle by harnessing the power of smart tech and data. With our 74-year history of innovation, we understand the need to collaborate and co-create with engineers and tech specialists around the world to achieve our vision.
What we offer!
You are joining a unique value-driven culture, a large tech network and community where you can be yourself. Besides the obvious perks such as a staff discount card, flexible work life, learning communities, wellness benefits, and parental benefits, there are endless opportunities to experiment and grow in any direction that you want, and when you grow, we grow. Being a major player gives us countless opportunities to make a real impact and shape the future.
We are committed to creating an inclusive and diverse workplace with a culture that is dynamic and innovative.
Sounds interesting?
This is a full-time position based in Stockholm. We do not accept applications through email due to GDPR.

Summary

Workplace: H&M Group STOCKHOLM
1 position
Permanent (until further notice)
Full-time
Fixed monthly, weekly, or hourly salary
Published: 30 November 2023
Apply by: 30 December 2023

Postal address

Liljeholmsstranden 5
STOCKHOLM, 11743
