Senior Site Reliability Engineer (SRE) - Core Messaging Infrastructure - STACKIT (m/f/d)

Schwarz Digits creates the technological foundation for digital sovereignty in Europe. As the IT and digital division of the Schwarz Group, we develop and manage the IT infrastructures for the retail divisions Lidl and Kaufland, as well as Schwarz Production and PreZero. At the same time, we operate as an independent provider in the external market to support companies across Europe in their digital transformation. We bundle our core services in the areas of Cloud, Cyber Security, Data & AI, Communication, and Workspace.

Join us and contribute to digital sovereignty in Europe. With us, you will work at the intersection of agility and security: You will benefit from fast decision-making processes, enjoy genuine creative freedom in your projects, and be able to build upon the stable foundation of the Schwarz Group.

We are looking for a Senior Engineer to build, scale, and own the central nervous system of our cloud infrastructure: a highly resilient, high-throughput message and event platform. As our engineering organization scales rapidly, we are transitioning to a real-time, event-driven architecture to ensure seamless communication between the control plane components of all our products. You will empower dozens of product teams by providing an outstanding developer experience, enabling them to seamlessly publish and consume millions of events per day.

Your Tasks

You design, deploy, and manage highly available, distributed message broker clusters (such as Apache Kafka, Solace, or NATS) across multiple data centers.
You ensure the reliability, performance, and fault tolerance of the messaging infrastructure by implementing robust disaster recovery and failover strategies, and you tune operating system configurations for low-latency delivery.
You automate the provisioning, scaling, and configuration of messaging clusters.
You build comprehensive monitoring, alerting, and logging dashboards to track cluster health, throughput, and latency.
You define best practices for application developers and build a self-service platform that makes it easy for internal teams to independently configure their integrations.

Your Profile

You bring solid experience managing large distributed systems in production, ideally from areas such as Site Reliability Engineering or Platform Engineering.
You have deep, hands-on administrative experience with enterprise brokers such as Apache Kafka or Solace, and you have experience managing infrastructure on Kubernetes using the operator pattern as well as managing virtual machines with tools such as Ansible.
You are fluent in Python, Go, or Bash and have a strong understanding of Linux performance tuning and network protocols (such as Transmission Control Protocol/Internet Protocol or Domain Name System).
Ideally, you have a deep understanding of event-driven architecture patterns and event-streaming concepts, so that you can support the design of scalable real-time data pipelines.
Your English, ideally in combination with German, forms the basis for successful communication in our international, agile teams.

3944
