Job Description
Company Description
Mojo Trek, an Inc. 5000 company, delivers an unparalleled recruitment experience grounded in transparency and integrity. From midsize technology innovators to Fortune 50 corporations, we help our clients build technology teams that make a difference, drive change, and develop software critical to their success.
We are seeking a Senior Site Reliability Engineer to join our Infrastructure team. In this role, you will be responsible for the reliability, scalability, and performance of our global cloud platforms. You will bridge the gap between software engineering and systems operations, with a heavy focus on building world-class observability.
The ideal candidate has deep expertise in both the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) and Datadog. You will lead the strategy for correlating metrics, logs, and traces to reduce Mean Time to Recovery (MTTR) and improve our overall system visibility.
Key Responsibilities
- Design, implement, and maintain scalable observability pipelines using the Grafana LGTM stack and Datadog.
- Build and optimize high-cardinality metrics storage in Mimir and distributed tracing workflows in Tempo.
- Develop advanced Loki log aggregation strategies to provide cost-effective, high-speed troubleshooting capabilities.
- Manage and tune Datadog agents, APM, and synthetic monitoring to ensure comprehensive coverage of our microservices architecture.
- Lead incident response and post-mortem root cause analysis, using observability data to drive architectural improvements.
- Automate infrastructure provisioning and configuration management using Terraform, Pulumi, or similar IaC tools.
- Mentor junior engineers on SRE best practices, including SLIs, SLOs, and error budgets.
- Collaborate with development teams to instrument applications using OpenTelemetry (OTel) for seamless data ingestion into both Grafana and Datadog.
Required Qualifications
- 7+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Proven track record of managing production-grade Grafana environments, specifically using the other LGTM components (Loki, Tempo, Mimir).
- Extensive experience with Datadog, including APM, Log Management, and Dashboarding.
- Strong proficiency in container orchestration, specifically Kubernetes (EKS, GKE, or self-managed).
- Deep understanding of the "Three Pillars of Observability" (metrics, logs, and traces) and how to correlate them across different platforms.
- Experience with Infrastructure as Code (Terraform) and CI/CD pipelines (GitHub Actions, GitLab CI, or Jenkins).
- Strong programming skills in Go, Python, or Rust.
- Experience with Prometheus and Alertmanager, and with writing queries in PromQL and LogQL.
Preferred Skills
- Experience migrating workloads or observability data between Datadog and self-hosted Grafana stacks.
- Knowledge of eBPF for deep system observability.
- Contributions to open-source observability projects.
- Experience managing large-scale cloud costs and optimizing observability spend.