Don’t just follow the industry; define it. Tetrate, creator of Envoy Gateway and Envoy AI Gateway and architect of industry-standard security practices (SPIFFE and NGAC), is building a world-class field engineering team. Are you ready to build applications that power the global economy and Fortune 150 companies, and that protect national security? We’re looking for a Technical Lead – SRE who will apply cloud operations practices across our hybrid environments, improve customer outcomes, and own the operational roadmap.
Tetrate seeks an outcome-driven, technically adept Technical Lead – SRE to champion our enterprise customers, demonstrating how we solve Layer 7 challenges and security vulnerabilities.
Responsibilities:
- Operational Excellence & Incident Management
  - Improve MTTD and MTTR through enhanced monitoring, logging, and alerting.
  - Establish SRE practices, build operational dashboards, and maintain runbooks.
  - Enhance the customer experience by working with CRE team members to apply SRE best practices.
  - Use tools like Prometheus, Grafana, Datadog, OpenTelemetry, and Elastic Stack for observability.
  - Automate health checks and incident response with Terraform, Ansible, Helm, and Kubernetes.
- Customer Engagement & Architecture Review
  - Analyze customer architectures and operational practices.
  - Identify themes from escalations and map them to architectural gaps or operational improvements.
  - Provide tailored recommendations and help implement improvement plans for customers’ environments.
  - Develop standard operating procedures (SOPs) for deployment, maintenance, and incident handling in customer environments.
  - Provide proactive guidance on performance tuning, disaster recovery (DR) strategies, and scaling mechanisms.
  - Establish secure connectivity and seamless integration between the hosted management plane and customer environments.
  - Lead root cause analysis (RCA) and propose long-term solutions for recurring issues.
- Product & Hybrid Architecture Optimization
  - Apply cloud practices (CI/CD, GitOps) to hybrid and on-prem environments.
  - Apply cloud best practices (e.g., AWS Well-Architected Framework) to enhance both internal product development and customer environments.
  - Build custom plugins and automation scripts to meet customer needs and extend flagship product capabilities.
  - Collaborate with product teams to implement metrics improvements, UI enhancements, and alerts for hosted solutions.
- Ownership of Hosted Operations
  - Develop and execute an operational plan for hosted environments, including monitoring, alerts, and product improvements.
  - Own the adoption of hosted operational improvements by on-prem customers, ensuring alignment with hosted best practices.
- Collaboration and Leadership
  - Partner with developer, platform, and security teams to align operational goals with product roadmaps.
  - Mentor other engineers on cloud-native operations best practices, focusing on Zero Trust principles.
  - Drive continuous improvement through automation, Shift-Left initiatives, and SRE (Site Reliability Engineering) methodologies.
Required Skills:
- 8+ years of experience in Cloud Operations, SRE, or DevOps roles.
- Strong hands-on experience with Kubernetes, Istio, Envoy, gateways, load balancers, and hybrid architectures.
- Hands-on experience with cloud platforms such as AWS, GCP, or Azure and knowledge of hybrid/cloud-native architectures.
- Strong analytical and troubleshooting skills with experience in Postgres, Elastic DB, and GraphQL queries.
- Experience building CI/CD pipelines with tools like GitHub Actions or ArgoCD.
- Familiarity with on-prem deployments and integration with public cloud hosted services.
- Familiarity with LDAP, OIDC, and SAML authentication and related security configurations.
- Ability to collaborate with customers and cross-functional teams to drive operational improvements.
- Experience with CI/CD, GitOps practices, and networking concepts.
- Prior exposure to multi-cloud deployments and hybrid architectures with VM and container-based workloads.
- Ability to communicate complex technical concepts clearly to both technical and non-technical stakeholders.
- Prior experience interacting directly with enterprise customers for operational troubleshooting and architecture reviews.
- 5+ years of experience in Python and Golang (Go).
- 3-5+ years of Bash/shell scripting.
- 1-2 years of JavaScript or TypeScript.
- 2-3 years with Infrastructure as Code tools like Terraform.
- Good familiarity with YAML/JSON.
If you’re looking for a job, this isn’t it. If you’re ready to be part of something larger than yourself, connect with us.
Locations: We’re a fully distributed team with a global presence in 15 countries. While this role requires North American time zone coverage, we welcome exceptional talent from anywhere. Visa sponsorship (H-1B) is supported.