Previous Work Experience: 5+ years
Role Responsibilities: GCP, GKE at scale, New Relic, Terraform, CI/CD, AI/ML
platforms, incident ownership
• Own day-2 production operations of a large-scale, AI-first platform on GCP.
• Run, scale, and harden GKE-based workloads integrated with a broad set of
GCP managed services (data, messaging, AI, networking, and security).
• Define, implement, and operate SLIs, SLOs, and error budgets across platform
and AI services.
• Build and own New Relic observability end-to-end (APM, infrastructure, logs,
alerts, dashboards).
• Improve and maintain CI/CD pipelines and Terraform-driven infrastructure
automation.
• Operate and integrate Azure AI Foundry for LLM deployments and model
lifecycle management.
• Lead incident response and postmortems, and drive systemic reliability
improvements.
• Optimize cost, performance, and autoscaling for AI and data-intensive
workloads.
Requirements
• 6+ years of hands-on experience in DevOps, SRE, or Platform Engineering
roles.