Senior Site Reliability Engineer – AI-Driven Cloud Reliability
SenecaGlobal
Location:
Office 3A, 3rd Floor, Orbit,
Plot Number 30/C, Survey Number 83/1, Raidurg
Hyderabad – 500081, Telangana, India
Job Description
We are looking for a highly experienced Senior Site Reliability Engineer to independently own, modernize, and strengthen cloud-native reliability across enterprise platforms.
This role requires deep hands-on expertise in SRE, DevOps, Azure cloud, Kubernetes, CI/CD, Infrastructure as Code, DevSecOps, observability, automation, and AI-assisted engineering practices.
The ideal candidate will be a senior independent contributor who can drive reliability strategy, improve production stability, reduce operational toil, strengthen deployment practices, and responsibly use AI to accelerate platform engineering and incident management workflows.
Key Responsibilities
- Own and drive SRE practices across cloud-native platforms, infrastructure and applications.
- Lead production readiness reviews, reliability assessments, architecture reviews and operational risk evaluations.
- Troubleshoot complex production and non-production issues across infrastructure, CI/CD pipelines, branching strategies, cloud platforms, networking and application performance.
- Improve system availability, scalability, resilience, latency and recovery through automation, proactive monitoring and self-healing solutions.
- Provide guidance to engineering, platform and cloud teams on SRE, DevOps and reliability best practices.
- Drive proactive DevSecOps adoption and integrate security practices throughout the development and operations lifecycle.
Site Reliability Engineering & Strategy
- Own and drive SRE practices across cloud-native platforms, infrastructure, and applications.
- Define reliability standards, SLIs, SLOs, error budgets, runbooks, post-mortems, SOPs, and incident response workflows.
- Lead production readiness, reliability reviews, architecture reviews, and operational risk assessments.
- Troubleshoot complex production, infrastructure, CI/CD, Kubernetes, cloud, networking, and performance issues.
- Improve availability, scalability, latency, resilience, and recovery through automation, proactive engineering, and self-healing practices.
- Guide engineering, platform, and cloud teams on modern SRE, DevOps, and reliability practices.
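As an illustration of the SLO and error budget concepts named above, the sketch below shows how an availability target translates into an error budget. The 99.9% target, the 30-day window, and the function names are assumptions chosen for the example; they are not values or tooling specified by this role.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction (0.999 means 99.9%); both it and the default
    30-day window are illustrative assumptions.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    Uses a request-based SLI: good_events / total_events.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers for a single service over a 30-day window.
    print(f"budget: {error_budget_minutes(0.999):.1f} minutes")
    print(f"remaining: {budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

Under these assumed numbers, half of the allowed failures have been consumed, which is the kind of signal a team might use to decide whether to slow releases or keep shipping.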
Cloud Platform, IaC & Automation
- Design and optimize secure, scalable, highly available, and cost-efficient cloud infrastructure.
- Build and govern IaC using Terraform, ARM/Bicep, Ansible, CloudFormation, Pulumi, or equivalent tools.
- Mandatory hands-on experience with ARM/Bicep.
- Automate operational tasks using PowerShell and Azure CLI; experience with Python, Bash, or Go is preferred.
- Build reusable automation frameworks, IaC modules, deployment templates, runbooks, and platform standards.
- Review infrastructure, IAM, cloud, network, Kubernetes, and pipeline configurations for reliability, security, scalability, and cost efficiency.
- Build self-service capabilities and internal developer platform practices to improve engineering velocity.
CI/CD, Containers & Release Reliability
- Architect and optimize CI/CD pipelines using Azure DevOps, Jenkins, GitHub Actions, GitLab CI/CD, ArgoCD, or similar tools.
- Mandatory hands-on experience with Azure DevOps.
- Design reliable release strategies, including canary releases, rolling deployments, feature flags, and rollback automation; hands-on experience with blue-green deployments is mandatory.
- Lead containerization and orchestration using Docker, Kubernetes, Helm, Azure Container Apps, AKS, EKS, or GKE.
- Mandatory hands-on experience with Azure Container Apps.
- Improve deployment reliability, release governance, lead time, change failure rate, and recovery time.
DevSecOps & Secure Engineering
- Integrate security into CI/CD, cloud infrastructure, containers, Kubernetes, and release workflows.
- Implement SAST/DAST, secrets management, vulnerability scanning, policy enforcement, and compliance automation.
- Mandatory hands-on experience with SonarQube administration and upgrades, including plugin management, quality gates, and version upgrades.
- Govern IAM, Azure network security, Key Vault, certificates, firewalls, audit readiness, and policy compliance.
- Identify and remediate infrastructure, Kubernetes, IAM, and pipeline security risks.
Observability & Incident Response
- Design observability solutions covering metrics, logs, traces, dashboards, alerts, SLIs, SLOs, and incident workflows.
- Mandatory hands-on experience with Splunk, New Relic, Grafana, and Datadog.
- Create and maintain PromQL queries, Grafana dashboards, Splunk searches, Datadog monitors, and New Relic alert policies.
- Use observability insights to detect reliability risks, performance bottlenecks, error patterns, and capacity issues.
- Lead incident response, root cause analysis, post-mortems, escalation workflows, and preventive action tracking.
- Familiarity with Prometheus, CloudWatch, Azure Monitor, ELK/EFK, or OpenTelemetry is an added advantage.
Azure Platform Engineering – Mandatory Depth
- Strong hands-on experience across the Azure stack, including Azure DevOps, ARM/Bicep, Azure Container Apps, Azure VNets, VNet Peering, DNS, load balancing, Key Vault, certificates, Azure Firewall, and serverless compute.
- Design secure and highly available Azure network architectures, including hub-spoke topology, private endpoints, NSG governance, and controlled traffic flows.
- Ensure Azure environments are reliable, scalable, secure, compliant, and cost-optimized.
AI-Driven SRE & Engineering Productivity
- Use AI-assisted engineering tools such as GitHub Copilot, Claude Code, Cursor, or similar tools to accelerate SRE and DevOps workflows.
- Generate, review, and refine Terraform modules, ARM/Bicep templates, Helm charts, Kubernetes manifests, Dockerfiles, Ansible playbooks, and CI/CD YAML pipelines using AI support.
- Use AI to identify misconfigurations, insecure IAM policies, Kubernetes risks, scalability bottlenecks, reliability gaps, and infrastructure anti-patterns.
- Apply AI for log analysis, incident summarization, root cause hypothesis generation, runbook drafting, post-mortem creation, and leadership updates.
- Generate observability artifacts such as PromQL queries, Grafana configurations, Splunk searches, Datadog queries, New Relic policies, and CloudWatch queries using AI assistance.
- Author and review Solution Design Documents for SRE, cloud platform, DevOps, automation, and reliability initiatives.
- Establish safe AI usage practices, ensuring no secrets, credentials, proprietary code, client data, or confidential information is shared with public AI tools.
- Validate all AI-generated recommendations before applying them to production or security-sensitive environments.
Required Skills and Experience
- 10+ years of overall IT experience with a strong background in SRE, DevOps, cloud engineering, platform engineering, automation, and production operations.
- 5+ years of experience designing and operating enterprise-grade cloud-native platforms and reliability engineering practices.
- Strong hands-on experience with Azure, including Azure DevOps, ARM/Bicep, Azure Container Apps, VNets, VNet Peering, Key Vault, DNS, load balancing, firewalls, certificates, and serverless services.
- Deep expertise in CI/CD tools such as Azure DevOps, Jenkins, GitHub Actions, GitLab CI/CD, ArgoCD, or similar platforms.
- Strong experience with Terraform, ARM/Bicep, Ansible, CloudFormation, Pulumi, or equivalent IaC tools.
- Strong knowledge of Docker, Kubernetes, Helm, Azure Container Apps, container registries, and cloud-native deployment patterns.
- Mandatory hands-on experience with SonarQube administration, including quality gates, plugin management, and version upgrades.
- Mandatory experience with observability platforms: Splunk, New Relic, Grafana, and Datadog.
- Strong scripting experience using PowerShell, Azure CLI, Python, Bash, Go, or similar languages.
- Strong understanding of DevSecOps, cloud security, IAM, secrets management, vulnerability scanning, and compliance automation.
- Strong knowledge of cloud networking, including Azure VNets, VNet Peering, DNS, load balancing, certificates, firewalls, distributed systems, and high-availability architecture.
- Mandatory experience with blue-green deployments, along with canary releases, rolling deployments, feature flags, and rollback automation.
- Strong understanding of SLIs, SLOs, error budgets, incident response, post-mortems, and production reliability practices.
- Proven ability to independently lead technical investigations, troubleshoot critical production issues, and drive reliability improvements.
- Strong working knowledge of AI-assisted engineering tools and the ability to apply AI responsibly across SRE, DevOps, automation, observability, and incident response workflows.
Experience & Qualifications
- 10–12 years of experience
- BE/B.Tech/M.Tech/MCA
How to Apply
To apply, submit your CV and contact information to [email protected].
About SenecaGlobal
Founded in 2007, SenecaGlobal is a global leader in software development and management. Services include software product development, application software development, enterprise cloud and managed services, quality assurance and testing, security, operations, help desk, technology advisory services, and more. The company’s agile team consists of world-class information technologists and business executives across industries, giving clients a strong competitive advantage.
SenecaGlobal is headquartered in Chicago, Illinois, and has a state-of-the-art software development and management center in Hyderabad, India. The company is certified as a Great Place to Work and is ISO 9001 certified for quality and ISO 27001 certified for security.