Site Reliability Engineering (SRE) Lead – AWS
UST Global
Date: 1 day ago
City: Hyderabad
Contract type: Full time
Experience: 10 - 15 Years
Openings: 1
Role description
Job Title: Site Reliability Engineering (SRE) Lead
Experience Required
- Overall: 15+ years in software engineering, systems engineering, or related fields, including team management experience.
- Cloud Expertise: Minimum 5 years of hands-on experience in AWS environments.
Role Overview
As the SRE Lead, you will own the reliability strategy for mission-critical systems and lead a team of engineers to ensure high availability, scalability, and performance. You will combine technical expertise with leadership skills to drive operational excellence and foster a culture of reliability across engineering teams.
Key Responsibilities
- Leadership & Strategy
- Define and implement SRE best practices across the organization.
- Proven expertise in production support, resilience engineering, disaster recovery (DR), automation, and cloud operations.
- Mentor and guide a team of SREs, fostering growth and technical excellence.
- Collaborate with senior stakeholders to align reliability goals with business objectives.
- Reliability & Performance
- Establish SLIs, SLOs, and SLAs for critical services and ensure adherence.
- Drive initiatives to improve system resilience and reduce operational toil.
- Excellence in designing systems that detect and remediate issues without manual intervention, such as self-healing systems and runbook automation.
- Exposure to chaos engineering tools such as Gremlin, Chaos Monkey, and AWS FIS to simulate outages and improve fault tolerance.
- Incident Management
- Act as the primary point of escalation for critical production issues and lead major incident response, root cause analysis, and postmortems.
- Perform detailed post-incident investigations to identify underlying causes. Document findings and share learnings to prevent recurrence.
- Implement preventive measures and continuous improvement processes.
- Observability
- Champion monitoring, logging, and alerting strategies using tools like Prometheus, Grafana, ELK, and AWS CloudWatch.
- Build real-time dashboards to visualize system health and reliability metrics.
- Configure intelligent alerting based on anomaly detection and thresholds.
- Combine metrics, logs, and traces to enable root cause analysis and reduce Mean Time to Resolution (MTTR).
- Knowledge of AIOps or ML-based anomaly detection for proactive reliability management.
- Collaboration
- Work closely with development teams to integrate reliability into application design and deployment.
- Promote a culture of shared responsibility for uptime and performance across engineering teams.
- Strong interpersonal and communication skills for technical and non-technical audiences.
Required Skills
- Deep expertise in AWS services (EC2, ECS/EKS, RDS, S3, Lambda, VPC, FIS, CloudWatch).
- Hands-on experience with Infrastructure as Code (Terraform, Ansible, CloudFormation).
- Advanced knowledge of monitoring and observability tools.
- Excellent leadership, communication, and stakeholder management skills.
Skills
SRE, AWS, Terraform, Ansible, Prometheus, Incident Management, Disaster Recovery.
About UST
UST is a global digital transformation solutions provider. For more than 20 years, UST has worked side by side with the world’s best companies to make a real impact through transformation. Powered by technology, inspired by people and led by purpose, UST partners with their clients from design to operation. With deep domain expertise and a future-proof philosophy, UST embeds innovation and agility into their clients’ organizations. With over 30,000 employees in 30 countries, UST builds for boundless impact, touching billions of lives in the process.