Job Summary
Incident Management: Lead and manage high-priority incident responses with urgency and efficiency. Analyze, troubleshoot, and resolve complex system issues spanning multiple technology stacks.
FMEA: Perform hot spot analysis and service map analysis to identify potential risks and vulnerabilities to faults.
NFRs and Quality Gates: Refine and drive NFRs and quality gates, including safe release patterns and reliability and resiliency requirements, in releases.
Responsibilities
Full-Stack Expertise: Use extensive knowledge of both front-end and back-end technologies to understand and debug system issues quickly. Implement solutions that encompass all layers of the application and infrastructure stack.
Enterprise Systems Knowledge: Use deep understanding of enterprise-level network and middleware technologies to find root causes of incidents and provide sustainable solutions.
Problem Management: Drive continuous improvement initiatives by analyzing incident trends, finding recurring issues, and implementing proactive measures to enhance system reliability and performance. Review and refine SRE standards and processes, focusing on incident response and reducing toil.
Collaboration: Work closely with development, operations, and other IT teams to ensure cohesive and effective incident management. Facilitate post-incident reviews and share learnings across the organization.
Automation and Tooling: Develop and implement automation tools and scripts to streamline diagnostic packages, incident response, and resolution processes. Provide feedback on enhanced monitoring and alerting systems to detect issues proactively.
Documentation: Keep detailed documentation of incidents, resolutions, and system changes to ensure knowledge sharing and compliance with IT governance standards.
Observability & Self-Heal: Provide leading indicators and drive observability maturity. Drive development of self-heal capabilities with various teams.
Assess and measure the performance, resiliency, and reliability of apps, with a focus on observability and monitoring practices such as SLAs and SLOs
Experience in Dynatrace
Configure monitoring and logging of systems in order to obtain better visibility
Help design processes that automatically evaluate system SLAs
Be proactive: identify and remediate issues before SLAs are violated
Tool-agnostic and approach-centric
Required Knowledge/Skills, Education, and Experience
8-12 years in the software industry with 4+ years in an SRE or DevOps role
In-depth knowledge of full-stack technologies, legacy servers, middleware, cloud platforms (AWS), containerization technologies (e.g. Docker, Kubernetes), and databases (SQL, NoSQL)
Experience with container management and infrastructure monitoring tools
Expertise in enterprise network architectures, protocols, middleware technologies, and API management tools and processes
Programming skills in high-level languages such as Python, Java, Ruby, or JavaScript
Automation experience with scripting and API development (e.g. Ansible, Terraform, Shell, Python)
2+ years with observability tools and containerization
Preferred Knowledge/Skills, Education, and Experience
Experience with AWS, Terraform, CloudFormation, and incident tracking tools
Certifications in AWS, observability, and monitoring tools
Experience with log management tools
Ensure system reliability by getting systems back to a steady state as quickly as possible
The Cognizant community:
We are a high-caliber team who appreciate and support one another. Our people uphold an energetic, collaborative, and inclusive workplace where everyone can thrive.
About us:
Cognizant is one of the world's leading professional services companies, transforming clients' business, operating, and technology models for the digital era. Our unique industry-based, consultative approach helps clients envision, build, and run more innovative and efficient businesses. Headquartered in the U.S., Cognizant (a member of the NASDAQ-100 and one of Forbes World’s Best Employers 2024) is consistently listed among the most admired companies in the world. Learn how Cognizant helps clients lead with digital at www.cognizant.com
Our commitment to diversity and inclusion:
Cognizant is an equal opportunity employer that embraces diversity, champions equity and values inclusion. We are dedicated to nurturing a community where everyone feels heard, accepted and welcome. Your application and candidacy will not be considered based on race, color, sex, religion, creed, sexual orientation, gender identity, national origin, disability, genetic information, pregnancy, veteran status or any other protected characteristic as outlined by federal, state or local laws.
Disclaimer:
Compensation information is accurate as of the date of this posting. Cognizant reserves the right to modify this information at any time, subject to applicable law.
Applicants may be required to attend interviews in person or by video conference. In addition, candidates may be required to present their current state- or government-issued ID during each interview.