Senior CloudOps Engineer
Onit, Inc. is looking for a Senior CloudOps Engineer to join our team in Pune to help manage and maintain a diverse infrastructure across numerous geographical locations. To be successful in this role, great people skills are a must, as well as a passion for technology. The individual we seek is bright, creative and a problem solver. You must be able to multi-task in a fast-paced environment and be a self-starter with the ability to work independently.
Responsibilities
Optimize performance, ensure security, and drive innovation in our cloud environment while responding to infrastructure and security alerts in a 24x7x365 operation.
Create automation, runbooks, and playbooks to help others support the infrastructure
Troubleshoot infrastructure and application-level issues and collaborate with support specialists and Cloud Operations / SRE
Write and present a weekly report highlighting the previous week’s alerts, with detailed analysis, resolution, and any impact to SLA.
Monitor performance and capacity of Onit systems.
Monitor for hardware, software and environmental alerts or malfunctions.
Monitor security alerts from multiple sources.
Triage and troubleshoot problems as they arise, following runbooks and standard operating procedures.
Track all issues from start to finish and document all resolutions in detail, across the trouble ticketing system and engineering runbooks.
Escalate issues to InfraOps/DevOps engineers and Onit management.
Ready to work in shifts.
Requirements
Bachelor’s degree in Computer Science or equivalent experience is required.
4+ years’ experience with Red Hat Enterprise or Amazon Linux 2023 is required.
3+ years’ hands-on experience with AWS (EC2, S3, RDS, VPC, CloudWatch, CloudTrail, IAM, EKS, ECS, Security, etc.)
A solid understanding of the components that make up production systems (memory, CPU, disk space, disk I/O, network I/O, etc.) is required.
Strong experience with monitoring, alerting, and log aggregation tools: Datadog, AWS CloudWatch, PagerDuty, Statuspage.
Experience with SIEM/event correlation systems like Elastic, Splunk, ELK, etc. required.
Strong understanding of AWS security and monitoring and experience implementing best practices.
Ability to read and interpret application server logs, outputs, CloudTrail and other critical logging output
Experience working with relational databases such as Postgres or AWS RDS is a plus
Hands-on experience working in Kubernetes is a plus
Experience with Enterprise Web applications in production
Experience with a programming language such as Python is a plus
Excellent troubleshooting skills required.
Ensure resource availability and allocation
Excellent written and verbal communication skills required.
Experience using Git (GitLab a plus) and CI/CD pipelines (e.g., Jenkins)