Heredia, CRI
7 days ago
SRE - Backup & Recovery Storage DevOps
**Introduction**

About IBM

IBM is a global technology and innovation company. It is the largest technology and consulting employer in the world, with a presence in 170 countries. The diversity and breadth of the entire IBM portfolio of research, consulting, solutions, services, systems and software uniquely distinguishes IBM from other companies in the industry.

Over the past 100 years, a lot has changed at IBM. In this new era of Cognitive Business, IBM is helping to reshape industries as diverse as healthcare, retail, banking, travel, manufacturing, and many more, by bringing together our expertise in Cloud, Analytics, Security, Mobile, and the Internet of Things. We like to say, "be essential."

We are changing how we craft. How we collaborate. How we analyze. How we engage. Join the next generation of innovators, inventors and entrepreneurs who are crafting the very way the world works. We want the brightest minds doing work that inspires, in an environment where growth is supported. IBMers get to discover their potential, so they’re inspired to build breakthroughs that help our clients succeed. We’re building teams with dynamic strengths, made up of people who want their ideas to matter. Join us. You’ll be proud to call yourself an IBMer.

Our Culture:

IBM is committed to crafting a diverse environment and is proud to be an equal opportunity employer. You will receive consideration for employment without regard to your race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status.

Business Unit Introduction

IBM Cloud is a one-stop shop that provides all the cloud solutions and cloud tools that industries need. The IBM Cloud portfolio includes infrastructure as a service (IaaS), software as a service (SaaS) and platform as a service (PaaS) offered through public, private and hybrid cloud delivery models, in addition to the components that make up those clouds.
IBM Cloud ensures seamless integration into public and private cloud environments. The infrastructure is secure, scalable, and flexible, providing customized enterprise solutions that have made IBM Cloud the Hybrid Cloud Market leader with our market-leading IaaS and PaaS platforms. The IBM Cloud platform is the public cloud offering from IBM, providing services to global enterprises. IBM Cloud is the Cloud for Smarter Business, built on Open Technology with Developer Tools, and it supports solutions by industry. We run the services and workloads from Watson, Blockchain, Services, Security, and IoT.

Ready to help drive IBM's success in the Cloud market? This is your chance to research and learn new Cloud-related technology products and services, as well as to design and implement quick Cloud-based prototypes while advancing your career in leading-edge technology.

Who you are:

As a Site Reliability Engineer (SRE) in the IBM Cloud Infrastructure organization, you will be responsible for ensuring the reliability, scalability, and operational efficiency of IBM Cloud's storage services. You will work closely with development teams, SRE peers, and engineering managers to automate infrastructure management, optimize system performance, and enhance monitoring capabilities. This role involves writing code, building automation, troubleshooting production issues, and improving overall service reliability.

**Your role and responsibilities**

Key Responsibilities:

Reliability & Scalability
· Design, build, and maintain highly available, distributed services with a focus on scalability, security, and performance.
· Develop and implement Kubernetes and OpenShift-based solutions to manage containerized applications at scale.
· Create auto-scaling, load balancing, and failover strategies to ensure seamless service availability.

Monitoring & Observability
· Design, implement, and manage monitoring solutions to gain insights into system health and performance.
· Create and maintain intuitive dashboards that provide real-time visibility into critical metrics.
· Set up proactive alerting mechanisms to detect and resolve issues before they impact end users.

Automation & Infrastructure as Code (IaC)
· Develop robust automation scripts using tools such as Terraform and Ansible to simplify infrastructure management.
· Automate repetitive operational tasks to improve system reliability and reduce manual effort.
· Implement CI/CD pipelines for deploying applications on Kubernetes and OpenShift environments.

Incident Management & Troubleshooting
· Respond to alerts, incidents, and outages with a focus on minimizing downtime and restoring services efficiently.
· Conduct thorough Root Cause Analysis (RCA) for critical issues and implement long-term solutions to prevent recurrence.

Disaster Recovery & High Availability
· Design and implement backup and recovery strategies.
· Perform BCDR (Business Continuity and Disaster Recovery) simulations.
· Ensure data redundancy, failover strategies, and failback mechanisms.

Security & Compliance
· Ensure compliance with security best practices and regulatory requirements.
· Implement secret management, encryption, and access control for sensitive infrastructure components.
· Participate in security audits, vulnerability assessments, and compliance automation efforts.

Cross-Team Collaboration & DevOps Culture
· Work closely with development, operations, and security teams to design and implement resilient architectures.
· Advocate for DevOps/SRE best practices, including blameless postmortems, incident retrospectives, and operational readiness reviews.
**Required technical and professional expertise**

Technical Skills
· Programming Languages: Go, Python, Bash, or other scripting languages
· Cloud & Infrastructure: Kubernetes, Docker, Terraform, IBM Cloud, AWS, or other cloud providers
· CI/CD & Automation: GitHub Actions, Jenkins, Ansible
· Monitoring & Logging: IBM Cloud Monitoring tools, Prometheus, Grafana

Required Experience:
· Experience in SRE, DevOps, or Software Engineering roles.
· An understanding of Cloud infrastructure/operations is a must.
· Proficiency in Kubernetes (certifications such as CKA or CKS are a plus).
· Strong experience with OpenShift for managing containerized applications.
· Proficiency in Go, Python, or Bash for automation and tool development.
· Deep understanding of Linux internals, system administration, and troubleshooting.
· Experience in building and managing infrastructure with Terraform, Ansible, or similar IaC tools.
· Expertise in CI/CD tools such as Jenkins and GitHub Actions.
· Expertise in logging and monitoring tools to ensure system observability and performance.
· Strong knowledge of networking concepts, firewalls, and security best practices in Kubernetes.

Required Education
· Bachelor’s degree in computer science engineering/information technology

Required Experience
· 3-4 years

IBM is committed to creating a diverse environment and is proud to be an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, gender, gender identity or expression, sexual orientation, national origin, caste, genetics, pregnancy, disability, neurodivergence, age, veteran status, or other characteristics. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status.