Bogota
4 days ago
Lead I - Software Engineering

Job Title: Site Reliability Engineer

Overview: We are seeking a proactive and skilled Site Reliability Engineer (SRE) to maintain and enhance the performance, reliability, and security of our tools and services. As an SRE, you will focus on lifecycle management, operational support, incident management, and system optimization to ensure seamless IT delivery and infrastructure stability.

Areas of Focus:

Lifecycle Management & Vulnerability Management: Keep software and services up to date with monthly updates. Identify, report, and mitigate security vulnerabilities.

System Reliability & Incident Management: Monitor system performance, uptime, and reliability. Handle incident response and management using ServiceNow and PagerDuty. Conduct blameless postmortems to analyze incidents and define action items that improve systems and prevent recurrence.

User Support & Documentation: Provide user support through Slack, email, and ServiceNow requests. Create and maintain detailed documentation.

Observability & Audits: Implement monitoring solutions, alerting, and dashboards. Conduct regular audits to ensure compliance with internal and external standards.

Performance Tuning & Disaster Recovery: Optimize system performance through regular analysis and improvements. Develop and implement disaster recovery plans to ensure business continuity.

Capacity Planning & Change Management: Plan and manage system capacity to meet future demands. Coordinate and execute change management processes.

Maintenance Window Support & On-Call Duties: Manage and support systems during maintenance windows. Participate in on-call rotations to provide 24/7 support.

A Day in the Life:

Incident Management:

Handle incidents reported in ServiceNow and PagerDuty as the primary contact. Identify and resolve issues to maintain system stability and reliability. Perform blameless postmortems to analyze incidents and implement preventive measures.

Request Fulfillment:

Monitor and manage the ServiceNow request queue. Fulfill user support requests within the three-day SLA. Provide assistance through various channels such as Slack, email, and ServiceNow.

Lifecycle Management & Vulnerability Management:

Conduct regular lifecycle management activities, including applying patches and updating software versions. Perform vulnerability assessments and implement necessary security measures.

System Performance & Change Management:

Participate in performance tuning and system optimization tasks. Engage in change management processes, coordinating and implementing infrastructure or tool changes.

Documentation & Compliance:

Create and maintain documentation for systems, processes, and changes. Ensure compliance with internal and external standards through regular audits.

Maintenance Window & On-Call Duties:

Provide support during scheduled maintenance windows. Participate in on-call rotations to ensure continuous support and quick resolution of issues.

Must Have:

Solid experience in SRE or related roles: 3-5 years of experience in maintaining, optimizing, and managing production-level infrastructure.

Proficiency with monitoring tools: Experience with tools such as Prometheus, Grafana, Datadog, or similar to implement monitoring solutions, alerts, and dashboards.

Incident management expertise: Experience handling incidents using platforms like ServiceNow and PagerDuty, along with the ability to conduct blameless postmortems for analysis and improvement.

Knowledge of operating systems and networking: Experience with Linux/Unix, networking, virtualization, and database administration.

Vulnerability and patch management: Experience applying monthly security updates and mitigating vulnerabilities in production environments.

Scripting and automation skills: Proficiency in scripting languages like Python, Bash, or Go, as well as experience with automation tools like Ansible, Terraform, or similar.

Performance tuning expertise: Experience identifying and resolving system performance bottlenecks and implementing optimization solutions.

Experience in high-availability environments: Commitment to system stability and uptime, with experience in disaster recovery planning and participation in 24/7 on-call rotations.

 
