GDIT is seeking a Senior HPC Architect to join our Scientific Infrastructure Team, providing High Performance Computing (HPC) services for a large biomedical research community with the National Institute of Allergy and Infectious Diseases (NIAID). Our Scientific Infrastructure Team is responsible for enabling and managing HPC and its associated infrastructure and interconnects across multiple locations, hundreds of COTS and open-source scientific applications, and ~40 PB of data storage, including data archive, lifecycle policy management, and data sharing services. This team serves as a customer-facing presence for the NIAID research community, providing a single point of support for new initiatives, ongoing projects, and scientific infrastructure needs.
In your role as a Senior HPC Architect, you will be a subject matter expert architecting, implementing, and managing multiple high performance compute clusters and their associated infrastructure for a large biomedical research community.
Work Visa sponsorship will not be provided for this position.
HOW A SENIOR HPC ARCHITECT WILL MAKE AN IMPACT:
Provide hands-on administration and support for two HPC clusters: a GPU-focused 4,000+ core cluster and a 1,500+ core cluster, including monitoring the performance and health of both
Install and support bioinformatics applications for a large and diverse research community with needs in genomics, cryo-electron microscopy, and AI/ML
Architect and design HPC clusters, including designing new clusters or expanding existing components such as storage, InfiniBand, and compute
Monitor and report on cluster performance and generate data to show usage and trends
Perform troubleshooting and problem-solving for complex HPC operational and performance issues
Collaborate with researchers to guide them in effective use of the HPC resources, such as job scheduler submission, data formats, and building data workflows to move data from scientific instruments to the HPC clusters for analysis
Provide input to the Scientific Infrastructure team leader for setting priorities for cluster operations, scheduling policies, resources needed, etc.
Develop and maintain documentation and diagrams for the HPC clusters, review GitHub pull requests, and update content and training materials on the user wiki portal
Teach and mentor team members on system design, best practices, and troubleshooting techniques
WHAT YOU’LL NEED TO SUCCEED:
Education: BS/BA (or equivalent)
Required Experience: Minimum of 10 years related experience
Required Technical Skills:
Minimum of 5 years’ experience as an engineer or architect with HPC technologies
Hands-on architecture design experience with HPC, including storage, file system, InfiniBand, security, authentication, and compute architectures
Experience with Slurm job scheduling, including troubleshooting job status and optimizing submission scripts
Experience using Git to manage shared software configuration code bases
Hands-on experience with cloud-based services (e.g. Azure, AWS, GCP)
Minimum of 5 years’ experience in Linux systems administration
Good understanding of storage administration and optimization, such as performing upgrades and defining RAID configurations
Good understanding of fundamental networking concepts and their practical applications
Experience with the Spack or EasyBuild package managers, including building packages from PyPI, R, and GitHub
Knowledge of and experience in one or more scripting languages applicable to Linux (e.g. Bash, Perl, Python)
Security Clearance Level: Must be able to obtain an NIH Public Trust
Preferred Skills:
Experience administering Red Hat / CentOS based systems
Experience working in a life-sciences oriented environment
Experience configuring and using monitoring systems to monitor HPC clusters
Ability to determine meaningful metrics and usage data for monthly status reports and health dashboards
Experience with DevOps or DevSecOps methodologies, such as automation and configuration management
Strong troubleshooting skills
Location: Remote