New Taipei City, TW
2 days ago
【RD】AI Business Engineer_System
Job Responsibility

We are seeking an experienced Senior System Engineer to test and validate AI servers and high-speed networking and to resolve escalated service issues. This role requires expertise in performance testing and fine-tuning for AI/ML workloads, GPU configurations, and High Performance Computing (HPC) environments. You will provide customers with the performance benchmark data they require, ensure that system deployments meet customer requirements, and communicate effectively with customers to resolve technical issues, ensuring customer satisfaction and optimal system performance.
•    Responsibilities:
Essential duties and responsibilities include the following. Other duties may be assigned.
    Responsible for design and development activities related to hardware system and broader solution validation methodologies.
    Assure product quality by designing testing methods, testing both samples and finished products, and confirming manufacturing, assembly, and installation processes.
    Conduct functional testing, compatibility testing, OS testing, reliability testing, performance testing, benchmarking for AI/ML workloads, servers, GPU configurations, and HPC environments.
    Assist product managers in defining and planning releases, and support product design, development activities, regulatory compliance, and product lifecycle management.
    Analyze results to identify bottlenecks and optimize system performance for AI/ML workloads.
    Design and configure high-speed network topologies (InfiniBand, Ethernet) for AI clusters.
    Configure network components to ensure optimal performance.
    Write Python scripts to automate testing, monitoring, and system optimization.
    Apply an understanding of AI/ML frameworks (e.g., PyTorch, TensorFlow) and of deployment requirements for LLMs.
    Monitor network health and server performance, proactively identifying and resolving issues.
    Conduct performance testing and application verification to ensure system reliability and efficiency.
    Collaborate with internal teams to improve system compatibility, reliability and scalability.
    Communicate effectively with customers to identify and resolve issues, answer questions, and receive feedback.

Requirements

Required:
•    Education:
    Bachelor's or Master's degree in Electrical Engineering, Computer Engineering, Computer Science, or a related field.
•    Experience:
    At least 5 years of work-related experience in functional testing, OS compatibility testing, performance testing, system optimization, and HPC environments.
    Hands-on experience with cloud environments, including but not limited to Docker/Containers and Kubernetes.
    Hands-on experience with workload managers and schedulers for clusters.
    Experience with server/network hardware debugging and troubleshooting.
•    Skills:
    Language: Proficiency in English and Mandarin.
    Proficiency in Linux and Kubernetes (K8s) system administration, including cluster setup and management.
    Familiarity with CUDA and GPU configurations for AI/ML performance optimization.
    Familiarity with MLPerf Training/Inference benchmarks, LLMs, HPL-AI, or RCCL/NCCL.
    Familiarity with NVIDIA/AMD/Intel development toolkits such as CUDA, ROCm, and oneAPI is a plus.
    Knowledge of advanced GPU architectures and accelerators.
    Ability to write Windows and Linux shell scripts.
    In-depth knowledge of high-speed networking (e.g., InfiniBand, Ethernet) and related technologies and architecture.
    Understanding of AI/ML frameworks such as PyTorch, TensorFlow, and deployment requirements for large language models (LLMs).
    Ability to conduct performance testing and benchmarking for servers, GPUs, and HPC systems.
    Capability to design, configure, and troubleshoot network topologies and components.
    Debugging, problem-solving, and monitoring skills for issue resolution.
    Strong teamwork and communication skills.
Preferred:
•    Experience: 
    Experience in deploying and managing AI server environments at scale.
    Familiarity with data center infrastructure, generative AI, LLMs, and ML software infrastructure and architecture.

Competencies
    Acer-RCM-Adapting and Coping
    Acer-RCM-Analysing and Interpreting
    Acer-RCM-Creating and Conceptualising
    Acer-RCM-Enterprising and Performing
    Acer-RCM-Interacting and Presenting
    Acer-RCM-Leading and Deciding
    Acer-RCM-Organising and Executing
    Acer-RCM-Supporting and Co-operating