Training Infrastructure Engineer

리얼월드

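Rolling recruitment

Posted May 16

Experience

3 to 10 years

Work location

Other

Education

No education requirement

Employment type

Full-time (permanent)

Job category

DevOps/SRE, Software Engineer, Hardware Engineer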

Key Responsibilities

We are seeking an Infrastructure Engineer to design, build, and operate large-scale distributed training systems for next-generation Robotics Foundation Models. In this role, you will develop software solutions to efficiently manage and operate high-performance GPU infrastructure and data pipelines, enabling scalable and reliable model training environments. You will play a key role in building and optimizing large-scale GPU-powered distributed training infrastructure to accelerate Robotics AI development.

Join us in shaping the core infrastructure that will power the future of Robotics AI.

Management and Operation of Large-Scale GPU Clusters
Design, maintain, and scale architecture for distributed training environments.
Monitor GPU nodes, optimize resource scheduling, and ensure system reliability.

Data Pipeline Design and Management
Build scalable data pipelines for processing large-scale datasets used in Robotics AI training.
Handle preprocessing, storage architecture, and distributed file systems.

Integration and Operation of Distributed Training Frameworks
Configure distributed learning environments using PyTorch, TensorFlow, etc.
Optimize performance using tools such as Horovod, NCCL, and DeepSpeed.

System Automation and Monitoring Solutions
Develop real-time monitoring and alerting systems for GPU utilization, training metrics, and system health.
Implement CI/CD pipelines to automate deployment and improve cluster operation efficiency.

Performance Tuning and Optimization
Analyze and optimize training speed and resource utilization at the system level.
Improve network, storage I/O, and GPU performance across the stack.

Qualifications

3+ Years of Large-Scale Server Infrastructure Experience
Prior experience managing GPU clusters or distributed systems is preferred.

Proficiency with Cloud and Container Technologies
Experience operating services in containerized environments using Docker and Kubernetes.
Infrastructure design and management experience on AWS, GCP, or Azure.

Understanding of Distributed Training Environments
Hands-on experience with distributed training using PyTorch, TensorFlow, etc.
Familiarity with distributed libraries like Horovod, NCCL.

Linux System Administration and Automation Skills
Experience with shell scripting and Python for automating operations and monitoring.
Skilled in system performance tuning and diagnostics.

Problem-Solving and Communication Skills
Ability to collaborate with research and data engineering teams and resolve technical issues effectively.

Preferred Qualifications

Experience with Robotics or Autonomous Driving Projects
Experience handling robotic sensor data (e.g., RGB-D, LiDAR) in training environments.

Experience Operating HPC Clusters
Familiarity with cluster management tools such as Slurm, PBS, or Torque.
Knowledge of high-speed interconnects such as InfiniBand and NVLink.

Large-Scale Data Pipeline Architecture
Experience building data lakes or lakehouse platforms.
Proficiency in tools like Apache Spark for large-scale data processing.

Experience with Infrastructure as Code (IaC)
Hands-on experience with Terraform, Ansible, Chef, or other IaC tools for managing large-scale infrastructure.

System Security and Access Control
Experience implementing access control and security policies for GPU nodes and cluster environments.