
<div style="font-family: sans-serif; line-height: 1.6; padding: 20px;"><div style="margin-bottom: 40px;"><h2 style="font-size: 22px; margin-bottom: 16px; font-weight: bold; color: #333;">주요업무</h2><div style="color: #333;"><p>We are seeking an <strong>Infrastructure Engineer</strong> to design, build, and operate large-scale distributed training systems for next-generation Robotics Foundation Models. In this role, you will develop software solutions to efficiently manage and operate high-performance GPU infrastructure and data pipelines, enabling scalable and reliable model training environments. You will play a key role in <strong>building and optimizing large-scale GPU-powered distributed training infrastructure</strong> to accelerate Robotics AI development.</p><p><strong>Join us in shaping the core infrastructure that will power the future of Robotics AI.</strong></p><p><br/></p><ul><li class="ql-indent-1"><strong>Management and Operation of Large-Scale GPU Clusters</strong>Design, maintain, and scale architecture for distributed training environments.</li><li class="ql-indent-1">Monitor GPU nodes, optimize resource scheduling, and ensure system reliability.</li><li class="ql-indent-1"><strong>Data Pipeline Design and Management</strong>Build scalable data pipelines for processing large-scale datasets used in Robotics AI training.</li><li class="ql-indent-1">Handle preprocessing, storage architecture, and distributed file systems.</li><li class="ql-indent-1"><strong>Integration and Operation of Distributed Training Frameworks</strong>Configure distributed learning environments using PyTorch, TensorFlow, etc.</li><li class="ql-indent-1">Optimize performance using frameworks like Horovod, NCCL, DeepSpeed.</li><li class="ql-indent-1"><strong>System Automation and Monitoring Solutions</strong>Develop real-time monitoring and alerting systems for GPU utilization, training metrics, and system health.</li><li class="ql-indent-1">Implement CI/CD pipelines to automate deployment and improve cluster operation efficiency.</li><li class="ql-indent-1"><strong>Performance Tuning and Optimization</strong>Analyze and optimize training speed and resource utilization at the system level.</li><li class="ql-indent-1">Improve network, storage I/O, and GPU performance across the stack.</li></ul></div></div><div style="margin-bottom: 40px;"><h2 style="font-size: 22px; margin-bottom: 16px; font-weight: bold; color: #333;">자격요건</h2><div style="color: #333;"><ul><li class="ql-indent-1"><strong>3+ Years of Large-Scale Server Infrastructure Experience</strong>Prior experience managing GPU clusters or distributed systems is preferred.</li><li class="ql-indent-1"><strong>Proficiency with Cloud and Container Technologies</strong>Experience operating services in containerized environments using Docker and Kubernetes.</li><li class="ql-indent-1">Infrastructure design and management experience on AWS, GCP, or Azure.</li><li class="ql-indent-1"><strong>Understanding of Distributed Training Environments</strong>Hands-on experience with distributed training using PyTorch, TensorFlow, etc.</li><li class="ql-indent-1">Familiarity with distributed libraries like Horovod, NCCL.</li><li class="ql-indent-1"><strong>Linux System Administration and Automation Skills</strong>Experience with shell scripting and Python for automating operations and monitoring.</li><li class="ql-indent-1">Skilled in system performance tuning and diagnostics.</li><li class="ql-indent-1"><strong>Problem-Solving and Communication Skills</strong>Ability 
to collaborate with research and data engineering teams and resolve technical issues effectively.</li></ul></div></div><div style="margin-bottom: 40px;"><h2 style="font-size: 22px; margin-bottom: 16px; font-weight: bold; color: #333;">우대사항</h2><div style="color: #333;"><ul><li class="ql-indent-1"><strong>Experience with Robotics or Autonomous Driving Projects</strong>Experience handling robotic sensor data (e.g., RGB-D, LiDAR) in training environments.</li><li class="ql-indent-1"><strong>Experience Operating HPC Clusters</strong>Familiarity with cluster management tools like Slurm, PBS, Torque.</li><li class="ql-indent-1">Knowledge of high-speed interfaces such as Infiniband and NVLink.</li><li class="ql-indent-1"><strong>Large-Scale Data Pipeline Architecture</strong>Experience building data lakes or lakehouse platforms.</li><li class="ql-indent-1">Proficiency in tools like Apache Spark for large-scale data processing.</li><li class="ql-indent-1"><strong>Experience with Infrastructure as Code (IaC)</strong>Hands-on experience with Terraform, Ansible, Chef, or other IaC tools for managing large-scale infrastructure.</li><li class="ql-indent-1"><strong>System Security and Access Control</strong>Experience implementing access control and security policies for GPU nodes and cluster environments.</li></ul></div></div></div>
Key Responsibilities
We are seeking an Infrastructure Engineer to design, build, and operate large-scale distributed training systems for next-generation Robotics Foundation Models. In this role, you will develop software solutions to efficiently manage and operate high-performance GPU infrastructure and data pipelines, enabling scalable and reliable model training environments. You will play a key role in building and optimizing large-scale GPU-powered distributed training infrastructure to accelerate Robotics AI development.
Join us in shaping the core infrastructure that will power the future of Robotics AI.
- Management and Operation of Large-Scale GPU Clusters: Design, maintain, and scale architecture for distributed training environments.
- Monitor GPU nodes, optimize resource scheduling, and ensure system reliability.
- Data Pipeline Design and Management: Build scalable data pipelines for processing large-scale datasets used in Robotics AI training.
- Handle preprocessing, storage architecture, and distributed file systems.
- Integration and Operation of Distributed Training Frameworks: Configure distributed training environments using PyTorch, TensorFlow, etc. (see the sketch after this list).
- Optimize performance using libraries and frameworks such as Horovod, NCCL, and DeepSpeed.
- System Automation and Monitoring Solutions: Develop real-time monitoring and alerting systems for GPU utilization, training metrics, and system health.
- Implement CI/CD pipelines to automate deployment and improve cluster operation efficiency.
- Performance Tuning and Optimization: Analyze and optimize training speed and resource utilization at the system level.
- Improve network, storage I/O, and GPU performance across the stack.
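As a rough illustration of the distributed training work described above (a sketch only, not a prescribed implementation; the model, data, and hyperparameters below are placeholders), a minimal PyTorch DistributedDataParallel setup with the NCCL backend looks roughly like this:

```python
# Minimal sketch of a PyTorch DistributedDataParallel (DDP) setup using the NCCL
# backend. The model and data are placeholders for illustration only.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    # Placeholder model; a real job would build the actual training model here.
    model = torch.nn.Linear(1024, 1024).to(device)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        # Placeholder batch; a real job would use a DistributedSampler-backed DataLoader.
        x = torch.randn(32, 1024, device=device)
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across ranks via NCCL
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A job like this would typically be launched with `torchrun --nnodes=<nodes> --nproc_per_node=<gpus per node> train.py`, which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables consumed by `init_process_group`.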
Qualifications
- 3+ Years of Large-Scale Server Infrastructure Experience: Prior experience managing GPU clusters or distributed systems is preferred.
- Proficiency with Cloud and Container Technologies: Experience operating services in containerized environments using Docker and Kubernetes.
- Infrastructure design and management experience on AWS, GCP, or Azure.
- Understanding of Distributed Training Environments: Hands-on experience with distributed training using PyTorch, TensorFlow, etc.
- Familiarity with distributed training libraries such as Horovod and NCCL.
- Linux System Administration and Automation Skills: Experience with shell scripting and Python for automating operations and monitoring (see the sketch after this list).
- Skilled in system performance tuning and diagnostics.
- Problem-Solving and Communication Skills: Ability to collaborate with research and data engineering teams and resolve technical issues effectively.
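As a rough illustration of the Python-based automation and monitoring mentioned above (a sketch only; it assumes the NVIDIA driver and the nvidia-ml-py/pynvml bindings are installed, and the polling interval and output format are placeholders), a minimal GPU utilization poller could look like this:

```python
# Minimal sketch of a GPU utilization poller using NVML via the pynvml bindings.
# Assumes the NVIDIA driver and the nvidia-ml-py package are installed; in a real
# deployment these samples would typically be exported to a metrics backend
# rather than printed.
import time

import pynvml


def poll_gpus(interval_s: float = 10.0) -> None:
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        while True:
            for i in range(count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                print(
                    f"gpu={i} util={util.gpu}% "
                    f"mem_used={mem.used / 2**30:.1f}GiB/{mem.total / 2**30:.1f}GiB"
                )
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    poll_gpus()
```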
Preferred Qualifications
- Experience with Robotics or Autonomous Driving Projects: Hands-on work with robotic sensor data (e.g., RGB-D, LiDAR) in training environments.
- Experience Operating HPC Clusters: Familiarity with workload managers such as Slurm, PBS, or Torque (see the sketch after this list).
- Knowledge of high-speed interconnects such as InfiniBand and NVLink.
- Large-Scale Data Pipeline Architecture: Experience building data lakes or lakehouse platforms.
- Proficiency in tools like Apache Spark for large-scale data processing.
- Experience with Infrastructure as Code (IaC): Hands-on experience with Terraform, Ansible, Chef, or other IaC tools for managing large-scale infrastructure.
- System Security and Access Control: Experience implementing access control and security policies for GPU nodes and cluster environments.
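As a rough illustration of the Slurm-oriented tooling mentioned above (a sketch only; it assumes the Slurm client tools are on PATH, and the selected fields and simple parsing are illustrative), a small Python wrapper for inspecting the job queue could look like this:

```python
# Minimal sketch of querying a Slurm job queue from Python via the standard `squeue` CLI.
# Assumes Slurm client tools are installed and on PATH; the chosen format fields and
# whitespace parsing are for illustration only.
import subprocess


def list_jobs():
    # -h drops the header; -o selects job id, user, state, elapsed time, node count,
    # and job name (name last so embedded spaces do not break parsing).
    out = subprocess.run(
        ["squeue", "-h", "-o", "%i %u %T %M %D %j"],
        capture_output=True, text=True, check=True,
    ).stdout
    jobs = []
    for line in out.splitlines():
        if not line.strip():
            continue
        job_id, user, state, elapsed, nodes, name = line.split(None, 5)
        jobs.append({"id": job_id, "user": user, "state": state,
                     "elapsed": elapsed, "nodes": nodes, "name": name})
    return jobs


if __name__ == "__main__":
    for job in list_jobs():
        print(job)
```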







