DevOps·SRE
Full-time
3 to 10 years of experience
Hiring at 2314

Training Infrastructure Engineer

리얼월드
Views: 138
Rolling recruitment
Posted May 16
Experience
3 to 10 years
Location
Other
Education
No degree requirement
Employment type
Full-time
Job category
DevOps·SRE, Software Engineer, Hardware Engineer

Key Responsibilities

We are seeking an Infrastructure Engineer to design, build, and operate large-scale distributed training systems for next-generation Robotics Foundation Models. In this role, you will develop software to manage high-performance GPU infrastructure and data pipelines efficiently, providing scalable and reliable model training environments that accelerate Robotics AI development.

Join us in shaping the core infrastructure that will power the future of Robotics AI.


  • Management and Operation of Large-Scale GPU Clusters
      • Design, maintain, and scale architecture for distributed training environments.
      • Monitor GPU nodes, optimize resource scheduling, and ensure system reliability.
  • Data Pipeline Design and Management
      • Build scalable data pipelines for processing large-scale datasets used in Robotics AI training.
      • Handle preprocessing, storage architecture, and distributed file systems.
  • Integration and Operation of Distributed Training Frameworks
      • Configure distributed learning environments using PyTorch, TensorFlow, etc.
      • Optimize performance using frameworks and libraries such as Horovod, NCCL, and DeepSpeed.
  • System Automation and Monitoring Solutions
      • Develop real-time monitoring and alerting systems for GPU utilization, training metrics, and system health.
      • Implement CI/CD pipelines to automate deployment and improve cluster operation efficiency.
  • Performance Tuning and Optimization
      • Analyze and optimize training speed and resource utilization at the system level.
      • Improve network, storage I/O, and GPU performance across the stack.

Qualifications

  • 3+ Years of Large-Scale Server Infrastructure Experience
      • Prior experience managing GPU clusters or distributed systems is preferred.
  • Proficiency with Cloud and Container Technologies
      • Experience operating services in containerized environments using Docker and Kubernetes.
      • Infrastructure design and management experience on AWS, GCP, or Azure.
  • Understanding of Distributed Training Environments
      • Hands-on experience with distributed training using PyTorch, TensorFlow, etc.
      • Familiarity with distributed libraries such as Horovod and NCCL.
  • Linux System Administration and Automation Skills
      • Experience with shell scripting and Python for automating operations and monitoring.
      • Skilled in system performance tuning and diagnostics.
  • Problem-Solving and Communication Skills
      • Ability to collaborate with research and data engineering teams and resolve technical issues effectively.

Preferred Qualifications

  • Experience with Robotics or Autonomous Driving Projects
      • Experience handling robotic sensor data (e.g., RGB-D, LiDAR) in training environments.
  • Experience Operating HPC Clusters
      • Familiarity with cluster management tools such as Slurm, PBS, and Torque.
      • Knowledge of high-speed interconnects such as InfiniBand and NVLink.
  • Large-Scale Data Pipeline Architecture
      • Experience building data lakes or lakehouse platforms.
      • Proficiency in tools such as Apache Spark for large-scale data processing.
  • Experience with Infrastructure as Code (IaC)
      • Hands-on experience with Terraform, Ansible, Chef, or other IaC tools for managing large-scale infrastructure.
  • System Security and Access Control
      • Experience implementing access control and security policies for GPU nodes and cluster environments.