AI Technologies
PyTorch
ML
About the Role
WHAT YOU'LL DO
- Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack, from data pipelines to GPU kernels
- Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability, robustness, and high utilization
- Implement efficient low-level code (CUDA, cuDNN, Triton, custom kernels) and integrate it seamlessly into high-level training frameworks
- Optimize workloads for hardware efficiency: CPU/GPU compute balance, memory management, data throughput, and networking
- Develop monitoring and debugging tools for large-scale runs, enabling rapid diagnosis of performance regressions and failures

WHAT YOU'LL BRING
- Deep experience in distributed systems, ML infrastruc
Requirements