Principal Architect, GPU Tools & Diagnostics
Lenovo
Description and Requirements
Lenovo's Infrastructure Solutions Group (ISG) is seeking an experienced platform RAS, diagnostics, and software architect to define, design, and implement RAS and monitoring solutions for GenAI servers. The ideal candidate will have a deep understanding of GPU and ARM architectures, RAS principles, and system monitoring considerations.
Job Responsibilities:
- Lead the architecture, design, and development of GPU tools for diagnostics, debugging, and performance analysis across a variety of GPU hardware and software environments.
- Review current and future technology roadmaps to identify serviceability and supportability requirements for GPU tool development.
- Work closely with cross-functional teams, including GPU hardware engineers, software developers, and system architects, to integrate diagnostic tools into GPU subsystems, identify and troubleshoot complex issues, and optimize serviceability.
- Work with our technology partners to understand their products and support capabilities, enhance support for partners and Lenovo's end customers, and facilitate support knowledge transfer to the support organization.
- Establish a win-win relationship with technology partners to identify and drive operational (vs. design validation) support solutions that address the needs of end customers.
- Conduct market research and competitive support benchmarking.
- Keep up with industry standards and best practices for serviceability and supportability of new technologies.
- Write scripts or code to automate serviceability tasks and develop diagnostic tools.
Basic Requirements:
- 10+ years of relevant industry experience in GPU or CPU architecture and networking protocols (or other equivalent experience).
- 5+ years of experience in designing software and firmware for various compute environments.
- Ph.D. or MS in Computer Science, Electrical Engineering, or Computer Engineering, or equivalent experience.
Preferred Requirements:
- Expertise in designing and building debugging and diagnostic tools for CPU/GPU subsystems.
- Background in computer architecture, graphics algorithms, and parallel processing.
- Comprehensive knowledge of server components, operating systems, and network protocols.
- Strong hands-on programming skills in C, C++, Perl, and Python, plus experience with GPU programming languages and APIs such as CUDA, OpenCL, or Vulkan.
- Expertise in current data center operating systems and software development methodologies.
- Proficiency in high performance computing, DDR, PCIe, and communication protocols such as Redfish, I2C, SPI, and MDIO.
- Strong technical documentation skills and excellent written and verbal