Advisory AI Infrastructure Engineer
Software Engineering, Other Engineering, Data Science
Edinburgh, UK
Description and Requirements
Lenovo is seeking an accomplished Advisory AI Infrastructure Engineer to take a leadership role within our Advanced AI Technology Center. In this position, you will architect, build, and evolve the large-scale infrastructure and platforms that underpin AI model development, deployment, and operation across the organization. You will serve as a technical authority on infrastructure strategy, mentor junior engineers, drive best practices, and collaborate cross-functionally with research, engineering, and product teams. Your deep expertise will be instrumental in scaling our AI capabilities and ensuring production-grade reliability, performance, and security. If you are passionate about making Smarter Technology For All, come help us realize our Hybrid AI vision!
Responsibilities
- AI Infrastructure Strategy and Architecture: Define and drive the long-term infrastructure strategy for AI workloads. Architect scalable, cost-efficient, and resilient compute, storage, and networking solutions that support the full AI lifecycle from experimentation through production.
- Technical Leadership and Mentorship: Serve as a technical lead and mentor to infrastructure engineers, establishing engineering standards, conducting design reviews, and fostering a culture of operational excellence across the team.
- AI Model Deployment and Platform Engineering: Design and own the platforms, frameworks, and processes for deploying, monitoring, and managing AI models at scale in production environments, including model serving, A/B testing infrastructure, and rollback mechanisms.
- Advanced Automation and CI/CD: Architect and maintain sophisticated automation pipelines for AI model training, evaluation, testing, and deployment. Champion infrastructure-as-code practices and drive continuous improvement of CI/CD workflows.
- Cross-Functional Collaboration: Partner with data scientists, ML engineers, product managers, and leadership to align infrastructure capabilities with research and business objectives. Act as the primary infrastructure point of contact for cross-team initiatives.
- Performance Engineering: Lead efforts to profile, benchmark, and optimize AI infrastructure and model serving for throughput, latency, GPU utilization, scalability, and cost efficiency.
- Security, Compliance, and Governance: Establish and enforce security best practices, access controls, and compliance frameworks across AI infrastructure. Ensure adherence to relevant regulatory requirements and internal governance policies.
- Capacity Planning and Cost Management: Own capacity forecasting, resource planning, and cost optimization for GPU clusters and associated infrastructure, balancing performance needs with budget constraints.
Qualifications
- Bachelor’s degree in Computer Engineering, Electrical Engineering, Computer Science, or a related field. Master’s or other advanced degree preferred.
- 8+ years of experience in software engineering, infrastructure engineering, DevOps/SRE, or a related field, with at least 4 years focused on AI/ML infrastructure.
- Demonstrated experience leading or mentoring engineering teams on infrastructure projects.
- Deep expertise in computer systems, distributed systems, and cloud computing architectures.
- Extensive experience designing, deploying, and managing multi-node distributed GPU clusters using Slurm, Kubernetes, or equivalent orchestration platforms.
- Expert-level Linux system administration, including package management, user/group management, file system internals, shell scripting (e.g., Bash), networking, and system configuration (e.g., systemd, kernel tuning).
- Strong proficiency in Python and at least one additional systems language (e.g., Go, C++, Rust, Java).
- Deep experience with AI-specific hardware and software stacks (e.g., NVIDIA GPUs, CUDA, cuDNN, NCCL, InfiniBand/RoCE networking).
- Proven track record managing high-performance computing (HPC) environments, including job scheduling, resource allocation, cluster maintenance, and performance tuning.
- Significant experience with AI model deployment, serving infrastructure, and lifecycle management in production.
- Strong architectural thinking with the ability to balance trade-offs across performance, reliability, security, and cost.
- Excellent communication and collaboration skills, with the ability to influence technical decisions across teams and present to senior leadership.
- Ability to thrive in a fast-paced, ambiguous environment and drive clarity through technical leadership.
Bonus Points
- Experience with AI and machine learning frameworks at scale (e.g., PyTorch, DeepSpeed, Megatron-LM, vLLM).
- Hands-on experience with major cloud platforms (e.g., AWS, GCP, Azure) and hybrid cloud architectures.
- Advanced experience with containerization (Docker) and orchestration (Kubernetes), including custom operators, Helm charts, and GPU scheduling plugins.
- Expertise with observability stacks (e.g., Prometheus, Grafana, ELK/OpenSearch, Datadog) for infrastructure and model monitoring.
- Experience with infrastructure-as-code tools (e.g., Terraform, Ansible, Pulumi).
- Contributions to open-source infrastructure or ML tooling projects.
- Experience with large language model (LLM) training and inference infrastructure.