Summary
Senior systems architect and engineer specializing in the design,
build, and operation of large-scale, high-performance GPU clusters for
AI and machine learning workloads. Proven track record of architecting
and troubleshooting large-scale systems for both internal R&D and
major hyperscale customers.
Professional Experience
Principal Engineer/Director, NVIDIA, 2022 - Present
- Led the architecture for multiple cluster designs based on Grace
Hopper, Grace Blackwell, and Vera Rubin platforms
- Designed next-gen network fabrics based on InfiniBand, RoCE, and
NVLink
- Led technical collaboration with key customers on hyperscale AI
training and inference clusters based on NVIDIA reference
architectures
- Developed a systematic approach to the design of new architectures
to speed iteration and time to deployment
- Developed automation and performed at-scale troubleshooting for Eos,
a 4,000+ GPU DGX H100 SuperPOD
Senior Solutions Architect, NVIDIA, 2019 - 2022
- Customer-facing systems architect for design of supercomputing
systems
- Automation and deployment for Selene, a 4,000+ GPU DGX A100
SuperPOD
- Systems software lead for several early SuperPOD deployments,
including developing the automation system for these clusters
- Technical lead for multiple RFP responses for supercomputing
projects at Department of Energy sites
Senior Production Engineer, Facebook, 2017 - 2019
- Production Engineering lead for metadata services of Tectonic,
Facebook’s exabyte-scale distributed filesystem
- Migrated Tectonic control plane services to a shared container
management platform, freeing >60% of high-cost instances and reducing
on-call burden
- Developed and deployed automated capacity planning tools based on
actual and projected traffic requirements
- Characterized the performance of Tectonic storage service on a
next-generation storage server (“Bryce Canyon”)
- Onboarded and mentored several new team members, from interns to
senior engineers
HPC Cluster Administrator, Los Alamos National Laboratory, 2014 - 2017
- Production team lead for deployment of Trinity, a 20,000-node Cray
XC40 supercomputer
- Developed system automation for configuration and production
operations of LANL capability systems, including leading a migration
from CFEngine to Ansible
HPC/Cloud Systems Engineer, NVIDIA, 2012 - 2014
- Supported an early internal HPC development platform for the
Solution Architect team, including ongoing hardware and software
refresh
- Developed an internal and cloud-hosted training platform for customer
tutorials, including at conferences such as Supercomputing and GTC
Systems Engineer, R Systems NA, 2010 - 2012
- Developed custom cluster designs to support particular customer
workloads, including climate simulations, financial modeling, and
Formula One racing
Education
M.S., University of Illinois at Urbana-Champaign, Materials Science and
Engineering.
Thesis title: Fabrication, dynamics, and self-assembly of anisotropic
colloidal particles.
B.S., Michigan Technological University, Physics.
Minors in Mathematics and Electronic Materials.
Skills and Technologies
- HPC/ML infrastructure:
  - Compute: NVIDIA DGX SuperPOD, HGX, and GB200/GB300
  - Networking: InfiniBand (NDR, XDR), RoCE (NVIDIA Spectrum-X),
    NVLink
  - Storage: Lustre, Ceph, HDFS, S3-compatible, Tectonic
  - Workload Mgmt: Kubernetes, Slurm, PBS/Torque/Moab
- Programming & Automation:
  - Languages: Python, Go (Production); Rust, OCaml (Hobby)
  - Automation: Ansible, Terraform, CFEngine, AWS, Azure