Summary
Senior systems architect and engineer specializing in the design,
build, and operation of large-scale, high-performance GPU clusters for
AI and machine learning workloads. Proven track record of architecting
and troubleshooting large-scale systems for both internal R&D and
major hyperscale customers.
Professional Experience
NVIDIA, 2022 - Present: Architecture of Future GPU Clusters
Director of Next-Gen Cluster Architecture, 2025 - Present
Senior Manager, 2023 - 2025
Principal Systems Software Engineer, 2022 - 2023
- Led architecture definition for future GPU clusters based on the
Grace-Hopper, Grace-Blackwell, and Vera-Rubin platforms
- Developed reference designs for clusters based on DGX B200, GB200
NVL36x2, GB200 NVL72, and GB300 NVL72 products
- Published internal reference architectures to provide a validated
design for both internal and customer builds
- Designed and led proof-of-concept projects for next-generation
compute fabric architectures, including initial deployments for XDR
InfiniBand and multi-plane SpectrumX (RoCE) Ethernet fabrics
- Developed a systematic approach to the design of new architectures,
streamlining development for upcoming builds
- Initiated a new team focused on cluster architecture and datacenter
design, and grew that team from an initial group of 4 engineers to a
team of 10
- Developed automation and performed at-scale troubleshooting for Eos,
a 4,000+ GPU DGX H100 SuperPOD ranked at #9 on the Top500 list (Nov
2023)
- Collaborated directly with key customers on hyperscale AI training
clusters based on NVIDIA reference architectures
- Track chair for datacenter and cloud papers at NTECH 2025, an
internal technology conference
NVIDIA, 2019 - 2022: DGX SuperPOD and Customer Builds
Senior Product Architect, 2021 - 2022
- Collaborated closely with enterprise product management teams for
large-scale HPC projects
- Led design and project management for initial customer deployments
of NVIDIA Base Command Manager, a tool for cluster deployment and
system management
- Developed automation and content for early NVIDIA Launchpad
projects
Senior Solutions Architect, 2019 - 2021
- Customer-facing systems architect for design of supercomputing
systems
- Systems software lead for several initial customer deployments of
DGX A100 SuperPOD, including developing the deployment automation system
for these clusters
- Automation and hardware deployment for Selene, a 4,000+ GPU DGX A100
SuperPOD ranked at #5 on the Top500 list (June 2020)
- Technical lead for multiple RFP responses for supercomputing
projects at Department of Energy sites
- Technical lead for initial customer deployments of DRIVE
Constellation, a datacenter product for validation of autonomous vehicle
software stacks
- System management of multiple internal clusters geared toward
benchmarking and application development
Facebook, 2017 - 2019: Exabyte-Scale Blob Storage
Senior Production Engineer
- Production Engineering lead for metadata services of Facebook
Tectonic, an exabyte-scale distributed filesystem used for data
warehouse and blob storage applications (known internally as “warm
storage”)
- Migrated Tectonic control plane services to shared container
management platform, reducing usage of high-cost instances by >60%
and reducing on-call burden for hardware maintenance tasks
- Developed and deployed automated capacity planning tools based on
actual and projected traffic requirements
- Analyzed and improved database replication strategy for metadata
services, mitigating an impactful class of production incidents
- Characterized the performance of the Tectonic storage service on a
next-generation storage server (“Bryce Canyon”)
- Onboarded and mentored several new team members, including both new
college grads and senior hires
- Organized multiple team events for Seattle Production Engineering
team
Los Alamos National Laboratory, 2014 - 2017: Top-10 Supercomputing
HPC Cluster Administrator
- Production team lead for initial deployment of Trinity, a 20,000-node
Cray XC40 supercomputer ranked at #6 on the Top500 list (Nov 2015)
- Developed system automation for configuration and production
operations of LANL capability systems, including leading a migration
from Cfengine to Ansible
- Coordinator for production change control process for LANL HPC
systems
- Derivative classifier (Q clearance) for review of HPC publications,
including conference presentations and internal reports
HPC/Cloud Systems Engineer
- Supported an early internal HPC development platform for the
Solution Architect team, including ongoing hardware and software
refresh
- Collaborated closely with developer technology engineers and
third-party developers to support specific hardware tests
- Support for customer evaluation workloads on a dedicated
NVIDIA-hosted HPC platform
- Developed internal and cloud-hosted training platforms for customer
tutorials, including at conferences such as Supercomputing and GTC
R Systems NA, 2010 - 2012: Custom HPC Clusters
Systems Engineer
- Designed, deployed, and supported hosted HPC clusters for industrial
customers
- Developed custom solutions to support particular customer workloads,
including climate simulations, financial modeling, and Formula One
racing
- Experience included a wide variety of software stacks, including Red
Hat Enterprise Linux, Ubuntu, and Windows HPC Server 2008
Skills and Technologies
- Technical leader with experience planning large, complex projects
and setting technical direction, both as a manager and as a senior/staff
individual contributor
- Software development experience focused on development of
operational tooling and infrastructure services, primarily using Python,
Go, and C++
- Scientific software and data analysis experience using Python,
Julia, Matlab, and Fortran
- Debugging and hobby projects for applications written in Rust,
Clojure, and OCaml
- Experienced SRE responsible for large-scale distributed systems
(10,000+ nodes) including compute clusters, distributed storage
systems, and microservice-based architectures
- Familiar with a wide variety of bare-metal deployment toolkits based
on core principles of network boot (PXE) followed by either stateless
operation with RAM-based root, or installation of the OS on a local disk
- Tools include Canonical MaaS, NVIDIA BCM, Warewulf, Perceus, xCAT,
Cobbler, …
- Configuration management using tools such as Ansible, Cfengine,
Chef, or Puppet to converge a running system to a desired state
- Deployment and performance tuning of parallel and distributed
filesystems such as Lustre, Ceph, HDFS, and S3-compatible object
storage
- Extensive experience with the design and operation of complex
monitoring and observability systems, both using custom implementations
and with open-source tooling such as Prometheus, Grafana, Loki, and the
ELK stack
Education
M.S., University of Illinois at Urbana-Champaign, Materials Science
and Engineering.
Thesis title: Fabrication, dynamics, and self-assembly of anisotropic
colloidal particles.
B.S., Michigan Technological University, Physics.
Minors in Mathematics and Electronic Materials.
Selected Publications
- A. DeConinck, “Architecting and deploying compute clusters for large
language models”, 2nd Workshop on Advancing Neural Network
Training, International Conference on Machine Learning,
July 2024.
- A. DeConinck, A. Bonnie, K. Kelly, S. Sanchez, C. Martin, M. Mason,
J. Brandt, A. Gentile, B. Allan, A. Agelastos, M. Davis and M. Berry.
“Design and implementation of a scalable monitoring system for Trinity”,
Proc. Cray User Group, May 2016.
- S. Sanchez, A. Bonnie, G. Van Huele, C. Robinson, A. DeConinck, K.
Kelly, Q. Snead and J. Brandt, “Design and Implementation of a Scalable
HPC Monitoring System”, Workshop on Monitoring and Analysis for High
Performance Computing Systems Plus Applications (HPCMASPA), 2016 IEEE
International Parallel and Distributed Processing Symposium (IPDPS),
May 2016.
- A. DeConinck and K. Kelly. “Evolution of Monitoring Over the
Lifetime of a High Performance Computing Cluster”. Workshop on
Monitoring and Analysis for High Performance Computing Systems Plus
Applications (HPCMASPA), 2015 IEEE International Conference on Cluster
Computing (CLUSTER), September 2015.
- A. J. DeConinck. “Tools and Tips for Managing a GPU Cluster”.
GPU Technology Conference, March 2014.