Summary
Senior systems architect and engineer specializing in the design,
build, and operation of large-scale, high-performance GPU clusters for
AI and machine learning workloads. Proven track record of architecting
and troubleshooting large-scale systems for both internal R&D and
major hyperscale customers.
Professional Experience
NVIDIA, 2022 - Present: Architecture of Future GPU Clusters
Director of Next-Gen Cluster Architecture, 2025 - Present
Senior Manager, 2023 - 2025
Principal Systems Software Engineer, 2022 - 2023
- Led architecture definition for future GPU clusters based on the
Grace-Hopper, Grace-Blackwell, and Vera-Rubin platforms
- Developed reference designs for clusters based on DGX B200, GB200
NVL36x2, GB200 NVL72, and GB300 NVL72 products
- Published internal reference architectures to provide a validated
design for both internal and customer builds
- Designed and led proof-of-concept projects for next-generation
compute fabric architectures, including initial deployments for XDR
InfiniBand and multi-plane SpectrumX (RoCE) Ethernet fabrics
- Developed a systematic approach to the design of new architectures,
streamlining development for upcoming builds
- Initiated a new team focused on cluster architecture and datacenter
design, and grew that team from an initial group of 4 engineers to a
team of 10
- Developed automation and performed at-scale troubleshooting for Eos,
a 4,000+ GPU DGX H100 SuperPOD ranked at #9 on the Top500 list (Nov
2023)
- Collaborated directly with key customers on hyperscale AI training
clusters based on NVIDIA reference architectures
- Track chair for datacenter and cloud papers at NTECH 2025, an
internal technology conference
NVIDIA, 2019 - 2022: DGX SuperPOD and Customer Builds
Senior Product Architect, 2021 - 2022
- Collaborated closely with enterprise product management teams for
large-scale HPC projects
- Led design and project management for initial customer deployments
of NVIDIA Base Command Manager, a tool for cluster deployment and
system management
- Developed automation and content for early NVIDIA Launchpad
projects
Senior Solutions Architect, 2019 - 2021
- Customer-facing systems architect for design of supercomputing
systems
- Systems software lead for several initial customer deployments of
DGX A100 SuperPOD, including developing the deployment automation system
for these clusters
- Automation and hardware deployment for Selene, a 4,000+ GPU DGX A100
SuperPOD ranked at #5 on the Top500 list (June 2020)
- Technical lead for multiple RFP responses for supercomputing
projects at Department of Energy sites
- Technical lead for initial customer deployments of DRIVE
Constellation, a datacenter product for validation of autonomous vehicle
software stacks
- System management of multiple internal clusters geared toward
benchmarking and application development
Facebook, 2017 - 2019: Exabyte-Scale Blob Storage
Senior Production Engineer
- Production Engineering lead for metadata services of Facebook
Tectonic, an exabyte-scale distributed filesystem used for data
warehouse and blob storage applications (known internally as “warm
storage”)
- Migrated Tectonic control plane services to shared container
management platform, reducing usage of high-cost instances by >60%
and reducing on-call burden for hardware maintenance tasks
- Developed and deployed automated capacity planning tools based on
actual and projected traffic requirements
- Analyzed and improved database replication strategy for metadata
services, mitigating an impactful class of production incidents
- Characterized the performance of the Tectonic storage service on a
next-generation storage server (“Bryce Canyon”)
- Onboarded and mentored several new team members, including both new
college grads and senior hires
- Organized multiple team events for Seattle Production Engineering
team
Los Alamos National Laboratory, 2014 - 2017: Top-10 Supercomputing
HPC Cluster Administrator
- Production team lead for initial deployment of Trinity, a 20,000-node
Cray XC40 supercomputer ranked at #6 on the Top500 list (Nov 2015)
- Developed system automation for configuration and production
operations of LANL capability systems, including leading a migration
from Cfengine to Ansible
- Coordinator for production change control process for LANL HPC
systems
- Derivative classifier (Q clearance) for review of HPC publications,
including conference presentations and internal reports
HPC/Cloud Systems Engineer
- Supported an early internal HPC development platform for the
Solution Architect team, including ongoing hardware and software
refresh
- Collaborated closely with developer technology engineers and
third-party developers to support specific hardware tests
- Support for customer evaluation workloads on a dedicated
NVIDIA-hosted HPC platform
- Developed internal and cloud-hosted training platforms for customer
tutorials, including at conferences such as Supercomputing and GTC
R Systems NA, 2010 - 2012: Custom HPC Clusters
Systems Engineer
- Designed, deployed, and supported hosted HPC clusters for industrial
customers
- Developed custom solutions to support particular customer workloads,
including climate simulations, financial modeling, and Formula One
racing
- Experience included a wide variety of software stacks, including Red
Hat Enterprise Linux, Ubuntu, and Windows HPC Server 2008
Skills and Technologies
- Technical leader with experience planning large, complex projects
and setting technical direction, both as a manager and as a senior/staff
individual contributor
- Software development experience focused on development of
operational tooling and infrastructure services, primarily using Python,
Go, and C++
- Scientific software and data analysis experience using Python,
Julia, Matlab, and Fortran
- Debugging and hobby projects for applications written in Rust,
Clojure, and OCaml
- Experienced SRE responsible for large-scale distributed systems
(10,000+ nodes) including compute clusters, distributed storage
systems, and microservice-based architectures
- Familiar with a wide variety of bare-metal deployment toolkits based
on core principles of network boot (PXE) followed by either stateless
operation with RAM-based root, or installation of the OS on a local disk
- Tools include Canonical MaaS, NVIDIA BCM, Warewulf, Perceus, xCAT,
Cobbler, …
- Configuration management using tools such as Ansible, Cfengine,
Chef, or Puppet to converge a running system to a desired state
- Deployment and performance tuning of parallel and distributed
filesystems such as Lustre, Ceph, HDFS, and S3-compatible object
storage
- Extensive experience with the design and operation of complex
monitoring and observability systems, both using custom implementations
and with open-source tooling such as Prometheus, Grafana, Loki, and the
ELK stack
Education
M.S., University of Illinois at Urbana-Champaign, Materials Science
and Engineering.
Thesis title: Fabrication, dynamics, and self-assembly of anisotropic
colloidal particles.
B.S., Michigan Technological University, Physics.
Minors in Mathematics and Electronic Materials.
Selected Publications
- A. DeConinck, “Architecting and deploying compute clusters for large
language models”, 2nd Workshop on Advancing Neural Network
Training, International Conference on Machine Learning,
July 2024.
- A. DeConinck, A. Bonnie, K. Kelly, S. Sanchez, C. Martin, M. Mason,
J. Brandt, A. Gentile, B. Allan, A. Agelastos, M. Davis and M. Berry.
“Design and implementation of a scalable monitoring system for Trinity”,
Proc. Cray User Group, May 2016.
- S. Sanchez, A. Bonnie, G. Van Huele, C. Robinson, A. DeConinck, K.
Kelly, Q. Snead and J. Brandt, “Design and Implementation of a Scalable
HPC Monitoring System”, Workshop on Monitoring and Analysis for High
Performance Computing Systems Plus Applications (HPCMASPA), 2016 IEEE
International Parallel and Distributed Processing Symposium (IPDPS),
May 2016.
- A. DeConinck and K. Kelly. “Evolution of Monitoring Over the
Lifetime of a High Performance Computing Cluster”. Workshop on
Monitoring and Analysis for High Performance Computing Systems Plus
Applications (HPCMASPA), 2015 IEEE International Conference on Cluster
Computing (CLUSTER), September 2015.
- A. J. DeConinck. “Tools and Tips for Managing a GPU Cluster”.
GPU Technology Conference, March 2014.