Sovereign GPU Pipeline: The Vendor Replacement Story

Vendor replacement story — what is done, what is next, the sovereign stack

What we have already replaced, what we’re replacing next, and what the end state looks like — a pure Rust scientific GPU stack with zero proprietary dependencies.

Last Updated: March 17, 2026
License: CC-BY-SA 4.0


The Claim

Four public primals — BarraCuda (math), ToadStool (orchestration), coralReef (compiler), and coral-glowplug (hardware lifecycle) — together replace the NVIDIA CUDA toolchain for scientific computing. Not all of it yet. But a clear, advancing front that already produces paper-parity lattice QCD on a $500 consumer GPU.

This is not a research prototype. These are production-grade primals with 107,000+ combined test functions across 3.2 million lines of Rust, zero unsafe code, and zero C dependencies in application code.


What’s Already Replaced

1. CUDA Runtime → toadStool + wgpu/Vulkan

CUDA ComponentReplacementStatusTests
cudaGetDeviceProperties()toadStool hardware discovery (multi-adapter, multi-vendor)Production21,156
cudaSetDevice()Capability-based selection (f64 probe, VRAM, workgroup limits)Production
cudaMalloc/cudaFreewgpu buffer management (BarraCuda GuardedDeviceHandle)Production
cudaMemcpywgpu buffer map/unmap with stagingProduction
cudaLaunchKernelqueue.submit() with WGSL compute shadersProduction
cudaDeviceSynchronize()device.poll(Maintain::Wait)Production
Device enumerationtoadstool-sysmon (pure Rust /proc, no sysinfo crate)Production

Key difference: toadStool discovers hardware at runtime by capability, not by vendor ID. The same code discovers NVIDIA, AMD, Intel, and BrainChip NPU. There is no concept of “CUDA device 0” — there is “the device that supports f64 and has >8GB VRAM.”

2. cuBLAS / cuFFT / cuDNN → BarraCuda WGSL Shaders

LibraryBarraCuda EquivalentShadersParityGap
cuBLAS (GEMM)GemmF64, BatchedGemmF64, DF64 GEMM40+Near parity (3.7× Kokkos gap, down from 27×)Throughput on large matrices
cuFFT1D/2D/3D FFT, NTT, INTT20+Full parity for science opsNo cuFFT callback equiv
cuDNN (basic)Conv2D, pooling, attention, softmax, LayerNorm30+Partial — science ML opsNo full ML framework
cuSPARSESpMV, SpMM10+Science opsNot general-purpose
cuRANDLCG, Mersenne Twister5+Full parity
cuSOLVEREigensolve, LU, Cholesky15+Full parity for f64
ThrustReduction, scan, sort20+Full parity

806 WGSL shaders total — every one is f64-canonical (native f64 on pro GPUs, DF64 emulation on consumer GPUs). Key domains:

DomainShader CountExample Operations
Linear algebra80+GEMM, eigensolve, SVD, LU, Cholesky
Statistics60+Welford, Pearson, bootstrap, jackknife
Signal processing40+FFT, convolution, Savitzky-Golay, CWT
Bioinformatics94+Diversity, alignment, phylogeny, DADA2
Physics70+MD, spectral, Anderson, transport
Pharmacometrics30+Hill, PBPK, PopPK, ODE systems
ML primitives50+Attention, GELU, LayerNorm, softmax
Precision40+DF64 arithmetic, transcendentals

3. nvcc / ptxas → coralReef Sovereign Compiler

NVIDIA ToolcoralReef ReplacementStatusEvidence
nvcc (CUDA→PTX)naga WGSL→SPIR-V + custom loweringProduction2,241 tests
ptxas (PTX→SASS)Pure Rust SPIR-V→SASS codegenProduction46/46 shaders compile to SM70/SM86
NVVM (SPIR-V→PTX)Bypassed — 12/12 NVVM poisoning patterns solvedSovereignf64 transcendentals, DF64, FMA
libnvidia-compilerZero dependency — entire compile path is pure RustSovereign#![forbid(unsafe_code)] on glowplug

What “46/46 shaders compile” means: 46 representative barraCuda shaders — covering every domain (bio, physics, ML, linear algebra) — compile from WGSL to native SASS (SM70 Volta, SM86 Ampere) and native RDNA2 (GFX1030) without ANY vendor toolchain. No nvcc, no ptxas, no ROCm. Pure Rust compiler.

4. NVIDIA Kernel Driver → coral-glowplug + VFIO

Driver LayerSovereign ReplacementStatus
nvidia.ko (kernel module)VFIO-pci (upstream Linux kernel)Production
nvidia-drm (display)Not needed (compute-only VFIO)Bypassed
libnvidia-glcoreNot needed (Vulkan/wgpu path)Bypassed
nvidia-uvm (unified memory)Direct VRAM R/W via BAR0Validated (24/26 tests)
Device lifecyclecoral-glowplug (systemd daemon, JSON-RPC)Production
Boot bindingVFIO-first boot (before display manager)Production
Power managementD3hot→D0 sovereign recovery, HBM2 BIOS-trained VRAM survivesValidated
Firmware executionFECS direct execution from host-loaded IMEMProven (Exp 068)

coral-glowplug is a systemd daemon that manages GPU lifecycle without any NVIDIA software. It binds GPUs to vfio-pci at boot, provides hot-swap personality management (VfioPersonality, NouveauPersonality, AmdgpuPersonality), health monitoring, and graceful shutdown — all via JSON-RPC 2.0 over Unix socket.


Current Performance vs CUDA/Kokkos

BenchmarkCUDA/KokkosecoPrimals (wgpu)ecoPrimals (DF64)Gap
Yukawa MD (N=10K, 80K steps)~1 hr (HPC)3.66 hrs (RTX 4070)3.7×
Lattice QCD 32⁴ β-scan13.6 hrs ($0.58)First consumer
Nuclear EOS L1184 s (Python)2.3 s0.012× (80× faster)
f64 throughput9.7 TFLOPS (A100)0.35 TFLOPS (native)3.24 TFLOPS (DF64)
Kokkos Verlet stepper1.0× reference0.27× (3.7× gap)Active optimization

The gap is narrowing: hotSpring Kokkos parity tracking shows 27×→12.4×→3.7× improvement over three months. The remaining gap is primarily in DF64 transcendental functions (exp, log, sin, cos) where NVIDIA’s NVVM has hand-optimized silicon paths. coralReef’s sovereign transcendentals use Newton-Raphson iteration at slightly higher latency.


What’s Coming Next (Near-Term, Given Velocity)

Based on the 27-day sprint velocity (architecture/EVOLUTION_TIMELINE.md), with BarraCuda gaining ~50 new shaders/month and coralReef closing 2–3 NVVM bypass patterns per iteration:

3 Months (June 2026)

TargetWhat Changes
Kokkos gap → <2×DF64 transcendental optimization in coralReef (FMA fusion, Newton-Raphson refinement)
coralReef dispatchFull compute dispatch via VFIO (compile + launch on same sovereign path)
Multi-GPUtoadStool multi-adapter dispatch (RTX 3090 + Titan V in parallel)
barraCuda 1,000 shadersCover remaining cuBLAS L3 ops, sparse ops, Krylov solvers
AMD E2E productionRX 6950 XT full pipeline: compile + dispatch + validate

6 Months (September 2026)

TargetWhat Changes
helixVision Phase DEnd-to-end protein structure prediction pipeline (FASTA→MSA→Evoformer→structure)
AlphaFold timing parity~3 min/sequence on consumer GPU vs ~5 min cloud AlphaFold
Intel backendArc GPUs via coralReef third backend
genomeBin deploymentSelf-extracting single-file sovereign GPU stack
Kokkos gap → <1.5×Approaching throughput parity on science workloads

12 Months (March 2027)

TargetWhat Changes
Full NVVM replacementAll cuBLAS/cuFFT/cuDNN science ops at parity
barraCuda 2,000+ shadersCoverage comparable to CUDA ecosystem for science
coralReef sovereign dispatch productionComplete GPU lifecycle: boot → compile → dispatch → recover
Four-vendor GPU supportNVIDIA, AMD, Intel, Apple (Metal via wgpu)

The Proprietary Cost of What We Replace

ToolCostWhat ecoPrimals Replaces It With
CUDA ToolkitFree (vendor lock)wgpu/Vulkan (open standard, cross-vendor)
cuBLAS/cuFFTFree (NVIDIA-only)BarraCuda 806 WGSL shaders (any GPU)
nvcc compilerFree (NVIDIA-only)coralReef (pure Rust, SM70–SM89 + RDNA2)
NVIDIA driverFree (proprietary binary)coral-glowplug + VFIO (upstream Linux kernel)
A100 GPU~$10K–15KRTX 3090 (~$500 used), DF64: 3.24 TFLOPS
HPC allocation$50–500/run$0.044/run (electricity, consumer GPU)
MATLAB Parallel~$2K/yr + GPU toolboxtoadStool + BarraCuda (AGPL-3.0, free)
KokkosFree (C++, complex build)BarraCuda (Rust, cargo build)
LAMMPSFree (Fortran/C++, HPC)hotSpring (Rust, consumer GPU)

Annual lab savings (conservative): $0 in software licenses (all were technically free) but ~$50K–200K in HPC allocations avoided for a lab running regular plasma, QCD, or bioinformatics GPU workloads. Plus the unmeasurable value of zero-queue 24/7 access and data that never leaves the lab.


The Sovereign Stack Diagram

Application Layer
  └── Springs (wetSpring, hotSpring, neuralSpring, etc.)
      └── Science experiments: reproduce published papers

Math Layer
  └── BarraCuda v0.3.5 (806 WGSL shaders, 3,772 tests)
      └── f64-canonical math: what to compute

Orchestration Layer
  └── ToadStool S157 (96 JSON-RPC methods, 21,156 tests)
      └── Hardware discovery: where and how to compute

Compiler Layer
  └── coralReef Phase 10 Iter 53 (2,241 tests)
      └── WGSL → SPIR-V → native SASS/RDNA2

Hardware Layer
  └── coral-glowplug (systemd daemon, JSON-RPC)
      └── VFIO device lifecycle: boot, bind, dispatch, recover

Nothing above depends on:
  ✗ NVIDIA CUDA toolkit
  ✗ nvcc / ptxas / NVVM
  ✗ nvidia.ko kernel module
  ✗ Any C/C++ library
  ✗ Any cloud service
  ✗ Any vendor-specific API

Every layer is pure Rust, AGPL-3.0, and publicly auditable. The entire scientific computing stack — from math primitives to GPU binary compilation to hardware lifecycle management — belongs to the user.


Repositories:
ecoPrimals/barraCuda · ecoPrimals/toadStool · ecoPrimals/coralReef · syntheticChemistry/hotSpring