fix: gate CUDA directory checks on GPU vendor to prevent false CUDA detection #8942

Open
sozercan wants to merge 3 commits into mudler:master from sozercan:fix/gate-cuda-dir-on-gpu-vendor

@sozercan sozercan commented Mar 10, 2026

Summary

  • Fix incorrect CUDA backend selection on CPU-only hosts that have CUDA runtime
    libraries installed (e.g., cuda-cudart-12-5 via apt), which create
    /usr/local/cuda-12 directories as a side effect
  • Reorder checks in getSystemCapabilities() so CUDA directory existence only
    refines the capability when an NVIDIA GPU is actually detected, consistent with
    the arm64 L4T code path that already gates on GPUVendor == Nvidia
  • Add unit tests covering 8 scenarios for the capability detection logic

Problem

Container images that install CUDA runtime libraries create /usr/local/cuda-12
or /usr/local/cuda-13 directories. The previous code checked for these
directories before checking whether a GPU was present, causing CPU-only hosts
to select a CUDA backend. The CUDA backend then crashes because libcuda.so.1,
which ships with the NVIDIA driver rather than with the runtime packages, is
absent.

Previous PR that fixed a similar issue: #6149

Changes

pkg/system/capabilities.go — Reordered the non-arm64 path in
getSystemCapabilities():

  1. Check for no GPU → return "default" (early exit)
  2. Check for low VRAM (≤4GB) → return "default" with warning
  3. Check CUDA directories only if GPUVendor == Nvidia
  4. Fall back to GPU vendor string
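
The reordered checks above can be sketched as a pure function. This is a minimal illustration rather than the code in the PR: the `hostInfo` struct, its fields, `resolveCapability`, and the 4 GiB threshold constant are hypothetical names and values chosen for the example, and the real getSystemCapabilities() gathers its inputs from the host rather than from a struct.

```go
package main

import "fmt"

// hostInfo is a hypothetical bundle of the facts the check needs.
// The real code detects these on the host; here they are plain fields.
type hostInfo struct {
	GPUVendor string // "" when no GPU was detected
	VRAMBytes uint64 // total VRAM of the detected GPU
	HasCUDA12 bool   // /usr/local/cuda-12 exists
	HasCUDA13 bool   // /usr/local/cuda-13 exists
}

// lowVRAMThreshold assumes "4GB" means 4 GiB; this is an assumption.
const lowVRAMThreshold = 4 << 30

// resolveCapability applies the four checks in the order listed above.
func resolveCapability(h hostInfo) string {
	// 1. No GPU: exit early, ignoring any CUDA directories.
	if h.GPUVendor == "" {
		return "default"
	}
	// 2. Low VRAM: fall back to "default" (the warning is omitted here).
	if h.VRAMBytes <= lowVRAMThreshold {
		return "default"
	}
	// 3. CUDA directories only refine the result for NVIDIA GPUs.
	if h.GPUVendor == "nvidia" {
		if h.HasCUDA13 {
			return "nvidia-cuda-13"
		}
		if h.HasCUDA12 {
			return "nvidia-cuda-12"
		}
	}
	// 4. Otherwise report the vendor string as-is.
	return h.GPUVendor
}

func main() {
	// CPU-only host with CUDA runtime libraries installed.
	cpuOnly := hostInfo{HasCUDA12: true}
	fmt.Println(resolveCapability(cpuOnly))
}
```

Keeping the decision separate from host probing is one way to make the ordering testable without a GPU; whether the PR structures it this way is not stated in the description.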

pkg/system/capabilities_test.go — New file with table-driven tests:

| Scenario | GPUVendor | CUDA dirs | Expected |
|---|---|---|---|
| CUDA dir, no GPU | "" | cuda12 | "default" |
| CUDA 12 + NVIDIA | "nvidia" | cuda12 | "nvidia-cuda-12" |
| CUDA 13 + NVIDIA | "nvidia" | cuda13 | "nvidia-cuda-13" |
| Both dirs + NVIDIA | "nvidia" | both | "nvidia-cuda-13" |
| CUDA dir + AMD | "amd" | cuda12 | "amd" |
| No CUDA, no GPU | "" | none | "default" |
| No CUDA + NVIDIA | "nvidia" | none | "nvidia" |
| CUDA + NVIDIA + low VRAM | "nvidia" | cuda12 | "default" |

Test plan

  • go test ./pkg/system/... — new unit tests pass (skipped on darwin as expected)
  • go vet ./pkg/system/... — clean
  • Verify on a CPU-only container with CUDA runtime libs installed that capability resolves to "default" instead of "nvidia-cuda-12"

netlify bot commented Mar 10, 2026

Deploy Preview for localai ready!

Name Link
🔨 Latest commit cd4558e
🔍 Latest deploy log https://app.netlify.com/projects/localai/deploys/69b087675955210008a8fdd5
😎 Deploy Preview https://deploy-preview-8942--localai.netlify.app

@sozercan sozercan force-pushed the fix/gate-cuda-dir-on-gpu-vendor branch from 4c4ab1e to 75d9ce8 Compare March 10, 2026 20:36
@sozercan sozercan marked this pull request as ready for review March 10, 2026 20:36
…etection

Container images that install CUDA runtime libraries (e.g., cuda-cudart-12-5
via apt) create /usr/local/cuda-12 directories as a side effect. The previous
code checked for these directories before checking whether a GPU was present,
causing CPU-only hosts to select a CUDA backend that crashes because
libcuda.so.1 is absent.

Reorder checks so CUDA directory existence only refines the capability when
an NVIDIA GPU is actually detected, consistent with the arm64 L4T code path.

Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
@sozercan sozercan force-pushed the fix/gate-cuda-dir-on-gpu-vendor branch from 75d9ce8 to 47fb0f5 Compare March 10, 2026 20:38