fix: gate CUDA directory checks on GPU vendor to prevent false CUDA detection #8942
Open
sozercan wants to merge 3 commits into mudler:master
…etection

Container images that install CUDA runtime libraries (e.g., cuda-cudart-12-5 via apt) create /usr/local/cuda-12 directories as a side effect. The previous code checked for these directories before checking whether a GPU was present, causing CPU-only hosts to select a CUDA backend that crashes because libcuda.so.1 is absent.

Reorder checks so CUDA directory existence only refines the capability when an NVIDIA GPU is actually detected, consistent with the arm64 L4T code path.

Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
Summary
- Prevent false CUDA detection on CPU-only hosts that have CUDA runtime libraries installed (e.g., `cuda-cudart-12-5` via apt), which create `/usr/local/cuda-12` directories as a side effect
- Reorder checks in `getSystemCapabilities()` so CUDA directory existence only refines the capability when an NVIDIA GPU is actually detected, consistent with the arm64 L4T code path that already gates on `GPUVendor == Nvidia`

Problem
Container images that install CUDA runtime libraries create `/usr/local/cuda-12` or `/usr/local/cuda-13` directories. The previous code checked for these directories before checking whether a GPU was present, causing CPU-only hosts to select a CUDA backend. The CUDA backend then crashes because `libcuda.so.1` is absent.
Previous PR that fixed a similar issue: #6149
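The ordering problem described above can be illustrated with a minimal sketch. This is not the code from `pkg/system/capabilities.go`: the function names `oldOrder` and `newOrder`, the injected `dirExists` predicate, and the choice to prefer `cuda-13` over `cuda-12` are all assumptions made for illustration.

```go
package main

import "fmt"

// dirExists is a hypothetical stand-in for a filesystem check such as os.Stat.
// Injecting it keeps the sketch runnable without touching the real filesystem.
type dirExists func(path string) bool

// oldOrder mimics the previous behaviour: CUDA directories are consulted
// before the GPU vendor is known, so a CPU-only host can be misclassified.
func oldOrder(vendor string, exists dirExists) string {
	if exists("/usr/local/cuda-13") {
		return "nvidia-cuda-13"
	}
	if exists("/usr/local/cuda-12") {
		return "nvidia-cuda-12"
	}
	if vendor == "" {
		return "default"
	}
	return vendor
}

// newOrder gates the directory checks on the vendor: the directories only
// refine the result once an NVIDIA GPU has been detected.
func newOrder(vendor string, exists dirExists) string {
	if vendor == "" {
		return "default" // no GPU: exit before looking at any directory
	}
	if vendor == "nvidia" {
		if exists("/usr/local/cuda-13") {
			return "nvidia-cuda-13"
		}
		if exists("/usr/local/cuda-12") {
			return "nvidia-cuda-12"
		}
	}
	return vendor
}

func main() {
	// A CPU-only container whose image happens to ship /usr/local/cuda-12.
	cudaRuntimeImage := func(path string) bool { return path == "/usr/local/cuda-12" }

	fmt.Println(oldOrder("", cudaRuntimeImage)) // nvidia-cuda-12
	fmt.Println(newOrder("", cudaRuntimeImage)) // default
}
```

With no GPU vendor reported, the old ordering returns a CUDA capability purely because the directory exists, while the gated ordering falls back to `"default"`.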
Changes
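- `pkg/system/capabilities.go`: Reordered the non-arm64 path in `getSystemCapabilities()`:
  1. … `"default"` (early exit)
  2. … `"default"` with warning
  3. CUDA directory checks only when `GPUVendor == Nvidia`
- `pkg/system/capabilities_test.go`: New file with table-driven tests:

| GPU vendor | Expected capability |
| --- | --- |
| `""` | `"default"` |
| `"nvidia"` | `"nvidia-cuda-12"` |
| `"nvidia"` | `"nvidia-cuda-13"` |
| `"nvidia"` | `"nvidia-cuda-13"` |
| `"amd"` | `"amd"` |
| `""` | `"default"` |
| `"nvidia"` | `"nvidia"` |
| `"nvidia"` | `"default"` |

Test plan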
- `go test ./pkg/system/...`: new unit tests pass (skipped on darwin as expected)
- `go vet ./pkg/system/...`: clean
- … `"default"` instead of `"nvidia-cuda-12"`
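For readers unfamiliar with the table-driven style mentioned under Changes, here is a rough sketch of what such a check can look like. It is not the content of `capabilities_test.go`: the `host` struct, its fields, the `capability` function, the case names, and the idea that the final `"nvidia"` → `"default"` row corresponds to a low-VRAM host are all assumptions. Only the vendor and expected-capability strings are taken from the PR description. In a real `_test.go` file the loop would normally live in a `func TestXxx(t *testing.T)` and use `t.Run` per case; it is written as a plain program here so it runs standalone.

```go
package main

import "fmt"

// host is a hypothetical description of a machine; none of these field
// names come from the PR.
type host struct {
	vendor  string // "" means no GPU was detected
	cuda12  bool   // /usr/local/cuda-12 exists
	cuda13  bool   // /usr/local/cuda-13 exists
	lowVRAM bool   // assumed trigger for the "default with warning" path
}

// capability is an illustrative stand-in for the gated detection logic.
func capability(h host) string {
	if h.vendor == "" {
		return "default"
	}
	if h.lowVRAM {
		return "default"
	}
	if h.vendor == "nvidia" {
		if h.cuda13 {
			return "nvidia-cuda-13"
		}
		if h.cuda12 {
			return "nvidia-cuda-12"
		}
	}
	return h.vendor
}

type testCase struct {
	name string
	in   host
	want string
}

// cases pairs each assumed scenario with the expected capability string.
func cases() []testCase {
	return []testCase{
		{"no GPU, CUDA dirs present", host{vendor: "", cuda12: true, cuda13: true}, "default"},
		{"nvidia with cuda-12", host{vendor: "nvidia", cuda12: true}, "nvidia-cuda-12"},
		{"nvidia with cuda-13", host{vendor: "nvidia", cuda13: true}, "nvidia-cuda-13"},
		{"nvidia with both dirs", host{vendor: "nvidia", cuda12: true, cuda13: true}, "nvidia-cuda-13"},
		{"amd with CUDA dirs", host{vendor: "amd", cuda12: true}, "amd"},
		{"no GPU, no dirs", host{}, "default"},
		{"nvidia without CUDA dirs", host{vendor: "nvidia"}, "nvidia"},
		{"nvidia with low VRAM", host{vendor: "nvidia", cuda12: true, lowVRAM: true}, "default"},
	}
}

// failures runs every case and reports mismatches as readable strings.
func failures() []string {
	var out []string
	for _, tc := range cases() {
		if got := capability(tc.in); got != tc.want {
			out = append(out, fmt.Sprintf("%s: got %q, want %q", tc.name, got, tc.want))
		}
	}
	return out
}

func main() {
	f := failures()
	for _, msg := range f {
		fmt.Println("FAIL", msg)
	}
	if len(f) == 0 {
		fmt.Println("all cases pass")
	}
}
```

Keeping inputs and expectations in one slice makes it cheap to add a row whenever a new misdetection scenario turns up.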