llama-cpp-python/docker at main · etjson/llama-cpp-python

Name	Name	Last commit message	Last commit date
parent directory ..
cuda_simple	cuda_simple
open_llama	open_llama
openblas_simple	openblas_simple
simple	simple
README.md	README.md

Name

Last commit message

Last commit date

Install Docker Server

Important

This was tested with Docker running on Linux.
If you can get it working on Windows or MacOS, please update this README.md with a PR!

Install Docker Engine

Simple Dockerfiles for building the llama-cpp-python server with external model bin files

openblas_simple

A simple Dockerfile for non-GPU OpenBLAS, where the model is located outside the Docker image:

cd ./openblas_simple
docker build -t openblas_simple .
docker run --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t openblas_simple

where <model-root-path>/<model-path> is the full path to the model file on the Docker host system.

cuda_simple

Warning

NVIDIA Container Toolkit: You must have the NVIDIA Container Toolkit installed on the host. The 12.8.1-cudnn-devel-ubuntu22.04 images currently in use generally include the necessary NVCC compilation environment.
VRAM: Ensure your GPU has enough VRAM to load the model.

A Dockerfile that builds llama-cpp-python from source (with CUDA 12.8 support) and runs an OpenAI-compatible API server.

1. Build

Note: The build process will compile the llama.cpp C++ backend, which may take several tens of minutes.

cd ./cuda_simple
docker build -t cuda_simple .

2. Run

docker run --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e MODEL=/app/models/<model-path> -v /path/to/your/models:/app/models -t cuda_simple

--gpus=all: Enables GPU access.
-e MODEL=...: Specifies the path to the model inside the container.

"Open-Llama-in-a-box"

Download an Apache V2.0 licensed 3B params Open LLaMA model and install into a Docker image that runs an OpenBLAS-enabled llama-cpp-python server:

$ cd ./open_llama
./build.sh
./start.sh

Manually choose your own Llama model from Hugging Face

python3 ./hug_model.py -a TheBloke -t llama You should now have a model in the current directory and model.bin symlinked to it for the subsequent Docker build and copy step. e.g.

docker $ ls -lh *.bin
-rw-rw-r-- 1 user user 4.8G May 23 18:30 <downloaded-model-file>q5_1.bin
lrwxrwxrwx 1 user user   24 May 23 18:30 model.bin -> <downloaded-model-file>q5_1.bin

Note

Make sure you have enough disk space to download the model. As the model is then copied into the image you will need at least TWICE as much disk space as the size of the model:

Model	Quantized size
3B	3 GB
7B	5 GB
13B	10 GB
33B	25 GB
65B	50 GB

Note

If you want to pass or tune additional parameters, customise ./start_server.sh before running docker build ...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Install Docker Server

Simple Dockerfiles for building the llama-cpp-python server with external model bin files

openblas_simple

cuda_simple

1. Build

2. Run

"Open-Llama-in-a-box"

Manually choose your own Llama model from Hugging Face

FilesExpand file tree

docker

Directory actions

More options

Directory actions

More options

Latest commit

History

docker

Folders and files

parent directory

README.md

Install Docker Server

Simple Dockerfiles for building the llama-cpp-python server with external model bin files

openblas_simple

cuda_simple

1. Build

2. Run

"Open-Llama-in-a-box"

Manually choose your own Llama model from Hugging Face