Environment

GPU: NVIDIA GeForce GTX 1060
Vision-language model: SmolVLM-500M-Instruct

Build

sudo apt-get install libssl-dev libcurl4-openssl-dev

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build -j 16

The GTX 1060's compute capability is 6.1, per "Ollama GPU Support - Nvidia and AMD GPU Compatibility | LlamaFactory", so pass -DCMAKE_CUDA_ARCHITECTURES=61 when configuring.
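You can also read the compute capability directly from the driver instead of looking it up; a minimal check (the compute_cap query field requires a reasonably recent NVIDIA driver):

nvidia-smi --query-gpu=name,compute_cap --format=csv
# Expected on this card:
# name, compute_cap
# NVIDIA GeForce GTX 1060, 6.1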

Downloading the model GGUF

export HF_HOME=/media/dataset/hf
export HF_ENDPOINT=https://hf-mirror.com
export HUGGINGFACE_TOKEN=xxx

# Download into the Hugging Face cache under $HF_HOME ...
huggingface-cli download ggml-org/SmolVLM-500M-Instruct-GGUF
# ... or download into the current directory
huggingface-cli download ggml-org/SmolVLM-500M-Instruct-GGUF --local-dir .
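If you only need the Q8_0 quantization, you can narrow the download with huggingface-cli's --include filter; a minimal sketch, assuming the repo's file names contain "Q8_0":

huggingface-cli download ggml-org/SmolVLM-500M-Instruct-GGUF --include "*Q8_0*" --local-dir .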

Running llama.cpp

export PATH="$PATH:/path/to/llama.cpp/build/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/path/to/llama.cpp/build/bin"

llama-server -m ./SmolVLM-500M-Instruct-Q8_0.gguf --mmproj mmproj-SmolVLM-500M-Instruct-Q8_0.gguf --host 0.0.0.0 --port 8080 -ngl 100

Output log:

User: Hello<end_of_utterance>
Assistant: Hi there<end_of_utterance>
User: How are you?<end_of_utterance>
Assistant:'
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
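Once the log shows the server listening, llama-server's /health endpoint gives a quick liveness check:

curl http://localhost:8080/health
# Returns an OK status once the model has finished loading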

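The built-in web UI at http://localhost:8080 accepts image attachments in the chat. The same works against the OpenAI-compatible /v1/chat/completions endpoint by embedding the image as a base64 data URL; a minimal sketch, assuming a local test.jpg (the file name and prompt are illustrative):

# Encode the image and send it alongside a text prompt
IMG_B64=$(base64 -w0 test.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {
            "role": "user",
            "content": [
              {"type": "text", "text": "Describe this image."},
              {"type": "image_url",
               "image_url": {"url": "data:image/jpeg;base64,'"$IMG_B64"'"}}
            ]
          }
        ]
      }'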