Environment
GPU: NVIDIA GeForce GTX 1060
Vision-language model: SmolVLM-500M-Instruct
Build
sudo apt-get install libssl-dev libcurl4-openssl-dev
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build -j16
The GTX 1060's CUDA compute capability is 6.1 (found in "Ollama GPU Support - Nvidia and AMD GPU Compatibility | LlamaFactory"), hence -DCMAKE_CUDA_ARCHITECTURES=61 at configure time.
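If you would rather confirm the compute capability locally instead of looking it up, recent NVIDIA drivers can report it directly (assuming your driver version supports the compute_cap query field):
nvidia-smi --query-gpu=name,compute_cap --format=csv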
Download the model GGUF
export HF_HOME=/media/dataset/hf
export HF_ENDPOINT=https://hf-mirror.com
export HF_TOKEN=xxx
huggingface-cli download ggml-org/SmolVLM-500M-Instruct-GGUF --local-dir .
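If you only want the two files the server actually loads, huggingface-cli also accepts explicit file names (the Q8_0 names below match the files used in the run command later):
huggingface-cli download ggml-org/SmolVLM-500M-Instruct-GGUF SmolVLM-500M-Instruct-Q8_0.gguf mmproj-SmolVLM-500M-Instruct-Q8_0.gguf --local-dir .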
Run llama.cpp
export PATH"$PATH:/path/to/llama.cpp/build/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/llama.cpp/build/bin
llama-server -m ./SmolVLM-500M-Instruct-Q8_0.gguf --mmproj mmproj-SmolVLM-500M-Instruct-Q8_0.gguf --host 0.0.0.0 --port 8080 -ngl 100
Log output:
User: Hello<end_of_utterance>
Assistant: Hi there<end_of_utterance>
User: How are you?<end_of_utterance>
Assistant:'
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv update_slots: all slots are idle
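Once the log shows the server listening, you can sanity-check it from another shell; llama-server exposes a /health endpoint:
curl http://192.168.3.100:8080/health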
- Open the server address in a browser, e.g. 192.168.3.100:8080, and you can chat about images. On the 1060's 6 GB of VRAM, the 500M model streams tokens very fast.
- The 500M model is not slow even on CPU. You can also try the even smaller 256M variant.
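Besides the web UI, llama-server also exposes an OpenAI-compatible API, so image chat can be scripted. A minimal sketch (cat.jpg is a placeholder image name; the server address matches the example above):
# encode the image as a base64 data URI and post it to /v1/chat/completions
IMG=$(base64 -w0 cat.jpg)
curl http://192.168.3.100:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":[{"type":"text","text":"Describe this image."},{"type":"image_url","image_url":{"url":"data:image/jpeg;base64,'"$IMG"'"}}]}]}'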