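With ChatGPT in wide use, all kinds of large language models have been appearing one after another. I tried a few some months ago, found them rather dumb and nowhere near the service OpenAI offers, and lost interest. Recently someone reported in the issue tracker that the image generation feature of matrix_chatgpt_bot did not work properly with a LocalAI backend. While fixing that, I took the chance to test some newer chat and image models, and was stunned by how fast they have improved.
Below I cover how to set up LocalAI and how to install a large language model (mistral) and an image generation model (sdxl-turbo).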
Installing LocalAI
For convenience, we deploy the services with docker.
Assume the working directory is ~/localai. Create a compose.yaml file there:
services:
  api:
    image: quay.io/go-skynet/local-ai:latest
    tty: true
    ports:
      - 127.0.0.1:12345:8080
    env_file:
      - .env
    volumes:
      - ./models:/models:cached
      - ./images/:/tmp/generated/images/
    command: ["/usr/bin/local-ai"]
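Next create the .env file. Compared with the default configuration, THREADS must be less than or equal to the number of CPU cores, and the COMPEL=0 line must be uncommented.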
## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
THREADS=4
## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080
## Default models context size
# CONTEXT_SIZE=512
#
## Define galleries.
## models available to install will be visible in `/models/available`
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}]
## CORS settings
# CORS=true
# CORS_ALLOW_ORIGINS=*
## Default path for models
MODELS_PATH=/models
## Enable debug mode
DEBUG=true
## Disables COMPEL (Diffusers)
COMPEL=0
## Enable/Disable single backend (useful if only one GPU is available)
# SINGLE_ACTIVE_BACKEND=true
## Specify a build type. Available: cublas, openblas, clblas.
## cuBLAS: This is a GPU-accelerated version of the complete standard BLAS (Basic Linear Algebra Subprograms) library. It's provided by Nvidia and is part of their CUDA toolkit.
## OpenBLAS: This is an open-source implementation of the BLAS library that aims to provide highly optimized code for various platforms. It includes support for multi-threading and can be compiled to use hardware-specific features for additional performance. OpenBLAS can run on many kinds of hardware, including CPUs from Intel, AMD, and ARM.
## clBLAS: This is an open-source implementation of the BLAS library that uses OpenCL, a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. clBLAS is designed to take advantage of the parallel computing power of GPUs but can also run on any hardware that supports OpenCL. This includes hardware from different vendors like Nvidia, AMD, and Intel.
BUILD_TYPE=openblas
## Uncomment and set to true to enable rebuilding from source
# REBUILD=true
## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper
## (requires REBUILD=true)
#
# GO_TAGS=stablediffusion
## Path where to store generated images
# IMAGE_PATH=/tmp
## Specify a default upload limit in MB (whisper)
# UPLOAD_LIMIT
## List of external GRPC backends (note on the container image this variable is already set to use extra backends available in extra/)
# EXTERNAL_GRPC_BACKENDS=my-backend:127.0.0.1:9000,my-backend2:/usr/bin/backend.py
### Advanced settings ###
### Those are not really used by LocalAI, but from components in the stack ###
##
### Preload libraries
# LD_PRELOAD=
### Huggingface cache for models
# HUGGINGFACE_HUB_CACHE=/usr/local/huggingface
### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls whether a backend can process multiple requests or not.
# PYTHON_GRPC_MAX_WORKERS=1
### Define the number of parallel LLAMA.cpp workers (Defaults to 1)
# LLAMACPP_PARALLEL=1
### Enable to run parallel requests
# PARALLEL_REQUESTS=true
### Watchdog settings
###
# Enables watchdog to kill backends that are inactive for too much time
# WATCHDOG_IDLE=true
#
# Enables watchdog to kill backends that are busy for too much time
# WATCHDOG_BUSY=true
#
# Time in duration format (e.g. 1h30m) after which a backend is considered idle
# WATCHDOG_IDLE_TIMEOUT=5m
#
# Time in duration format (e.g. 1h30m) after which a backend is considered busy
# WATCHDOG_BUSY_TIMEOUT=5m
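If you are not sure what to put in THREADS, a small helper like the sketch below can clamp a value to the core count the machine reports. The name clamp_threads is my own, made up for illustration. Note that nproc counts logical cores, while the comment in .env recommends physical cores, so on a CPU with hyper-threading you may want to go lower.

```shell
# clamp_threads (hypothetical helper): print the requested thread count,
# capped at the number of cores available.
#   $1 = requested threads
#   $2 = optional core count override (defaults to what the system reports)
clamp_threads() {
    requested="$1"
    # nproc (GNU coreutils) reports logical cores; fall back to getconf if it is missing.
    available="${2:-$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)}"
    if [ "$requested" -gt "$available" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

# Usage (not run here, GNU sed syntax): write the result into .env
#   sed -i "s/^THREADS=.*/THREADS=$(clamp_threads 4)/" .env
```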
Then start the container:
docker compose up -d
The image is very large, so downloading and extracting it takes time. Wait patiently until the container is up.
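Rather than checking by hand, you can poll until the API answers. The following is a minimal sketch: wait_until is a hypothetical helper name, and I am assuming that the OpenAI-compatible /v1/models endpoint responds once LocalAI has finished starting.

```shell
# wait_until (hypothetical helper): retry a command until it succeeds
# or the attempts run out.
#   $1 = maximum attempts, $2 = seconds to sleep between attempts,
#   remaining arguments = the command to run
wait_until() {
    attempts="$1"; delay="$2"; shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage (not run here): poll every 5 seconds, for up to 10 minutes.
#   wait_until 120 5 curl -sf http://localhost:12345/v1/models && echo "LocalAI is up"
```

The port 12345 is the one mapped in compose.yaml above. curl's -f flag makes HTTP error responses count as failures, so the loop keeps going until a real success.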
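Installing the large language model
I chose the mistral model, which in my tests came closest to gpt-3.5-turbo in performance. Thanks to LocalAI's modular design, its developers maintain a model gallery with many ready-to-use configuration files, so a single command is enough to install mistral: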
curl http://localhost:12345/models/apply -H "Content-Type: application/json" -d '{ "id": "model-gallery@mistral" }'
You can check the container logs to see whether the model has finished downloading:
docker compose logs
Once the download finishes, the model and its configuration files appear in the models directory.
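As an alternative to reading the logs, the install job can be queried over HTTP. As far as I know, the apply endpoint answers with a small JSON object containing a job uuid, and the job can then be looked up at /models/jobs/<uuid>. I have not checked this against every LocalAI version, so treat the field name and the path as assumptions. job_uuid is a hypothetical helper that avoids a dependency on jq.

```shell
# job_uuid (hypothetical helper): read a JSON reply on stdin and print the
# value of its "uuid" field. Assumes the reply is a single line of JSON.
job_uuid() {
    sed -n 's/.*"uuid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage (not run here; endpoint behaviour is an assumption, see above):
#   uuid=$(curl -s http://localhost:12345/models/apply \
#            -H "Content-Type: application/json" \
#            -d '{ "id": "model-gallery@mistral" }' | job_uuid)
#   curl -s "http://localhost:12345/models/jobs/$uuid"
```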
Here I replaced the default mistral-7b-openorca.Q6_K.gguf model with mistral-ft-optimized-1218.Q5_K_S.gguf. To do the same, edit models/mistral.yaml and change the parameters -> model setting. (You can also skip this and keep the default.)
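The edit can also be scripted. This sketch assumes the config contains exactly one "model:" line, indented under parameters; set_model_file is a name I made up. The new .gguf file still has to be placed in the models directory yourself.

```shell
# set_model_file (hypothetical helper): rewrite the "model:" line of a
# LocalAI model config so it points at a different model file.
#   $1 = path to the YAML config, $2 = new model file name
set_model_file() {
    config="$1"; new_model="$2"
    tmp="${config}.tmp"
    # Keep the original indentation and key, replace only the value.
    sed "s|^\([[:space:]]*model:\).*|\1 ${new_model}|" "$config" > "$tmp" \
        && mv "$tmp" "$config"
}

# Usage (not run here):
#   set_model_file models/mistral.yaml mistral-ft-optimized-1218.Q5_K_S.gguf
#   docker compose restart
```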
Finally, test it from the terminal with curl:
curl http://localhost:12345/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "mistral",
"messages": [{"role": "user", "content": "你好呀?" }]
}'
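Installing the sdxl-turbo image model
This model's strengths are speed and decent quality. It needs only a single step to produce a result; on my 4-core Intel N100 CPU, one image takes a little over 20 seconds.
Create a stablediffusion.yaml file in the models directory: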
name: stablediffusion
parameters:
  model: stabilityai/sdxl-turbo
backend: diffusers
step: 1
cuda: false
# Force CPU usage - set to true for GPU
f16: false
diffusers:
  scheduler_type: euler_a
  cfg_scale: 1
Then restart the container:
docker compose restart
Next, generate the first image with the command below. If the model is not present, the program downloads it automatically:
curl http://127.0.0.1:12345/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "face focus, cute, masterpiece, best quality, 1girl, green hair, sweater, looking at viewer, upper body, beanie, outdoors, night, turtleneck",
"size": "512x512"
}'
Wait patiently for the model download to finish, and image generation will start.
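Since compose.yaml maps ./images to /tmp/generated/images/ inside the container, the finished picture should show up on the host as well. Here is a small sketch for finding the newest one. latest_image is a hypothetical helper, and I am assuming LocalAI writes its output into that mounted directory.

```shell
# latest_image (hypothetical helper): print the path of the most recently
# modified file in a directory (default: ./images); fail if it is empty.
latest_image() {
    dir="${1:-./images}"
    # ls -t sorts newest first; file names are assumed to contain no newlines.
    newest=$(ls -t "$dir" 2>/dev/null | head -n 1)
    [ -n "$newest" ] && echo "$dir/$newest"
}

# Usage (not run here), from ~/localai:
#   latest_image ./images
```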
At this point the image model is installed. But!!!
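If you look closely in the models directory, you will not find the sdxl-turbo model there, not even when looking from inside the container. Why does that matter? Once you destroy the container, for example with docker compose down, the next start has to download sdxl-turbo all over again, which wastes a lot of time. So we use the commit command to create a new image that includes the model.
Replace fd3782b2c44d with your own container ID, which you can get with docker ps: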
docker commit fd3782b2c44d local-ai-sd
Then change quay.io/go-skynet/local-ai:latest in compose.yaml to local-ai-sd and recreate the container with the command below:
docker compose down && docker compose up -d
Update, 2024-1-4: the Hugging Face cache turns out to live in /root/.cache/huggingface. Mount that path as a volume and the steps above are no longer needed.
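A rough sketch of that mount, under my own assumptions: the host directory ./huggingface is a name I picked, and the service is called api as in compose.yaml above. Copying the cache out of the running container first means the model does not have to be downloaded again after the switch.

```shell
# Hypothetical host directory for the cache, next to compose.yaml.
HF_CACHE_DIR="./huggingface"

# save_hf_cache (hypothetical helper): copy the already-downloaded cache out
# of the running "api" service. Run it before $HF_CACHE_DIR exists, otherwise
# docker nests the copy one level deeper.
save_hf_cache() {
    docker compose cp api:/root/.cache/huggingface "$HF_CACHE_DIR"
}

# Then add this entry under `volumes:` of the api service in compose.yaml:
#   - ./huggingface:/root/.cache/huggingface
# and recreate the container (not run here):
#   docker compose down && docker compose up -d
```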
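Closing thoughts
I originally wanted to build from source and run it directly, but the build problems wore me out, so I took the easy route with docker. The price is trading disk space for convenience: an image of more than 70 GB!!!
I hope this tutorial helps anyone who wants to deploy services with LocalAI.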