Short answer: No
There is no way to run a Llama-2-70B chat model entirely on an 8 GB GPU alone.
Not even with quantization. (For the file sizes / memory requirements of the Q2 quantization, see below.)
Long answer: combined with your system memory, maybe
Your best bet to run Llama-2-70B is:
Try out Llama.cpp, or any of the projects based on it, using the .gguf quantizations.
With Llama.cpp you can run models and offload parts of them to the GPU, with the rest running on the CPU. Even then, with the most aggressive available quantization, Q2, which causes significant quality loss in the model, you need a total of roughly 32 GB of memory, combined from GPU VRAM and system RAM - and keep in mind that your system itself uses up RAM too.
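As a rough illustration, here is a minimal sketch using llama-cpp-python (one of the projects built on Llama.cpp). The file name corresponds to TheBloke's Q2_K quantization, and the n_gpu_layers value is just a placeholder you would tune down until the offloaded layers plus context fit into 8 GB of VRAM:

from llama_cpp import Llama

# Load the Q2_K-quantized 70B model and offload only part of it to the GPU.
# n_gpu_layers is a guess; lower it if you run out of VRAM.
llm = Llama(
    model_path="llama-2-70b-chat.Q2_K.gguf",
    n_gpu_layers=15,   # the remaining ~65 layers stay in system RAM
    n_ctx=2048,
)

# Generate a short completion; expect well under 1 token/s on such a setup.
out = llm("Q: Why is the sky blue? A:", max_tokens=64)
print(out["choices"][0]["text"])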
It may run anyway, with your system starting to swap, which will make the answers incredibly slow. And I mean incredibly, because swapping will occur for every single run through the model's neural network, resulting in something like several minutes per generated token at least (if not worse).
Without swapping, depending on the capabilities of your system, expect something around 0.5 tokens/s or slightly above, maybe worse (at 0.5 tokens/s, a 200-token answer already takes close to 7 minutes).
Here is the model card of the GGUF-quantized Llama-2-70B chat model; it contains further information on how to run it with different software:
TheBloke/Llama-2-70B-chat-GGUF
Running the 70B model on 8 GB of VRAM would be difficult even with quantization. Maybe you can try running it on Hugging Face? You can get a quota for a single large A100 instance. The smaller models are also fairly capable, so give them a shot. For quantization, look at the llama.cpp project (GGML); it works with Llama v2 too.
With 8 GB of VRAM you can try running the newer Code Llama models and also the smaller Llama v2 models. Try the Oobabooga Web UI (it's on GitHub) as a generic frontend with a chat interface, although in my experience it is a bit slow at inference.
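As a sketch of how a smaller model can fit entirely on an 8 GB card, the snippet below uses the ctransformers library (the same one used in the answer further down) to load a Q4_K_M-quantized 7B chat model from the Hugging Face Hub; the repo and file names refer to TheBloke/Llama-2-7B-Chat-GGUF and may need to be adapted:

import ctransformers

# A 7B chat model at Q4_K_M is roughly 4 GB, so every layer fits in 8 GB of VRAM.
llm = ctransformers.AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    model_file="llama-2-7b-chat.Q4_K_M.gguf",
    model_type="llama",
    gpu_layers=50,        # more layers than the model has, i.e. offload everything
    max_new_tokens=128,
)

print(llm("Explain what a GGUF file is in one sentence."))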
Good luck!
I am running on a single 6 GB GPU and a CPU with 176 GB of RAM. Here is the Python script to run the model: we just choose to offload a small number of layers to the GPU; the rest is processed on the CPU, which is much slower, yet it works.
import os
import ctransformers

# Set the path to the model file
model_path = os.path.join(os.getcwd(), "llama-2-70b-chat.Q4_K_M.gguf")

# Create the AutoModelForCausalLM instance
llm = ctransformers.AutoModelForCausalLM.from_pretrained(
    model_path,
    model_type="gguf",
    gpu_layers=5,           # offload only a few layers to the 6 GB GPU
    threads=24,
    reset=False,
    context_length=10000,
    stream=True,
    max_new_tokens=256,
    temperature=0.8,
    repetition_penalty=1.1,
)

# Start a conversation loop
while True:
    # Get the user input
    user_input = input("Human: ")

    # Generate a response
    response = llm(user_input)

    # Print the response as it streams in
    print("BOT:")
    for text in response:
        print(text, end="", flush=True)
    print()