
Run the Llama-2-70B-chat model on a single GPU




I'm running PyTorch on an Ubuntu Server 18.04 LTS machine. I have an NVIDIA GPU with 8 GB of RAM. I'd like to experiment with the new Llama-2-70B-chat model. I'm trying to use peft and bitsandbytes to reduce the hardware requirements, as described in the link below:


https://www.youtube.com/watch?v=6iHVJyX2e50



Is it possible to work with the Llama-2-70B-chat model on a single GPU with 8 GB of RAM? I don't care if it's quick; I just want to experiment and see what kind of quality responses I can get out of it.


Recommended answers

Short answer: No


There is no way to run a Llama-2-70B chat model entirely on an 8 GB GPU alone.
Not even with quantization. (See below for the file/memory sizes of the Q2 quantization.)


Long answer: combined with your system memory, maybe


Your best bet to run Llama-2-70B is:


Try out Llama.cpp, or any of the projects based on it, using the .gguf quantizations.


With Llama.cpp you can run models and offload parts of them to the GPU, with the rest running on the CPU.
Even then, with the most aggressive available quantization, Q2, which causes significant quality loss in the model, you need a total of roughly 32 GB of memory, combined across GPU and system RAM - and keep in mind that your system needs some RAM for itself too.
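For illustration, a minimal sketch of this kind of partial GPU offloading with the llama-cpp-python bindings might look like the following; the model filename, n_gpu_layers value, and thread count are assumptions you would tune to whatever actually fits in 8 GB of VRAM:

from llama_cpp import Llama

# Assumed local path to a Q2_K-quantized 70B GGUF file; adjust to your download.
llm = Llama(
    model_path="llama-2-70b-chat.Q2_K.gguf",
    n_gpu_layers=20,   # layers to offload to the GPU (assumption; tune to fit 8 GB)
    n_ctx=2048,        # context window
    n_threads=8,       # CPU threads for the layers that stay on the CPU
)

# Single prompt completion; expect this to be very slow on the hardware discussed above.
output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output["choices"][0]["text"])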


It may run anyway, with your system starting to swap, which will make the answers incredibly slow. And I mean incredibly, because swapping will occur on every single pass through the model's neural network, resulting in something like several minutes per generated token at least (if not worse).


Without swapping, depending on the capabilities of your system, expect something around 0.5 tokens/s or slightly above, maybe worse.
Here is the model card of the GGUF-quantized Llama-2-70B chat model; it contains further information on how to run it with different software:
TheBloke/Llama-2-70B-chat-GGUF
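As an aside, one way to pull a single quantized file from that repository is with the huggingface_hub client; the filename below is an assumption based on TheBloke's usual naming, so check the model card for the quantizations that actually exist:

from huggingface_hub import hf_hub_download

# Download one quantization variant from the repository named on the model card.
# The filename is assumed; pick the actual Q2_K file listed there.
local_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-70B-chat-GGUF",
    filename="llama-2-70b-chat.Q2_K.gguf",
)
print(local_path)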



To run the 70B model on 8 GB of VRAM would be difficult even with quantization. Maybe you can try running it on Hugging Face? You can get quota for a single large A100 instance. The smaller models are also fairly capable; give them a shot. For quantization, look at the llama.cpp project (GGML), which works with Llama v2 too.


With 8 GB of VRAM you can try running the newer Code Llama models and also the smaller Llama v2 models. Try the Oobabooga web UI (it's on GitHub) as a generic frontend with a chat interface, though in my experience it is a bit slow at generating inference.
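For the smaller models, here is a minimal sketch of loading Llama-2-7B-chat in 4-bit with transformers and bitsandbytes (the approach the question mentions); the model ID is a gated Hugging Face repository that requires accepting Meta's license, and the generation settings are assumptions:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated repo; requires approved access

# 4-bit quantization keeps the 7B weights around 4 GB, which fits in 8 GB of VRAM.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

prompt = "What kind of models can I run on an 8 GB GPU?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))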


Good luck!




I am running on a single 6 GB GPU and a CPU with 176 GB of RAM. Here is the Python script to run the model. We just choose to offload a small number of layers to the GPU; the rest is processed on the CPU, which is much slower, yet it works.
import os
import ctransformers

# Set the path to the model file
model_path = os.path.join(os.getcwd(), "llama-2-70b-chat.Q4_K_M.gguf")

# Create the AutoModelForCausalLM instance
llm = ctransformers.AutoModelForCausalLM.from_pretrained(
    model_path, model_type="gguf", gpu_layers=5, threads=24, reset=False,
    context_length=10000, stream=True, max_new_tokens=256,
    temperature=0.8, repetition_penalty=1.1)

# Start a conversation loop
while True:
    # Get the user input
    user_input = input("Human: ")

    # Generate a response
    response = llm(user_input)

    # Print the streamed response
    print("BOT:")
    for text in response:
        print(text, end="", flush=True)

