llama.cpp推理流程和常用函数介绍

转载作者：撒哈拉更新时间：2024-10-05 16:20:58

llama.cpp是一个高性能的CPU/GPU大语言模型推理框架，适用于消费级设备或边缘设备。开发者可以通过工具将各类开源大语言模型转换并量化成gguf格式的文件，然后通过llama.cpp实现本地推理。经过我的调研，相比较其它大模型落地方案，中小型研发企业使用llama.cpp可能是唯一的产品落地方案。关键词：“中小型研发企业”，“产品落地方案”.

中小型研发企业：相较动辄千万+的硬件投入，中小型研发企业只能支撑少量硬件投入，并且也缺少专业的研发人员.

产品落地方案：项目需要具备在垂直领域落地的能力，大多数情况下还需要私有化部署.

网上有不少介绍的文章，B站上甚至有一些收费课程。但是版本落后较多，基本已经没有参考价值。本文采用b3669版本，发布日期是2024年9月，参考代码：examples/main.cpp。由于作者(Georgi Gerganov)没有提供详细的接口文档，examples的代码质量也确实不高，因此学习曲线比较陡峭。本文旨在介绍如何使用llama.cpp进行推理和介绍重点函数，帮助开发人员入门，深入功能还有待研究.

1、推理流程

1. 过程描述

以常见的交互推理为例，程序大概可以分成5个子功能模块.

初始化：模型和系统提示词初始化。其实从程序处理过程上分析，并没有特别区分系统提示词与用户输入，实际项目开发中完全可以放在一起处理。后面会再解释它们在概念上的区别.

用户输入：等待用户输入文本信息。大语言模型其实就是对人类的文本信息进行分析和理解的过程，而产品落地的本质就是借助大模型的理解进一步完成一些指定任务。在这个过程中，互联网上又造了许多概念，什么agent，function等。其实本质上都是在研究如何将大模型与程序进一步结合并完成交互。至少目前，我的观点是：大模型仅具备语义分析，语义推理的能力.

分析预测：这个是大语言模型的核心能力之一，它需要分析上下文（系统提示词、用户输入、已推理的内容）再进一步完成下一个词语（token）的预测.

推理采样：这个是大语言模型的另一个核心能力，它需要从分析预测的结果中随机选择一个token，并将它作为输入反向发送给分析预测模块继续进行，直到输出结束（EOS）.

输出：这个模块严格说不属于大模型，但是它又是完成用户交互必须模块。从产品设计上，可以选择逐字输出（token-by-token）或者一次性输出（token-by-once）.

2. 概念介绍

角色（roles）：大语言模型通常会内置三种角色：系统（system），用户（user），助手（assistant）。这三种角色并非所有模型统一指定，但是基本目前所有开源的大模型都兼容这三种角色的交互，它有助于大模型更好的理解人类语境并完成任务。system表示系统提示词，就是我们常说的prompt。网上有不少课程将写系统提示词描述为提示词工程，还煞有介事的进行分类，其实大可不必。从我的使用经验看，一个好的系统提示词（prompt）应具备三个要点即可：语义明确，格式清晰，任务简单。语义明确即在系统提示词中尽量不要使用模棱两可的词语，用人话说就是“把问题说清楚”。格式清晰即可以使用markdown或者json指定一些重要概念。如果你需要让大模型按照某个固定流程进行分析，可以使用markdown的编号语法，如果你需要将大模型对推理结果进行结构化处理，可以使用json语法。任务简单即不要让大模型处理逻辑太复杂或者流程太多的任务。大模型的推理能力完全基于语义理解，它并不具备严格意义上的程序执行逻辑和数学运算逻辑。这就是为什么，当你问大模型：1.11和1.8谁大的时候，它会一本正经的告诉你，当整数部分一样大的时候，仅需要比较小数部分，因为11大于8，因此1.11大于1.8。那么如果我们现实中确实有一些计算任务或复杂的流程需要处理怎么办？我的解决方案是，与程序交互和动态切换上下文。除了系统角色以外，用户一般代表输入和助手一般代表输出.

token：这里不要理解为令牌，它的正确解释应该是一组向量的id。就是常见的描述大模型上下文长度的单位。一个token代表什么？互联网上有很多错误的解释，比较常见说法是：一个英文单词为1个token，一个中文通常是2-3个token。上面的流程介绍一节，我已经解释了“分析预测”与“采样推理”如何交互。“推理采样”生成1个token，反向输送给“分析预测”进行下一个token的预测，而输出模块可以选择token-by-token的方式向用户输出。实际上，对于中文而言，一个token通常表示一个分词。例如：“我爱中国”可能的分词结果是“我”，“爱”，“中国”也可能是“我”，“爱”，“中”，“国”。前者代表3个token，后者代表4个token。具体如何划分，取决于大模型的中文指令训练。除了常见的代表词语的token以外，还有一类特殊token（special token），例如上文提到的，大模型一个字一个字的进行推理生成，程序怎么知道何时结束？其实是有个eos-token，当读到这个token的时候，即表示本轮推理结束了.

3. 程序结构

llama.cpp的程序结构比较清晰，核心模块是llama和ggmll。ggml通过llama进行调用，开发通常不会直接使用。在llama中定义了常用的结构体和函数。common是对llama中函数功能的再次封装，有时候起到方便调用的目的。但是版本迭代上，common中的函数变化较快，最好的方法是看懂流程后直接调用llama.h中的函数.

4. 源码分析

下面我以examples/main/main.cpp作为基础做重点分析.

(1) 初始化

全局参数，这个结构体主要用来接收用户输入和后续用来初始化模型与推理上下文.

gpt_params params;

系统初始化函数:

llama_backend_init();
llama_numa_init(params.numa);

系统资源释放函数:

llama_backend_free();

创建模型和推理上下文:

llama_init_result llama_init = llama_init_from_gpt_params(params);

llama_model *model = llama_init.model;
llama_context *ctx = llama_init.context;

它声明在common.h中。如果你需要将模型和上下文分开创建可以使用llama.h中的另外两对函数:

llama_model_params model_params = llama_model_params_from_gpt_params(gpt_params_);
llama_model_ = llama_load_model_from_file(param.model.c_str(), model_params);

llama_context_params ctx_eval_params = llama_context_params_from_gpt_params(gpt_params_);
llama_context *ctx_eval = llama_new_context_with_model(llama_model_, ctx_eval_params);

创建ggml的线程池，这个过程可能和模型加速有关，代码中没有对它的详细解释:

struct ggml_threadpool * threadpool = ggml_threadpool_new(&tpp);

llama_attach_threadpool(ctx, threadpool, threadpool_batch);

除了完成一般的推理任务，llama.cpp还实现了上下文存储与读取。上下文切换的前提是不能换模型，且仅首次推理接收用户输入的prompt。利用这个特性，可以实现上下文的动态切换.

std::string path_session = params.path_prompt_cache;
std::vector<llama_token> session_tokens;

至此，有关系统初始化模块的过程已经完成.

(2) 用户输入

为了接收用户输入和推理输出，源码集中定义了几个变量:

std::vector<llama_token> embd_inp;

std::vector<llama_token> embd;

检查编码器，现代模型大多都没有明确定义的encodec 。

if (llama_model_has_encoder(model)) {
    int enc_input_size = embd_inp.size();
    llama_token * enc_input_buf = embd_inp.data();
    if (llama_encode(ctx, llama_batch_get_one(enc_input_buf, enc_input_size, 0, 0))) {
        LOG_TEE("%s : failed to eval\n", __func__);
        return 1;
    }
    llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
    if (decoder_start_token_id == -1) {
        decoder_start_token_id = llama_token_bos(model);
    }

    embd_inp.clear();
    embd_inp.push_back(decoder_start_token_id);
}

(3) 分析预测

分析预测部分的核心代码如下，我将处理关注力和session的逻辑删除，仅保留推理部分的逻辑.

// predict
if (!embd.empty()) {
    // Note: (n_ctx - 4) here is to match the logic for commandline prompt handling via
    // --prompt or --file which uses the same value.
    int max_embd_size = n_ctx - 4;

    // Ensure the input doesn't exceed the context size by truncating embd if necessary.
    if ((int) embd.size() > max_embd_size) {
        const int skipped_tokens = (int) embd.size() - max_embd_size;
        embd.resize(max_embd_size);

        console::set_display(console::error);
        printf("<<input too long: skipped %d token%s>>", skipped_tokens, skipped_tokens != 1 ? "s" : "");
        console::set_display(console::reset);
        fflush(stdout);
    }

    for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
        int n_eval = (int) embd.size() - i;
        if (n_eval > params.n_batch) {
            n_eval = params.n_batch;
        }

        LOG("eval: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, embd).c_str());

        if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval, n_past, 0))) {
            LOG_TEE("%s : failed to eval\n", __func__);
            return 1;
        }

        n_past += n_eval;

        LOG("n_past = %d\n", n_past);
        // Display total tokens alongside total time
        if (params.n_print > 0 && n_past % params.n_print == 0) {
            LOG_TEE("\n\033[31mTokens consumed so far = %d / %d \033[0m\n", n_past, n_ctx);
        }
    }
}

embd.clear();

逻辑的重点是：首先，如果推理的上下文长度超限，会丢弃超出部分。实际开发中可以考虑重构这个部分的逻辑。其次，每次推理都有一个处理数量限制（n_batch），这主要是为了当一次性输入的内容太多，系统不至于长时间无响应。最后，每次推理完成，embd都会被清理，推理完成后的信息会保存在ctx中.

(4) 推理采样

采样推理部分的源码分两个部分:

if ((int) embd_inp.size() <= n_consumed && !is_interacting) {
    // optionally save the session on first sample (for faster prompt loading next time)
    if (!path_session.empty() && need_to_save_session && !params.prompt_cache_ro) {
        need_to_save_session = false;
        llama_state_save_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size());

        LOG("saved session to %s\n", path_session.c_str());
    }

    const llama_token id = llama_sampling_sample(ctx_sampling, ctx, ctx_guidance);

    llama_sampling_accept(ctx_sampling, ctx, id, /* apply_grammar= */ true);

    LOG("last: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, ctx_sampling->prev).c_str());

    embd.push_back(id);

    // echo this to console
    input_echo = true;

    // decrement remaining sampling budget
    --n_remain;

    LOG("n_remain: %d\n", n_remain);
} else {
    // some user input remains from prompt or interaction, forward it to processing
    LOG("embd_inp.size(): %d, n_consumed: %d\n", (int) embd_inp.size(), n_consumed);
    while ((int) embd_inp.size() > n_consumed) {
        embd.push_back(embd_inp[n_consumed]);

        // push the prompt in the sampling context in order to apply repetition penalties later
        // for the prompt, we don't apply grammar rules
        llama_sampling_accept(ctx_sampling, ctx, embd_inp[n_consumed], /* apply_grammar= */ false);

        ++n_consumed;
        if ((int) embd.size() >= params.n_batch) {
            break;
        }
    }
}

首先要关注第2部分，这一段的逻辑是将用户的输入载入上下文中，由于用户的输入不需要推理，因此只需要调用llama_sampling_accept函数。第1部分只有当用户输入都完成以后才会进入，每次采样一个token，写进embd。这个过程和分析预测交替进行，直到遇到eos.

if (llama_token_is_eog(model, llama_sampling_last(ctx_sampling))) {
    LOG("found an EOG token\n");

    if (params.interactive) {
        if (params.enable_chat_template) {
            chat_add_and_format(model, chat_msgs, "assistant", assistant_ss.str());
        }
        is_interacting = true;
        printf("\n");
    }
}

chat_add_and_format函数只负责将所有交互过程记录在char_msgs中，对整个推理过程没有影响。如果要实现用户输出，可以在这里处理.

2、关键函数

通过gpt_params初始化llama_model_params 。

struct llama_model_params     llama_model_params_from_gpt_params    (const gpt_params & params);

创建大模型指针。

LLAMA_API struct llama_model * llama_load_model_from_file(
                             const char * path_model,
            struct llama_model_params     params);

创建ggml线程池和设置线程池。

GGML_API struct ggml_threadpool*         ggml_threadpool_new          (struct ggml_threadpool_params  * params);
LLAMA_API void llama_attach_threadpool(
               struct   llama_context * ctx,
            ggml_threadpool_t   threadpool,
            ggml_threadpool_t   threadpool_batch);

通过gpt_params初始化llama_context_params 。

struct llama_context_params   llama_context_params_from_gpt_params  (const gpt_params & params);

LLAMA_API struct llama_context * llama_new_context_with_model(
                     struct llama_model * model,
            struct llama_context_params   params);

对输入进行分词并转换成token 。

std::vector<llama_token> llama_tokenize(
  const struct llama_context * ctx,
           const std::string & text,
                        bool   add_special,
                        bool   parse_special = false);

获取特殊token 。

LLAMA_API llama_token llama_token_bos(const struct llama_model * model); // beginning-of-sentence
LLAMA_API llama_token llama_token_eos(const struct llama_model * model); // end-of-sentence
LLAMA_API llama_token llama_token_cls(const struct llama_model * model); // classification
LLAMA_API llama_token llama_token_sep(const struct llama_model * model); // sentence separator
LLAMA_API llama_token llama_token_nl (const struct llama_model * model); // next-line
LLAMA_API llama_token llama_token_pad(const struct llama_model * model); // padding

批量处理token并进行预测。

LLAMA_API struct llama_batch llama_batch_get_one(
                  llama_token * tokens,
                      int32_t   n_tokens,
                    llama_pos   pos_0,
                 llama_seq_id   seq_id);

LLAMA_API int32_t llama_decode(
            struct llama_context * ctx,
              struct llama_batch   batch);

执行采样和接收采样。

llama_token llama_sampling_sample(
        struct llama_sampling_context * ctx_sampling,
        struct llama_context * ctx_main,
        struct llama_context * ctx_cfg,
        int idx = -1);

void llama_sampling_accept(
        struct llama_sampling_context * ctx_sampling,
        struct llama_context * ctx_main,
        llama_token id,
        bool apply_grammar);

将token转成自然语言。

std::string llama_token_to_piece(
        const struct llama_context * ctx,
                       llama_token   token,
                       bool          special = true);

判断推理是否结束，注意，这个token可能和llama_token_eos获取的不一致。因此一定要通过这个函数判断。

// Check if the token is supposed to end generation (end-of-generation, eg. EOS, EOT, etc.)
LLAMA_API bool llama_token_is_eog(const struct llama_model * model, llama_token token);

3、总结

本文旨在介绍llama.cpp的基础用法，由于Georgi Gerganov更新较快，且缺少文档。因此可能有些解释不够准确。如果大家对框架和本文敢兴趣可以给我留言深入讨论.

最后此篇关于llama.cpp推理流程和常用函数介绍的文章就讲到这里了,如果你想了解更多关于llama.cpp推理流程和常用函数介绍的内容请搜索CFSDN的文章或继续浏览相关文章，希望大家以后支持我的博客！。

文章推荐：在Windows平台使用源码编译和安装PyTorch3D指定版本

文章推荐： Windows应急响应-Auto病毒

文章推荐：【VMwareVCF】使用PowerVCF连接和管理VMwareCloudFoundation环境。

文章推荐： [kubernetes]二进制方式部署单机k8s-v1.30.5

c++ - 类模板 cpp、hpp、cpp
这个问题在这里已经有了答案: Why can templates only be implemented in the header file? (18 个答案) 关闭 7 年前。我的 .hpp
yaml-cpp - 如何使用 yaml-cpp 发出带引号的字符串？
我想使用 yaml-cpp 发出一个带引号的字符串，所以它看起来像时间戳:“2011 年 8 月 10 日 01:37:52” 在输出yaml文件中。我该怎么做？谢谢。最佳答案 YAML::Emi
c++ - 即使我在 .cpp 文件中实例化虚拟对象，.cpp 文件内的模板函数定义也不起作用
我理解了模板的概念以及为什么我们需要在头文件中定义模板成员函数。另一种选择是在 cpp 文件中定义模板函数并显式实例化模板类，如下所示。模板.h #include using namespace
yaml-cpp - 如何使用 yaml-cpp 发出和解析原始二进制数据
是否可以发出和读取(解析)二进制数据(图像、文件等)？如下所示: http://yaml.org/type/binary.html我如何在 yaml-cpp 中执行此操作？最佳答案截至revisi
c++ - 如何从另一个 .cpp 文件中的一个 .cpp 文件调用函数？
我尝试查找此内容并使用头文件等得到混合结果。基本上我有多个 .cpp 文件，其中包含我为使用二叉树而制作的所有函数，BST , 链表等我不想复制和粘贴我需要的函数，我只想能够做一个: #inclu
yaml-cpp - 如何为特定的 yaml-cpp 节点设置发射样式
我正在发出一个 YAML 文档，如下所示: YAML::Node doc; // ...populate doc... YAML::Emitter out; out << doc; 在节点层次结构的某
c++ - 在另一个 .cpp 文件中访问一个 .cpp 文件中定义的全局变量
这个问题在这里已经有了答案: Access extern variable in C++ from another file (1 个回答) 关闭 4 年前。考虑以下场景: MyFile.cpp:
c++ - 尝试链接 .cpp 文件时出现多重定义错误(头文件中没有 .cpp)
所以我在上基础编程课，我们正在学习如何将文件链接在一起。问题是我遇到了一个似乎没有人能够修复的错误。我已经去过我的教授、学生助理和校园里的编程辅助实验室，但运气不佳。我还在这里搜索了至少 10 篇与
yaml-cpp - 使用 YAML-CPP 发布解析文件
在下面的代码中，我在使用 parser.GetNextDocument(doc); 解析我的 .yaml 文件时遇到了一些问题。经过大量调试后，我发现这里的(主要)问题是我的 for 循环没有运行，因
c++ - 如何将两个 cpp 添加到一个 cpp 程序文件
我们有以下类(class)考试成绩:完成本类(class)的学生中有 75 人参加了考试。我们想知道学生在考试中的表现如何，并给出了 75 名学生的分数。我们想编写一个程序，按以下方式总结和分析结果:
c++ - Main.cpp 无法访问头文件和其他 .cpp 文件中的变量和函数
主要.cpp #include #include #include #include "cootie.h" using namespace std; int main() { cout
c++ - 类在多个文件上的使用 .h .cpp main.cpp
试图制作电子鸡程序，但编译器抛出未定义的对“Tamagotchi::age()”错误的引用理想情况下，这段代码会返回电子鸡的年龄，它应该在开始时由类的构造函数初始化为 0。我显然在某个地方搞砸了，
c++ - 从另一个 .cpp 文件的主体编译一个 .cpp 文件
我一直在开发一个使用 Microsoft Visual Studio 2010 命令提示符编译原始 .cpp 文件并分析其输出的应用程序。我遇到了很多麻烦，网上似乎没有太多关于这个的资料。这是麻烦的代
c++ - 从另一个 .cpp 文件调用 cpp 函数时出错
我试图从另一个 .cpp 文件调用 c++ 函数。我使用了 .h header 。看看下面我做了什么。我有一个f.h文件: #ifndef PACKAGENAME_ADD_H #define PAC
C# 从 CPP 调用未知数量的参数的 CPP 函数
我在 CPP 中有一个函数，其原型(prototype)如下: char* complexFunction(char* arg1, ...); 我使用 DLLImport 属性从 C# 导入它。问题是
yaml-cpp - 包括没有可用 Boost 的 yaml-cpp -
也许这是一个幼稚的问题 - 但有没有办法构建/安装 yaml-cpp，以便在构建包含 yaml.h 的项目时不需要使用 Boost 库 header ？ IE:我正在开发一个使用 yaml-cpp 结
c++ - 有没有办法在同一项目的另一个 .cpp 中使用 .cpp 中声明的静态 void
我有一个在 .cpp 函数中声明的静态函数，我不能在 header 中声明它，因为它不应该是可见的。我想在同一项目的另一个 .cpp 中重新使用它。这有可能吗？最佳答案这里有两个问题: 这可能吗
php - 编译 php-cpp main.cpp 文件时出错
我正在使用 php-cpp 为我的 php 代码创建扩展，当我尝试编译 main.cpp 文件的简单结构时，我得到这个错误。这是编译错误: main.cpp:15:5: error: ‘PHPCPP_
c++ - 使用模板类时似乎无法包含除 main.cpp 以外的任何 cpp 文件
我决定将必要的代码减少到显示此错误所需的最低限度。我有一个存在于 hc_list.h 文件中的 STL 列表包装器模板类。完整代码如下: // hc_list.h file #ifndef HC_LI
c++ - AMQP-CPP RabbitMQ 构建集成到 CPP 项目
您好，我目前正在尝试通过 AMQPCPP 将 RabbitMQ 集成到我的 VisualStudio 项目中。我只能使用 Windows PC，这对安装来说是一件很痛苦的事情。我想我能够使用 CMAK

撒哈拉

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

llama.cpp推理流程和常用函数介绍

1、推理流程

1. 过程描述

2. 概念介绍

3. 程序结构

4. 源码分析

(1) 初始化

(2) 用户输入

(3) 分析预测

(4) 推理采样

2、关键函数

3、总结

首页

博学

6Ren·AI

商城

llama.cpp推理流程和常用函数介绍

﻿1、推理流程

1. 过程描述

2. 概念介绍

3. 程序结构

4. 源码分析

(1) 初始化

(2) 用户输入

(3) 分析预测

(4) 推理采样

2、关键函数

3、总结

1、推理流程