
openai-api - OpenAI API: How do I count tokens before(!) I send an API request?

Reprinted · Author: 行者123 · Updated: 2023-12-02 22:46:35

OpenAI's text models have a context length; for example, Curie's context length is 2049 tokens. The API provides the max_tokens and stop parameters to control the length of the generated sequence, so generation stops either when a stop token is produced or when max_tokens is reached.

The problem is: when generating text, I don't know how many tokens my prompt contains. Since I don't know that, I cannot set max_tokens = 2049 - number_tokens_in_prompt.

This prevents me from generating text dynamically for prompts of varying lengths. What I need is to keep generating until the stop token is reached.

My questions are:

  • How do I count the number of tokens in the Python API, so that I can set the max_tokens parameter accordingly?
  • Is there a way to set max_tokens to its maximum cap, so that I don't need to count the number of prompt tokens?

Best Answer

As the official OpenAI article says:

To further explore tokenization, you can use our interactive Tokenizer tool, which allows you to calculate the number of tokens and see how text is broken into tokens. Alternatively, if you'd like to tokenize text programmatically, use Tiktoken as a fast BPE tokenizer specifically used for OpenAI models. Other such libraries you can explore as well include the transformers package for Python or the gpt-3-encoder package for NodeJS.

A tokenizer can split a text string into a list of tokens, as described in the official OpenAI example on counting tokens with Tiktoken:

Tiktoken is a fast open-source tokenizer by OpenAI.

Given a text string (e.g., "tiktoken is great!") and an encoding (e.g., "cl100k_base"), a tokenizer can split the text string into a list of tokens (e.g., ["t", "ik", "token", " is", " great", "!"]).

Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you:

  • whether the string is too long for a text model to process and
  • how much an OpenAI API call costs (as usage is priced by token).

Tiktoken supports 3 encodings used by OpenAI models (source):

Encoding name    OpenAI models
cl100k_base      gpt-4, gpt-3.5-turbo, text-embedding-ada-002
p50k_base        text-davinci-003, text-davinci-002
r50k_base        GPT-3 models (text-curie-001, text-babbage-001, text-ada-001, davinci, curie, babbage, ada)

For the cl100k_base and p50k_base encodings, use tiktoken.

For the r50k_base encoding, tokenizers are available in many languages.

Note that gpt-3.5-turbo and gpt-4 use tokens in the same way as other models, as explained in the official OpenAI documentation:

Chat models like gpt-3.5-turbo and gpt-4 use tokens in the same way as other models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.

If a conversation has too many tokens to fit within a model's maximum limit (e.g., more than 4096 tokens for gpt-3.5-turbo), you will have to truncate, omit, or otherwise shrink your text until it fits. Beware that if a message is removed from the messages input, the model will lose all knowledge of it.

Note too that very long conversations are more likely to receive incomplete replies. For example, a gpt-3.5-turbo conversation that is 4090 tokens long will have its reply cut off after just 6 tokens.

How to use tiktoken?

  1. Install or upgrade tiktoken: pip install --upgrade tiktoken

  2. You have two options.

Option 1: Search the table above for the correct encoding of a given OpenAI model

If you run get_tokens_1.py, you will get the following output:

9

get_tokens_1.py

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

print(num_tokens_from_string("Hello world, let's test tiktoken.", "cl100k_base"))

Option 2: Use tiktoken.encoding_for_model() to automatically load the correct encoding for a given OpenAI model

If you run get_tokens_2.py, you will get the following output:

9

get_tokens_2.py

import tiktoken

def num_tokens_from_string(string: str, model_name: str) -> int:
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

print(num_tokens_from_string("Hello world, let's test tiktoken.", "gpt-3.5-turbo"))

Note: If you look closely at the usage field in the OpenAI API response, you will see that it reports 10 tokens for the same message. That's 1 token more than tiktoken reports. I still haven't figured out why. I tested this in the past (see my past answer). As @Jota mentioned in the comments below, there still seems to be a mismatch between the token usage reported by the OpenAI API response and tiktoken.

Regarding "openai-api - OpenAI API: How do I count tokens before(!) I send an API request?", there is a similar question on Stack Overflow: https://stackoverflow.com/questions/75804599/
