I just deployed the Nous-Hermes-Llama2-70b model on 2x NVIDIA A100 GPUs through Hugging Face Inference Endpoints.
When I tried the following code, the generated responses were incomplete sentences less than one line long.
import requests

API_URL = 'https://myendpoint.us-east-1.aws.endpoints.huggingface.cloud'
headers = {
    "Authorization": "Bearer mytoken1234",
    "Content-Type": "application/json"
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "### Instruction:\r\nCome up with a joke about cats\r\n### Response:\r\n",
})
The output in this case was:
"Why don't cats play poker in the jungle?
Because "
As you can see, the response stopped after nine words.
Do I need to add more headers to the request, such as temperature and max token length? How would I do that? What do I need to do to get complete, full-length responses?
Here is the model I'm using: https://huggingface.co/NousResearch/Nous-Hermes-Llama2-70b
More answers
With most generative AI, you can either wait for the full generation to complete and get the entire result back (which can be very slow if there is a lot of text), or you can stream the results in real time as they are produced. I'm not sure whether Hugging Face Inference Endpoints need to be treated specially in order to stream the result back in real time, but given how things are behaving, that certainly seems to be the case. I highly recommend looking at other examples and determining how to check whether the API is meant to stream its results back or not.
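If you do want token-by-token streaming rather than one blocking response, here is a minimal sketch using the huggingface_hub InferenceClient pointed at a dedicated endpoint URL. The endpoint URL and token are the placeholders from the question, and the generation parameters are assumptions to adapt to your setup.

from huggingface_hub import InferenceClient

# Placeholder endpoint URL and token from the question; replace with your own.
client = InferenceClient(
    model="https://myendpoint.us-east-1.aws.endpoints.huggingface.cloud",
    token="mytoken1234",
)

prompt = "### Instruction:\nCome up with a joke about cats\n### Response:\n"

# stream=True yields generated text chunks as they are produced,
# instead of waiting for the whole completion to finish.
for chunk in client.text_generation(prompt, max_new_tokens=256, stream=True):
    print(chunk, end="", flush=True)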
Recommended answer
Added "max_new_tokens" => 256 as a parameter, fixed it.
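For reference, with a text-generation endpoint the generation options go in a "parameters" object next to "inputs" in the JSON payload, not in the HTTP headers. A minimal sketch building on the query() helper from the question; the specific values are illustrative and worth tuning:

output = query({
    "inputs": "### Instruction:\r\nCome up with a joke about cats\r\n### Response:\r\n",
    "parameters": {
        "max_new_tokens": 256,       # allow longer completions; the default is quite small
        "temperature": 0.7,          # optional: sampling temperature
        "return_full_text": False,   # optional: return only the completion, not the prompt
    },
})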