python - LangChain conversational retrieval with JSONLoader

Reposted by: 行者123 · Updated: 2023-12-02 05:48:19

I modified the data loader in this source code, https://github.com/techleadhd/chatgpt-retrieval, so that ConversationalRetrievalChain accepts JSON data.

I created a dummy JSON file that follows the JSON structure described in the LangChain documentation.

{
  "reviews": [
    {"text": "Great hotel, excellent service and comfortable rooms."},
    {"text": "I had a terrible experience at this hotel. The room was dirty and the staff was rude."},
    {"text": "Highly recommended! The hotel has a beautiful view and the staff is friendly."},
    {"text": "Average hotel. The room was okay, but nothing special."},
    {"text": "I absolutely loved my stay at this hotel. The amenities were top-notch."},
    {"text": "Disappointing experience. The hotel was overpriced for the quality provided."},
    {"text": "The hotel exceeded my expectations. The room was spacious and clean."},
    {"text": "Avoid this hotel at all costs! The customer service was horrendous."},
    {"text": "Fantastic hotel with a great location. I would definitely stay here again."},
    {"text": "Not a bad hotel, but there are better options available in the area."}
  ]
}

The code is:

import os
import sys

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import JSONLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.vectorstores import Chroma

os.environ["OPENAI_API_KEY"] = 'YOUR_API_KEY_HERE'

# Enable to save to disk & reuse the model (for repeated queries on the same data)
PERSIST = False

query = None
if len(sys.argv) > 1:
  query = sys.argv[1]


if PERSIST and os.path.exists("persist"):
  print("Reusing index...\n")
  vectorstore = Chroma(persist_directory="persist", embedding_function=OpenAIEmbeddings())
  index = VectorStoreIndexWrapper(vectorstore=vectorstore)
else:

  # Turn every element of "reviews" into its own document, using its "text" field as the content
  loader = JSONLoader("data/review.json", jq_schema=".reviews[]", content_key='text')

  if PERSIST:
    index = VectorstoreIndexCreator(vectorstore_kwargs={"persist_directory":"persist"}).from_loaders([loader])
  else:
    index = VectorstoreIndexCreator().from_loaders([loader])

chain = ConversationalRetrievalChain.from_llm(
  llm=ChatOpenAI(model="gpt-3.5-turbo"),
  retriever=index.vectorstore.as_retriever()
)

chat_history = []
while True:
  if not query:
    query = input("Prompt: ")
  if query in ['quit', 'q', 'exit']:
    sys.exit()
  result = chain({"question": query, "chat_history": chat_history})
  print(result['answer'])

  chat_history.append((query, result['answer']))
  query = None

Some sample results:

Prompt: can you summarize the data?
Sure! Based on the provided feedback, we have a mix of opinions about the hotels. One person found it to be an average hotel with nothing special, another person had a great experience with excellent service and comfortable rooms, another person was pleasantly surprised by a hotel that exceeded their expectations with spacious and clean rooms, and finally, someone had a disappointing experience with an overpriced hotel that didn't meet their expectations in terms of quality.

Prompt: how many feedbacks present in the data ?
There are four feedbacks present in the data.

Prompt: how many of them are positive (sentiment)?
There are four positive feedbacks present in the data.

Prompt: how many of them are negative?
There are three negative feedbacks present in the data.

Prompt: how many of them are neutral?
Two of the feedbacks are neutral.

Prompt: what is the last review you can see?
The most recent review I can see is: "The hotel exceeded my expectations. The room was spacious and clean."

Prompt: what is the first review you can see?
The first review I can see is "Highly recommended! The hotel has a beautiful view and the staff is friendly."

Prompt: how many total texts are in the JSON file?
I don't know the answer.

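I can chat with my data, but apart from the first answer, all the other answers are wrong.

Is there a problem with JSONLoader or the jq_schema? How can I adjust the code so that it produces the expected output?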

Best answer

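In ConversationalRetrievalChain, the search returns 4 documents by default; see top_k_docs_for_context: int = 4 in ../langchain/chains/conversational_retrieval/base.py. The model therefore only ever sees 4 of your 10 reviews for each question, which is why it reports four feedbacks.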
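To make the effect of that limit concrete, here is a minimal sketch of top-k retrieval in plain Python. It is not LangChain's implementation: the word-overlap score is only an assumed stand-in for embedding similarity, and the names tokenize and top_k are hypothetical. It just shows that whatever the scoring, only k documents reach the model.

```python
import re

# The ten review texts from the JSON file in the question.
reviews = [
    "Great hotel, excellent service and comfortable rooms.",
    "I had a terrible experience at this hotel. The room was dirty and the staff was rude.",
    "Highly recommended! The hotel has a beautiful view and the staff is friendly.",
    "Average hotel. The room was okay, but nothing special.",
    "I absolutely loved my stay at this hotel. The amenities were top-notch.",
    "Disappointing experience. The hotel was overpriced for the quality provided.",
    "The hotel exceeded my expectations. The room was spacious and clean.",
    "Avoid this hotel at all costs! The customer service was horrendous.",
    "Fantastic hotel with a great location. I would definitely stay here again.",
    "Not a bad hotel, but there are better options available in the area.",
]


def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def top_k(query, docs, k=4):
    """Return the k documents sharing the most words with the query.

    Word overlap is an assumption made for illustration; a real
    vector store ranks by embedding similarity instead.
    """
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]


question = "how many feedbacks are present in the data?"
print(len(top_k(question, reviews)))        # 4: the model sees 4 of 10 reviews
print(len(top_k(question, reviews, k=10)))  # 10: the model sees every review
```

With k=4 the model can only count, rank, or summarize the four reviews it was handed, which matches the answers shown in the question.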

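A small default makes sense, because you do not want to send every stored document to the LLM (that also has a cost). Depending on the use case, you can change the default to a more suitable value with: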

chain = ConversationalRetrievalChain.from_llm(
  llm=ChatOpenAI(model="gpt-3.5-turbo"),
  retriever=index.vectorstore.as_retriever(search_kwargs={"k": 10})
)

With this change, you get the result:

{'question': 'how many feedbacks present in the data ?',
 'chat_history': [],
 'answer': 'There are 10 pieces of feedback present in the data.'}
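If you want to rule out the loader itself, a quick check is to count the records that the jq_schema should select. The sketch below imitates jq_schema=".reviews[]" with content_key='text' using only the standard library; it does not call JSONLoader, and the function name load_review_texts is hypothetical. The JSON is built inline so the example runs on its own; in practice you would read data/review.json.

```python
import json


def load_review_texts(json_text):
    """Imitate jq '.reviews[]' plus content_key='text' with the json module."""
    data = json.loads(json_text)
    return [item["text"] for item in data["reviews"]]


# Inline copy of the file from the question, so the sketch is self-contained.
sample_json = json.dumps({
    "reviews": [
        {"text": "Great hotel, excellent service and comfortable rooms."},
        {"text": "I had a terrible experience at this hotel. The room was dirty and the staff was rude."},
        {"text": "Highly recommended! The hotel has a beautiful view and the staff is friendly."},
        {"text": "Average hotel. The room was okay, but nothing special."},
        {"text": "I absolutely loved my stay at this hotel. The amenities were top-notch."},
        {"text": "Disappointing experience. The hotel was overpriced for the quality provided."},
        {"text": "The hotel exceeded my expectations. The room was spacious and clean."},
        {"text": "Avoid this hotel at all costs! The customer service was horrendous."},
        {"text": "Fantastic hotel with a great location. I would definitely stay here again."},
        {"text": "Not a bad hotel, but there are better options available in the area."},
    ]
})

texts = load_review_texts(sample_json)
print(len(texts))  # 10 records selected, so the schema matches the file
```

If this count matches the number of reviews in your file, the schema is selecting the right records and the remaining limit is the retriever's k. Setting k to the full record count is only reasonable for a dataset this small.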

Regarding python - LangChain conversational retrieval with JSONLoader, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/76670856/
