
Python: Computing Cosine Similarity Between Two Directories of Files

Reposted · Author: 行者123 · Updated: 2023-12-04 14:02:07

I have two directories of files. One contains human-transcribed files and the other contains files transcribed by IBM Watson. The two directories hold the same number of files, all transcribed from the same phone recordings.

I am using spaCy to compute the cosine similarity between matching pairs of files and printing or storing the result alongside the names of the compared files. I have tried iterating with a function as well as a for loop, but I cannot find a way to iterate over both directories at once, compare the files at matching indices, and print the results.

Here is my current code:

# iterate through files in both directories
for human_file, api_file in os.listdir(human_directory), os.listdir(api_directory):
    # set the documents to be compared and parse them through the small spacy nlp model
    human_model = nlp_small(open(human_file).read())
    api_model = nlp_small(open(api_file).read())

    # print similarity score with the names of the compared files
    print("Similarity using small model:", human_file, api_file, human_model.similarity(api_model))

I have gotten it to iterate through a single directory and confirmed it produces the expected output by printing the file names, but it does not work when using both directories at once. I have also tried something like this:

# define directories
human_directory = os.listdir("./00_data/Human Transcripts")
api_directory = os.listdir("./00_data/Watson Scripts")

# function for cosine similarity of files in two directories using small model
def nlp_small(human_directory, api_directory):
    for i in (0, (len(human_directory) - 1)):
        print(human_directory[i], api_directory[i])

nlp_small(human_directory, api_directory)

which returns:

human_10.txt watson_10.csv
human_9.txt watson_9.csv

But that is only two of the files, not all 17.

Any pointers on iterating over matching indices across two directories would be much appreciated.

Edit: thanks to @kevinjiang, here is the working block of code:

# set the directories containing transcripts
human_directory = os.path.join(os.getcwd(), "00_data\Human Transcripts")
api_directory = os.path.join(os.getcwd(), "00_data\Watson Scripts")

# iterate through files in both directories
for human_file, api_file in zip(os.listdir(human_directory), os.listdir(api_directory)):
    # set the documents to be compared and parse them through the small spacy nlp model
    human_model = nlp_small(open(os.path.join(os.getcwd(), "00_data\Human Transcripts", human_file)).read())
    api_model = nlp_small(open(os.path.join(os.getcwd(), "00_data\Watson Scripts", api_file)).read())

    # print similarity score with the names of the compared files
    print("Similarity using small model:", human_file, api_file, human_model.similarity(api_model))

Here is most of the output (the loop stopped at a UTF-16 character in one of the files that needs fixing):

nlp_small = spacy.load('en_core_web_sm')
Similarity using small model: human_10.txt watson_10.csv 0.9274665883462793
Similarity using small model: human_11.txt watson_11.csv 0.9348740684005554
Similarity using small model: human_12.txt watson_12.csv 0.9362025469343344
Similarity using small model: human_13.txt watson_13.csv 0.9557355330988958
Similarity using small model: human_14.txt watson_14.csv 0.9088701120190216
Similarity using small model: human_15.txt watson_15.csv 0.9479464053189846
Similarity using small model: human_16.txt watson_16.csv 0.9599724037676819
Similarity using small model: human_17.txt watson_17.csv 0.9367605599306302
Similarity using small model: human_18.txt watson_18.csv 0.8760760037870665
Similarity using small model: human_2.txt watson_2.csv 0.9184563762823503
Similarity using small model: human_3.txt watson_3.csv 0.9287452822270265
Similarity using small model: human_4.txt watson_4.csv 0.9415664367046419
Similarity using small model: human_5.txt watson_5.csv 0.9158895909429551
Similarity using small model: human_6.txt watson_6.csv 0.935313240861153
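One likely workaround for the character-encoding error mentioned above (a sketch, assuming the transcripts are meant to be UTF-8): pass errors="replace" to open(), which substitutes the Unicode replacement character for undecodable bytes instead of raising an exception.

```python
import os
import tempfile

def read_transcript(path):
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # instead of raising UnicodeDecodeError
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()

# demo: a file containing a byte that is not valid UTF-8
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"hello \xff world")

text = read_transcript(path)
os.remove(path)
print(text)  # 'hello \ufffd world'
```

Alternatively, errors="ignore" drops the bad bytes entirely; "replace" is usually preferable because it leaves a visible marker of where the damage was.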

Once I fix the character-encoding error, I will wrap this in a function so I can call either the large or small model on the two directories for the rest of the APIs I have to test.
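One possible shape for that wrapper (a sketch; compare_directories is a hypothetical name, and the loaded model is passed in as a parameter so the same function works with en_core_web_sm or en_core_web_lg — anything callable that returns objects with a .similarity() method). sorted() is added because os.listdir() order is not guaranteed across platforms:

```python
import os

def compare_directories(human_dir, api_dir, nlp):
    """Compute pairwise similarity between matching files in two directories.

    `nlp` is any callable that turns text into an object with a
    .similarity() method (e.g. a loaded spaCy pipeline).
    Returns {(human_file, api_file): score}.
    """
    results = {}
    # sorted() makes the pairing deterministic regardless of OS listing order
    for human_file, api_file in zip(sorted(os.listdir(human_dir)),
                                    sorted(os.listdir(api_dir))):
        with open(os.path.join(human_dir, human_file),
                  encoding="utf-8", errors="replace") as f:
            human_doc = nlp(f.read())
        with open(os.path.join(api_dir, api_file),
                  encoding="utf-8", errors="replace") as f:
            api_doc = nlp(f.read())
        results[(human_file, api_file)] = human_doc.similarity(api_doc)
    return results
```

With spaCy this would be called as, e.g., compare_directories(human_directory, api_directory, spacy.load("en_core_web_sm")).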

Best Answer

Two small errors are preventing you from looping. In your second example, the for loop only iterates over index 0 and index (len(human_directory) - 1). Instead, you should write for i in range(len(human_directory)):, which will let you loop over both directories in full.
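A quick illustration of the difference: the literal tuple (0, len(files) - 1) contains just two values, so the loop only visits the first and last index, while range(len(files)) yields every index.

```python
files = ["a.txt", "b.txt", "c.txt", "d.txt"]

# the tuple (0, len(files) - 1) holds exactly two indices
tuple_indices = [i for i in (0, len(files) - 1)]
# range(len(files)) yields every index
range_indices = [i for i in range(len(files))]

print(tuple_indices)  # [0, 3]
print(range_indices)  # [0, 1, 2, 3]
```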

As for your first example, I think you are probably getting some kind of "too many values to unpack" error. To loop over two iterables at the same time, use zip(), so it should look like:

for human_file, api_file in zip(os.listdir(human_directory), os.listdir(api_directory)):
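As a quick demonstration with made-up file names, zip() pairs elements by position, which is exactly the matched-index iteration the question asks for:

```python
human_files = ["human_2.txt", "human_3.txt", "human_4.txt"]
api_files = ["watson_2.csv", "watson_3.csv", "watson_4.csv"]

# zip() yields one tuple per position: (human_files[i], api_files[i])
pairs = list(zip(human_files, api_files))
print(pairs)
# [('human_2.txt', 'watson_2.csv'), ('human_3.txt', 'watson_3.csv'), ('human_4.txt', 'watson_4.csv')]
```

Note that zip() stops at the shorter iterable, so if one directory is missing a file the extra files in the other directory are silently skipped.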

For a similar question about computing cosine similarity between two directories of files in Python, see Stack Overflow: https://stackoverflow.com/questions/69653164/
