
python - Similarity measure/matrix for data (recommender system) - Python

Reposted. Author: 行者123. Updated: 2023-11-30 09:10:23

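I'm new to machine learning and am trying to solve the following problem. The input is two arrays of descriptions with the same length, and the output is an array of similarity scores: the first string in the first array compared with the first string in the second array, and so on.

Each item in the arrays (NumPy arrays) is a description string. Could you write a function that measures how similar two strings are by counting how many identical word IDs occur in both, and assigns a score? One possible weight could be based on the co-occurrence frequency relative to the sum of the frequencies of the individual word IDs. The function would then be applied to the two arrays to obtain an array of scores. If there are other approaches worth considering, please let me know as well. Thanks!

Data: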

array(['0/1/2/3/4/5/6/5/7/8/9/3/10', '11/12/13/14/15/15/16/17/12',
'18/19/20/21/22/23/24/25',
'26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41',
'5/42/43/15/44/45/46/47/48/26/49/50/51/52/49/53/54/51/55/56/22',
'57/58/59/60/61/49/62/23/57/58/63/57/58', '64/65/66/63/67/68/69',
'70/71/72/73/74/75/76/77',
'78/79/80/81/82/83/84/85/86/87/88/89/90/91',
'33/34/92/93/94/95/85/96/97/98/99/60/85/86/100/101/102/103',
'104/105/106/107/108/109/110/86/107/111/112/113/5/114/110/113/115/116',
'117/118/119/120/121/12/122/123/124/125',
'14/126/127/128/122/129/130/131/132/29/54/29/129/49/3/133/134/135/136',
'137/138/139/140/141/142',
'143/144/145/146/147/148/149/150/151/152/4/153/154/155/156/157/158/128/159',
'160/161/162/163/131/2/164/165/166/167/168/169/49/170/109/171',
'172/173/174/175/176/177/73/178/104/179/180/179/181/173',
'182/144/183/179/73',
'184/163/68/185/163/8/186/187/188/54/189/190/191',
'181/192/0/1/193/194/22/195',
'113/196/197/198/68/199/68/200/201/202/203/201',
'204/205/206/207/208/209/68/200',
'163/210/211/122/212/213/214/215/216/217/100/101/160/139/218/76/179/219',
'220/221/222/223/5/68/224/225/54/225/226/227/5/221/222/223',
'214/228/5/6/5/215/228/228/229',
'230/231/232/233/122/215/128/214/128/234/234',
'235/236/191/237/92/93/238/239',
'13/14/44/44/240/241/242/49/54/243/244/245/55/56',
'220/21/246/38/247/201/248/73/160/249/250/203/201',
'214/49/251/252/253/254/255/256/257/258'],
dtype='|S127')

array(['151/308/309/310/311/215/312/160/313/214/49/12',
'314/315/316/275/317/42/318/319/320/212/49/170/179/29/54/29/321/322/323',
'324/325/62/220/326/194/327/328/218/76/241/329',
'330/29/22/103/331/314/68/80/49',
'78/332/85/96/97/227/333/4/334/188',
'57/335/336/34/187/337/21/338/212/213/339/340',
'341/342/167/343/8/254/154/61/344',
'2/292/345/346/42/347/348/348/100/349/202/161/263',
'283/39/312/350/26/351', '352/353/33/34/144/218/73/354/355',
'137/356/357/358/357/359/22/73/170/87/88/78/123/360/361/53/362',
'23/363/10/364/289/68/123/354/355',
'188/28/365/149/366/98/367/368/369/370/371/372/368',
'373/155/33/34/374/25/113/73', '104/375/81/82/168/169/81/82/18/19',
'179/376/377/378/179/87/88/379/20',
'380/85/381/333/382/215/128/383/384', '385/129/386/387/388',
'389/280/26/27/390/391/302/392/393/165/394/254/302/214/217/395/396',
'397/398/291/140/399/211/158/27/400', '401/402/92/93/68/80',
'77/129/183/265/403/404/405/406/60/407/162/408/409/410/411/412/413/156',
'129/295/90/259/38/39/119/414/415/416/14/318/417/418',
'419/420/421/422/423/23/424/241/421/425/58',
'426/244/427/5/428/49/76/429/430/431',
'257/432/433/167/100/101/434/435/436', '437/167/438/344/356/170',
'439/440/441/442/192/443/68/80/444/445/111', '446/312/23/447/448',
'385/129/218/449/450/451/22/452/125/129/453/212/128/454/455/456/457/377'],
dtype='|S127')

Best answer

The following code should do what you need in Python 3.x:

import numpy as np  # the inputs are NumPy arrays of byte strings (dtype '|S127')
from collections import Counter

def jaccardSim(c1, c2):
    # Jaccard on multisets: for Counters, | keeps the max count per key
    # and & keeps the min count per key.
    cU = c1 | c2
    cI = c1 & c2
    sim = sum(cI.values()) / sum(cU.values())
    return sim

def byteArraySim(b1, b2):
    # Decode each byte string and count the word IDs it contains
    cA = [Counter(b1[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b1))]
    cB = [Counter(b2[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b2))]

    # Assuming 'b1' and 'b2' have the same length
    cSim = [jaccardSim(cA[i], cB[i]) for i in range(len(b1))]

    return cSim  # List of similarity scores

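This implementation uses the Jaccard similarity score, computed on the word-ID counts of each pair of strings. You can use other scores, such as cosine or Hamming, if you prefer.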
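As a rough illustration of the cosine alternative, the sketch below scores two Counter objects of word IDs. The function name cosine_sim and the two short sample strings are made up for this example and do not come from the original answer; returning 0.0 for an empty input is also an assumption.

```python
import math
from collections import Counter

def cosine_sim(c1, c2):
    """Cosine similarity between two Counters of word IDs (hypothetical helper)."""
    # Only word IDs present in both Counters contribute to the dot product
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    # Assumption: treat an empty description as having zero similarity
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Made-up sample strings in the same "id/id/id" format as the data
x = Counter("5/6/5/7".split("/"))
y = Counter("5/7/8".split("/"))
print(cosine_sim(x, y))
```

Unlike the Jaccard score, cosine similarity weights repeated word IDs by the product of their counts, so a word ID that appears often in both descriptions pulls the score up more strongly.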

Assuming the arrays are stored in variables a and b, calling byteArraySim(a, b) outputs the following similarity scores:

[0.0,
0.0,
0.0,
0.038461538461538464,
0.0,
0.041666666666666664,
0.0,
0.0,
0.0,
0.08,
0.0,
0.05555555555555555,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.058823529411764705,
0.0,
0.0,
0.0,
0.05555555555555555,
0.0,
0.0,
0.0,
0.0,
0.0]
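To see where one of these numbers comes from, here is a small sketch that recomputes the fourth score by hand from the fourth string of each array. The helper jaccard_sim is an illustrative restatement of the same Counter-based Jaccard idea, with a guard for an empty union added as an assumption.

```python
from collections import Counter

def jaccard_sim(c1, c2):
    """Multiset Jaccard similarity between two Counters (illustrative helper)."""
    union = c1 | c2   # max count per word ID
    inter = c1 & c2   # min count per word ID
    total = sum(union.values())
    # Assumption: return 0.0 instead of dividing by zero on an empty union
    return sum(inter.values()) / total if total else 0.0

# Fourth entry of each array from the question
a3 = b"26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41"
b3 = b"330/29/22/103/331/314/68/80/49"

c1 = Counter(a3.decode("utf-8").split("/"))
c2 = Counter(b3.decode("utf-8").split("/"))

shared = c1 & c2
print(dict(shared))              # only word ID '29' is shared
print(sum((c1 | c2).values()))   # size of the multiset union: 26
print(jaccard_sim(c1, c2))       # 1/26, the fourth score in the list
```

The two descriptions share a single word ID out of 26 in the union, which gives 1/26 ≈ 0.0385. Most pairs in this data share no word IDs at all, which is why the list is dominated by 0.0.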

Regarding python - Similarity measure/matrix for data (recommender system) - Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40570603/
