
python - How to find the number of similar addresses for each customer?

Reposted. Author: 行者123. Updated: 2023-12-01 09:18:40

I have a dataset with two columns, customer id and addresses:

id    addresses
1111  asturias 32, benito juarez, CDMX
1111  JOSE MARIA VELASCO, CDMX
1111  asturias 32 DEPT 401, INSURGENTES, CDMX
1111  deportes
1111  asturias 32, benito juarez, MIXCOAC, CDMX
1111  cd. de los deportes
1111  deportes, wisconsin
2222  TORRE REFORMA LATINO, CDMX
2222  PERISUR 2890
2222  WE WORK, CDMX
2222  WEWORK, TORRE REFORMA LATINO, CDMX
2222  PERISUR: 2690, COYOCAN
2222  TORRE REFORMA LATINO

I am interested in finding the number of distinct addresses for each customer. For example, for customer id 1111 there are 3 distinct addresses:

  1. [asturias 32, benito juarez, CDMX,
     asturias 32 DEPT 401, INSURGENTES, CDMX,
     asturias 32, benito juarez, MIXCOAC, CDMX]

  2. [JOSE MARIA VELASCO, CDMX]

  3. [deportes,
     cd. de los deportes,
     deportes, wisconsin]

I have written some Python code, but it can only show the similarity between two consecutive rows, row i and row i+1 (a score of 0 means completely dissimilar, 1 means identical).

id    addresses                                    score
1111  asturias 32, benito juarez, CDMX             0
1111  JOSE MARIA VELASCO, CDMX                     0
1111  asturias 32 DEPT 401, INSURGENTES, CDMX      0
1111  deportes                                     0
1111  asturias 32, benito juarez, MIXCOAC, CDMX    0
1111  cd. de los deportes                          0.21
1111  deportes, wisconsin                          0
2222  TORRE REFORMA LATINO, CDMX                   0
2222  PERISUR 2890                                 0
2222  WE WORK, CDMX                                0.69
2222  WEWORK, TORRE REFORMA LATINO, CDMX           0
2222  PERISUR: 2690, COYOCAN                       0
2222  TORRE REFORMA LATINO
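For reference, the score column is a TF-IDF cosine similarity between consecutive rows. The underlying idea can be sketched with plain bag-of-words counts (a simplified stand-in for the TF-IDF version in the code below, with no stemming, stop words, or punctuation handling):

```python
import math
from collections import Counter

def bow_cosine(text1, text2):
    """Cosine similarity between simple word-count vectors."""
    a, b = Counter(text1.lower().split()), Counter(text2.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(round(bow_cosine("deportes wisconsin", "deportes"), 2))  # shared word -> 0.71
print(bow_cosine("deportes", "asturias 32"))                   # no overlap -> 0.0
```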

If the score is > 0.20, I consider the two rows to be the same address. Here is my code:

import nltk
import numpy as np
import pandas as pd
import string
from sklearn.feature_extraction.text import TfidfVectorizer

data = pd.read_csv('address.csv')
nltk.download('punkt')
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    return stem_tokens(
        nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0, 1]

# compare each row only with the next one
for i in range(len(data) - 1):
    print(cosine_sim(data['addresses'][i], data['addresses'][i + 1]))

But the code above cannot compare every possible pair of rows for a given customer id. I would like output such as:

id    unique address
1111  3
2222  3
3333  2

Best Answer

You can use combinations from itertools to achieve this. See the full code below.

Note that I am using a semicolon-separated CSV file.

Also, if needed, you could use the similarity function from spaCy to measure the similarity between two phrases. Here I have used the same function you provided.
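For reference, itertools.combinations(seq, 2) yields every unordered pair exactly once, which is what lets the loop below score each address against every other address of the same customer. A minimal illustration:

```python
from itertools import combinations

addresses = ["PERISUR 2890", "WE WORK, CDMX", "TORRE REFORMA LATINO"]

# every unordered pair, each exactly once: n addresses -> n*(n-1)/2 pairs
pairs = list(combinations(addresses, 2))
print(len(pairs))  # 3
print(pairs[0])    # ('PERISUR 2890', 'WE WORK, CDMX')
```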

import nltk
import numpy as np
import pandas as pd
import itertools
import string
from sklearn.feature_extraction.text import TfidfVectorizer


def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    return stem_tokens(
        nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0, 1]

def group_addresses(addresses):
    '''merge the lists if they have an element in common'''
    out = []
    while len(addresses) > 0:
        first, *rest = addresses
        first = set(first)
        lf = -1
        while len(first) > lf:
            lf = len(first)

            rest2 = []
            for r in rest:
                if len(first.intersection(set(r))) > 0:
                    first |= set(r)
                else:
                    rest2.append(r)
            rest = rest2

        out.append(first)
        addresses = rest
    return out


df = pd.read_csv("address.csv", sep=";")
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

sim_df = pd.DataFrame(columns=['id', 'unique address'])

for customer in set(df['id']):
    customer_addresses = df.loc[df['id'] == customer]['addresses']  # addresses of this customer
    all_entries = [[adr] for adr in customer_addresses]  # one singleton list per address
    # find all pairs whose similarity is greater than 0.2
    sim_pairs = [[text1, text2]
                 for text1, text2 in itertools.combinations(customer_addresses, 2)
                 if cosine_sim(text1, text2) > 0.2]
    all_entries.extend(sim_pairs)
    groups = group_addresses(all_entries)
    sim_df.loc[len(sim_df)] = [customer, len(groups)]
    print(customer, len(groups))

The output looks like:

2222 2
1111 3

The groups formed are:

2222
['WE WORK, CDMX', 'WEWORK, TORRE REFORMA LATINO, CDMX', 'TORRE REFORMA LATINO, CDMX', 'TORRE REFORMA LATINO']
['PERISUR 2890', 'PERISUR: 2690, COYOCAN']

1111
['asturias 32 DEPT 401, INSURGENTES, CDMX', 'asturias 32, benito juarez, MIXCOAC, CDMX', 'asturias 32, benito juarez, CDMX']
['JOSE MARIA VELASCO, CDMX']
['deportes, wisconsin', 'cd. de los deportes', 'deportes']
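The grouping above is effectively a connected-components merge: any two lists that share an address end up in the same group. A compact standalone sketch of the same behavior (an illustrative equivalent, not the answer's exact group_addresses):

```python
def merge_groups(lists):
    """Merge lists transitively: lists sharing an element join one group."""
    groups = []
    for items in lists:
        merged, rest = set(items), []
        # absorb every existing group that overlaps the new set
        for g in groups:
            if merged & g:
                merged |= g
            else:
                rest.append(g)
        groups = rest + [merged]
    return groups

print(merge_groups([["a"], ["b"], ["a", "b"], ["c"]]))  # two groups: {'a', 'b'} and {'c'}
```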

Regarding "python - How to find the number of similar addresses for each customer?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50995179/
