
python - Pandas: Map various similar substrings to a single standard format

Reposted. Author: 太空宇宙. Updated: 2023-11-04 01:54:46

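A DataFrame column contains company names as substrings in a variety of formats, and these need to be mapped to a fixed representation of each company name. The known formats are recorded in sotest.json: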

{
    "ABERCOMBIEFITCH": ["A&F", "A & F", "A& F", "ABERCOMBIE & FITCH"],
    "COCACOLA": ["COKE", "COCA-COLA", "COCACOLA"]
}

This JSON is read into a DataFrame as follows:

import json
import pandas as pd

with open('sotest.json') as tf:
    testdata = json.load(tf)

# One row per standard name; the column holds the list of known variants
indexlist = []
itemslist = []
for k, v in testdata.items():
    indexlist.append(k)
    itemslist.append(v)
sojsondf = pd.DataFrame({'AssortedNames': itemslist}, index=indexlist)

Here is a test DataFrame:

namesdf = pd.DataFrame(
    data=["A&F Ltd", "A & F CO", "A& F COMPANY", "ABERCOMBIE & FITCH LIMITED",
          "COKE M/S", "COCA-COLA COMPANY", "COCACOLA BOTTLING CO", "SONY"],
    columns=['RecordedCompanyName'])

The following function is applied to the column of the DataFrame above to get the standardized output:

def sorowchecker(inputstring, sojsondf):
    # Return the index label of the first row that has a variant occurring
    # as a substring of inputstring; otherwise return the default label
    for i, row in sojsondf.iterrows():
        if any(sponsor in inputstring for sponsor in row['AssortedNames']):
            return i
    return "DIRECTMARKETING"

Usage of the function above:

namesdf['Company'] = namesdf['RecordedCompanyName'].apply(sorowchecker, args=(sojsondf,))

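In practice, namesdf.shape[0] is about 60k and sojsondf.shape[0] is about 50, so the program takes quite a long time. Any suggestions on how to make sorowchecker() run faster, and/or any other improvements (extra kudos for anything that uses concurrency)? Thanks.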

Best answer

IIUC, you don't need to create a new DataFrame; just use the dictionary to build a reverse dictionary and map:

import json

with open('sotest.json') as tf:
    testdata = json.load(tf)

# reverse dictionary: each variant spelling -> its standard name
backward = {x: k for k, v in testdata.items() for x in v}

# pattern to check whether any variant occurs in the names
pattern = '|'.join(backward.keys())

# output:
(namesdf['RecordedCompanyName']
    .str.extract(f'({pattern})')[0]   # extract the first matching variant
    .map(backward)                    # convert the variant to the standard name
    .fillna('DIRECTMARKETING')        # replace non-matches with the default
)

Output:

0    ABERCOMBIEFITCH
1    ABERCOMBIEFITCH
2    ABERCOMBIEFITCH
3    ABERCOMBIEFITCH
4           COCACOLA
5           COCACOLA
6           COCACOLA
7    DIRECTMARKETING
Name: 0, dtype: object
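
The answer's pattern joins the variants with '|' exactly as they appear in the JSON. That works for the sample data, but a variant containing a regex metacharacter (for example "." or "+") would change what the pattern matches, and a short variant listed before a longer one that starts with it would win the alternation. The following is a minimal sketch of one way to guard against both, using only the standard library. The function name standardize and the variable names are illustrative and do not come from the original post, and the sample mapping is copied inline instead of being read from sotest.json.

```python
import re

# Sample mapping copied from sotest.json in the question.
testdata = {
    "ABERCOMBIEFITCH": ["A&F", "A & F", "A& F", "ABERCOMBIE & FITCH"],
    "COCACOLA": ["COKE", "COCA-COLA", "COCACOLA"],
}

# Reverse dictionary: variant spelling -> standard name.
backward = {variant: name
            for name, variants in testdata.items()
            for variant in variants}

# Escape each variant so every character is matched literally, and list
# longer variants first so a shorter variant cannot shadow a longer one.
ordered_variants = sorted(backward, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(v) for v in ordered_variants))


def standardize(recorded_name, default="DIRECTMARKETING"):
    """Return the standard name for the leftmost variant found, else default."""
    match = pattern.search(recorded_name)
    return backward[match.group(0)] if match else default


names = ["A&F Ltd", "A & F CO", "A& F COMPANY", "ABERCOMBIE & FITCH LIMITED",
         "COKE M/S", "COCA-COLA COMPANY", "COCACOLA BOTTLING CO", "SONY"]
results = [standardize(n) for n in names]
```

With pandas, the same function could be applied as namesdf['RecordedCompanyName'].map(standardize), or pattern.pattern could be passed to .str.extract in place of the unescaped pattern. One behavioral difference from sorowchecker() is worth keeping in mind: if a string contains variants of two different companies, the regex picks the leftmost match in the string, whereas the original loop picks whichever company comes first in the JSON.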

Regarding python - Pandas: Map various similar substrings to a single standard format, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57042449/
