
python - How do I compare two strings in large pandas DataFrames (Python 3.x)?

Reposted. Author: 行者123. Updated: 2023-12-01 07:58:56

I have two DataFrames loaded from two Excel files.

First file (awcProjectMaster), 1,500 records:

projectCode    projectName
100101 kupwara
100102 kalaroos
100103 tangdar

Second file (villageMaster), more than 10 million records:

villageCode    villageName
425638 wara
783651 tangdur
986321 kalaroo

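I need to compare each project name with each village name and record the match percentage. The code below works, but it is slow. How can I do the same thing more efficiently?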

import pandas as pd
from datetime import datetime

df = pd.read_excel("C:\\Users\\Desktop\\awcProjectMaster.xlsx")
df1 = pd.read_excel("C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.xlsx")


def compare(prjCode, prjName, stCode, stName, dCode, dName, sdCode, sdName, vCode, vName):
    with open(r"C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.txt", "a") as f:
        percentMatch = 0
        vLen = len(vName)
        prjLen = len(prjName)
        if vLen > prjLen:
            if vName.find(prjName) != -1:
                percentMatch = (prjLen / vLen) * 100
                f.write(prjCode + "," + prjName + "," + vCode + "," + vName + "," + str(round(percentMatch)) + "," + stCode + "," + stName + "," + dCode + "," + dName + sdCode + "," + sdName + "\n")
            else:
                res = 0
                # print(res)
        elif prjLen >= vLen:
            if prjName.find(vName) != -1:
                percentMatch = (vLen / prjLen) * 100
                f.write(prjCode + "," + prjName + "," + vCode + "," + vName + "," + str(round(percentMatch)) + "," + stCode + "," + stName + "," + dCode + "," + dName + sdCode + "," + sdName + "\n")
            else:
                res = 0
                # print(res)
        f.close()


for idx, row in df.iterrows():
    for idxv, r in df1.iterrows():
        compare(
            str(row["ProjectCode"]),
            row["ProjectName"].lower(),
            str(r["StateCensusCode"]),
            r["StateName"],
            str(r["DistrictCode"]),
            r["DistrictName"],
            str(r["SubDistrictCode"]),
            r["SubDistrictNameInEnglish"],
            str(r["VillageCode"]),
            r["VillageNameInEnglish"].lower(),
        )

Best answer

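Your string distance metric isn't very precise, but if it works for you, that's fine. (You may want to look into other options, though, such as the built-in difflib module or the Python-Levenshtein module.)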
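As a rough sketch of the difflib option: SequenceMatcher from the standard library returns a similarity ratio between 0 and 1, and it also scores near-misses such as tangdar/tangdur that a pure substring test rejects. The helper name similarity_percent is made up for illustration; the sample names come from the question.

```python
from difflib import SequenceMatcher


def similarity_percent(a, b):
    """Return a 0-100 similarity score for two strings (hypothetical helper)."""
    # ratio() is 2 * M / T, where M is the number of matched characters
    # and T is the combined length of both strings.
    return SequenceMatcher(None, a, b).ratio() * 100


# "tangdur" is not a substring of "tangdar", yet the two names are clearly close.
print(round(similarity_percent("tangdar", "tangdur")))
```

Keep in mind that SequenceMatcher is considerably slower per pair than a substring check, so on 1,500 x 10,000,000 pairs it would probably need to be combined with some cheaper pre-filter.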

If you really do need to compare 1,500 x 10,000,000 records pairwise, it is bound to take a while, but there are a couple of easy things we can do to speed it up:

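  • Open the log file only once; reopening it for every comparison carries overhead, sometimes a lot of it.
  • Refactor the comparison into a separate function and apply the lru_cache() memoization decorator, so that a pair that has already been compared is answered from an in-memory cache instead of being recomputed. (Also note how the vName/prjName pair is sorted: since the order of the two strings doesn't matter to the score, the cache needs only half as many entries.)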

Then, for general tidiness:

  • Use the csv module to stream the CSV to the file (the output format differs slightly from your code's, but you can change that with the dialect parameter of csv.writer()).
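A small sketch of what csv.writer does with the rows, using an in-memory buffer in place of the real log file. The second row is made up purely to show quoting, and the formatting parameters shown are just one possible choice.

```python
import csv
import io

buffer = io.StringIO()  # stands in for the real log file in this sketch

# Formatting parameters can be passed directly or bundled into a dialect.
writer = csv.writer(buffer, delimiter=",", quoting=csv.QUOTE_MINIMAL, lineterminator="\n")

# Values do not need to be strings; the writer converts them.
writer.writerow(["100101", "kupwara", "425638", "wara", 57])
# Made-up row: a field containing the delimiter is quoted automatically.
writer.writerow(["000000", "name, with comma", "000001", "name", 25])

print(buffer.getvalue())
```

Hand-built lines joined with "," would silently produce a broken row for a name like the second one, which is the main practical reason to let the csv module do the formatting.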

Hope this helps!

import pandas as pd
from datetime import datetime
from functools import lru_cache
import csv

df = pd.read_excel("C:\\Users\\Desktop\\awcProjectMaster.xlsx")
df1 = pd.read_excel("C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.xlsx")

# Open the log file once, up front; newline="" is what the csv module expects.
log_file = open(r"C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.txt", "a", newline="")
log_writer = csv.writer(log_file)


# Note: lru_cache() keeps only the 128 most recent results by default;
# pass a larger maxsize if the same pairs recur further apart.
@lru_cache()
def compare_vname_prjname(vName, prjName):
    vLen = len(vName)
    prjLen = len(prjName)
    if vLen > prjLen:
        if vName.find(prjName) != -1:
            return (prjLen / vLen) * 100
    elif prjLen >= vLen:
        if prjName.find(vName) != -1:
            return (vLen / prjLen) * 100
    return None


def compare(prjCode, prjName, stCode, stName, dCode, dName, sdCode, sdName, vCode, vName):
    # help the cache decorator out by halving the number of possible pairs;
    # sort into separate variables so prjName/vName keep their meaning below
    first, second = sorted([vName, prjName])
    percent_match = compare_vname_prjname(first, second)
    if percent_match is None:  # No match
        return False
    log_writer.writerow(
        [
            prjCode,
            prjName,
            vCode,
            vName,
            round(percent_match),
            stCode,
            stName,
            dCode,
            dName + sdCode,  # concatenated into one field, as in the question's code
            sdName,
        ]
    )
    return True


for idx, row in df.iterrows():
    for idxv, r in df1.iterrows():
        compare(
            str(row["ProjectCode"]),
            row["ProjectName"].lower(),
            str(r["StateCensusCode"]),
            r["StateName"],
            str(r["DistrictCode"]),
            r["DistrictName"],
            str(r["SubDistrictCode"]),
            r["SubDistrictNameInEnglish"],
            str(r["VillageCode"]),
            r["VillageNameInEnglish"].lower(),
        )

log_file.close()  # flush and release the log file once all pairs are done
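One possible further step, which is not part of the answer above: iterrows() builds a pandas Series for every row, and the inner loop repeats that work for the whole village table once per project. Pulling the needed columns out into plain Python lists once and looping over those avoids this overhead. The sketch below uses small in-memory lists (the sample rows from the question) in place of the Excel data, and match_names is a hypothetical helper.

```python
def match_names(projects, villages):
    """Yield (project_code, project_name, village_code, village_name, percent)
    for every pair where one name contains the other.

    Both arguments are lists of (code, lowercased_name) tuples.
    """
    for prj_code, prj_name in projects:
        for v_code, v_name in villages:
            # Same rule as the original code: the shorter name must be a
            # substring of the longer one.
            if len(prj_name) <= len(v_name):
                short, long_ = prj_name, v_name
            else:
                short, long_ = v_name, prj_name
            if long_ and short in long_:
                yield prj_code, prj_name, v_code, v_name, round(len(short) / len(long_) * 100)


# Sample rows from the question; with real data these lists would be
# built once from the DataFrame columns.
projects = [("100101", "kupwara"), ("100102", "kalaroos"), ("100103", "tangdar")]
villages = [("425638", "wara"), ("783651", "tangdur"), ("986321", "kalaroo")]

for row in match_names(projects, villages):
    print(row)
```

Because this is a generator, the matching rows can be handed straight to csv.writer's writerows() without holding them all in memory. The number of comparisons is unchanged, so this only trims the per-pair cost.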

Regarding "python - How do I compare two strings in large pandas DataFrames (Python 3.x)?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55809560/
