gpt4 book ai didi

python - 如何使用多个条件和模糊逻辑来匹配 pandas 数据框中的条目?

转载 作者:太空宇宙 更新时间:2023-11-03 20:26:18 24 4
gpt4 key购买 nike

感谢您的帮助。我相信这是一个常见问题,但我无法找到解决这种特殊形式的解决方案。我是一名新程序员,非常感谢任何帮助。

我有两组有关医疗保健公司的数据。 df1 中的数据很困惑并且包含空值,而 df2 中的数据则完整得多。

我需要匹配df1df2中的公司,确定是否存在匹配,如果不是直接匹配,则确定匹配的接近程度。两组数据都涉及数万家公司,并且每天都会更改/更新,因此我正在尝试构建可扩展的东西

这是我迄今为止尝试过的可重现的程序:

import pandas as pd
from fuzzywuzzy import process

data1 = [['1001', 'Lutheran Family Hospital', 'Omaha', 'NE'],
['1020', 'Lutheran Family Hospital', 'Dallas', 'TX'],
['1021', 'Lutheran Regional Family Hospital', 'Des Plaines', 'IL'],
['1002', 'Independent Health', 'Fairbanks', 'AK'],
['1003', 'Lucky You Community Clinic', '', ''],
['1004', 'Belmont General Hospital', 'Belmont', 'CA'],
['1005', 'Louisiana Chiro', 'Lafayette', 'LA'],
['1006', 'Steven, Even', 'Chicago', 'IL'],
['1007', 'Kind Kare 4 Kids', 'New Mexico', 'New Mexico'],
['1008', 'Independence Mem', '', ''],
['1009', 'Gerald Griffin Health', 'Missoula', 'Montana'],
['1010', 'INTERNAL MED', 'CHARLESTON', 'SC'],
['1011', 'Belmont Hospital', '', ''],
['1012', 'Belmont Gnrl', 'Belmont', 'CA'],
['1013', 'St Mary Rehab', '', ''],
['1014', 'Saint Mary Med Center', 'Los Angeles', 'California'],
['1025', "St. Mary's Of Lourdes Regional Medical Center", 'Lincoln', 'NE'],
['1015', 'Bryan Bennington, MD', 'Huntsville', 'AL']]

data2 = [['1', 'Lutheran General Hospital', 'Fort Wayne', 'IN'],
['2', 'Lutheran Family Hospital', 'Omaha', 'NE'],
['3', 'Independence Memorial Health', 'Fairbanks', 'AK'],
['4', 'Lucky-You Community Clinic', 'New York', 'NY'],
['5', 'Belmont General Hospital', 'Belmont', 'CA'],
['6', 'Lafayette Joints R Us (DBA Louisiana Best Chiropractic)', 'Lafayette', 'LA'],
['7', 'Even Steven, MD', 'Chicago', 'IL'],
['8', 'Kind Kare 4 Kids, LLC Inc (FKA The Kindest Care)', 'Albequerque', 'NM'],
['9', 'The Best Doctor Group', 'Philadelphia', 'PA'],
['10', 'Internal Medical Group, PLLC', 'Charleston', 'SC'],
['11', "Saint Mary's Holy Name Rehabilitation", 'Lexington', 'KY'],
['12', 'St. Mary Regional Medical Center', 'Los Angeles', 'CA'],
['13', 'Advanced Outpatient Surgical Center', 'Seattle', 'WA']]

df1 = pd.DataFrame(data1, columns=['ID', 'Org_Name', 'City', 'State'])
df2 = pd.DataFrame(data2, columns=['ID', 'Org_Name', 'City', 'State'])

i = 0
scorethreshold = 80
df1["fuzzy"] = 0
for x in df1.Org_Name:
noun,score,record = process.extractOne(x,df2.Org_Name)
if score > scorethreshold:
df1.loc[i,'fuzzy'] = noun
else:
df1.loc[i,'fuzzy'] = None
i = i + 1

上面的结果如下:

+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+
| | ID | Org_Name | City | State | fuzzy |
+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+
| 0 | 1001 | Lutheran Family Hospital | Omaha | NE | Lutheran Family Hospital |
| 1 | 1020 | Lutheran Family Hospital | Dallas | TX | Lutheran Family Hospital |
| 2 | 1021 | Lutheran Regional Family Hospital | Des Plaines | IL | Lutheran Family Hospital |
| 3 | 1002 | Independent Health | Fairbanks | AK | Independence Memorial Health |
| 4 | 1003 | Lucky You Community Clinic | | | Lucky-You Community Clinic |
| 5 | 1004 | Belmont General Hospital | Belmont | CA | Belmont General Hospital |
| 6 | 1005 | Louisiana Chiro | Lafayette | LA | Lafayette Joints R Us (DBA Louisiana Best Chiropractic) |
| 7 | 1006 | Steven, Even | Chicago | IL | Even Steven, MD |
| 8 | 1007 | Kind Kare 4 Kids | New Mexico | New Mexico | Kind Kare 4 Kids, LLC Inc (FKA The Kindest Care) |
| 9 | 1008 | Independence Mem | | | Independence Memorial Health |
| 10 | 1009 | Gerald Griffin Health | Missoula | Montana | |
| 11 | 1010 | INTERNAL MED | CHARLESTON | SC | Internal Medical Group, PLLC |
| 12 | 1011 | Belmont Hospital | | | Lutheran General Hospital |
| 13 | 1012 | Belmont Gnrl | Belmont | CA | Belmont General Hospital |
| 14 | 1013 | St Mary Rehab | | | Saint Mary's Holy Name Rehabilitation |
| 15 | 1014 | Saint Mary Med Center | Los Angeles | California | Saint Mary's Holy Name Rehabilitation |
| 16 | 1025 | St. Mary's Of Lourdes Regional Medical Center | Lincoln | NE | St. Mary Regional Medical Center |
| 17 | 1015 | Bryan Bennington, MD | Huntsville | AL | |
+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+

但是,我正在尝试创建一些东西,通过它我可以确定不仅公司名称匹配,而且城市和州也匹配,以及所有这些匹配的紧密程度。我正在尝试创建一个更像这样的输出,其中 Fuzzy_ID 指的是匹配条目的索引位置,而 Matched? 指的是 bool 判断:

+---+------+-----------------------------------+-------------+-------+----------+------------+----------+
| | ID | Org_Name | City | State | Fuzzy_ID | Score | Matched? |
+---+------+-----------------------------------+-------------+-------+----------+------------+----------+
| 0 | 1001 | Lutheran Family Hospital | Omaha | NE | 2 | 100 | YES |
| 1 | 1020 | Lutheran Family Hospital | Dallas | TX | 2 | some_score | NO |
| 2 | 1021 | Lutheran Regional Family Hospital | Des Plaines | IL | 2 | some_score | NO |
| 3 | 1002 | Independent Health | Fairbanks | AK | 3 | some_score | YES |
| 4 | 1003 | Lucky You Community Clinic | | | 4 | some_score | YES |
+---+------+-----------------------------------+-------------+-------+----------+------------+----------+

如何实现这一点?存在哪些方法更适合需要完成的任务?非常感谢您提供的任何帮助。

最佳答案

这个任务相当困难,涉及很多步骤,但至少我试图提出一些一般原则。

从整理状态列开始。如果某个地方有州的全名,请将其替换为州代码。

也许您还应该花一些时间来澄清df1中的“无状态”情况,作为清理数据的另一个步骤。

然后,对于 df1 中的每个,尝试在 df2 中查找最佳匹配行。为此,请使用以下过程:

  1. 使用process.extract,在df2中按名称查找最佳匹配池,与当前,假设limitscore_cutoff的一些值。如果包含状态,则仅 checkin df2来自该状态的行。将找到的每个匹配项的匹配率保存为 name_ratio

  2. 对于上述池中的每个项目,计算城市列上的WRatio,将其另存为city_ratio

  3. 使用一些聚合公式来计算每场比赛的total_ratio来自name_ratiocity_ratio。我也不确定这个公式应该如何。

  4. 以最大total_ratio进行匹配,但如果这个(最佳)比率低于某个total_ratio_cutoff,假设当前没有匹配项

当然,您仍然需要尝试特定的值参数并查看其值的变化如何影响最终结果。

关于python - 如何使用多个条件和模糊逻辑来匹配 pandas 数据框中的条目?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/57793621/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com