gpt4 book ai didi

python - 多次匹配后合并 Pandas 数据框

转载 作者:太空宇宙 更新时间:2023-11-04 04:37:54 26 4
gpt4 key购买 nike

我有 2 个数据框,如果找到匹配项(在 sourcedestination 中的列之间),我会在其中找到基于列 (tld) 的常见匹配项) 我将列 (uuid) 的值从源复制到 destination 数据框。我还检查不同的列是否匹配。 (company_name) 然后提取 uuid

现在我需要比较不同的列 (similar_companies) 并提取 uuid

数据框 1:来源

   uuid           website company_name           tld
0 1a www.facebook.com facebook facebook.com
1 2b www.yahoo.com yahoo inc yahoo.com
2 3c www.google.com Google google.com
3 4d www.cisco.com Cisco cisco.com

数据框 2:目的地

  id           website  company_name           tld  match uuid
0 a www.facebook.com facebook facebook.com False NaN
1 b www.y.com Yahoo Inc y.com False NaN
2 c www.g.com Google g.com False NaN
3 d www.g.com Google Inc g.com False NaN
4 e www.facebook.com Facebook Inc facebook.com False NaN

期望的输出:

id           website  company_name           tld  match similar_companies
0 a www.facebook.com facebook facebook.com True Facebook
1 b www.y.com Yahoo Inc y.com False None
2 c www.g.com Google g.com True None
3 d www.g.com Google Inc g.com False None
4 e www.facebook.com Facebook facebook.com True facebook
5 f www.face.uk Facebook Inc face.uk True facebook

uuid
0 1a
1 NaN
2 3c
3 NaN
4 1a
5 1a

当前代码:

# Find if TLD is the same.
match_tld = destination.tld.isin(source.tld)
# Find if Company name is the same.
match_company_name = destination.company_name.isin(
source.company_name)
# Find similar source.
destination[
_SIMILAR_COMPANIES] = destination.company_name.apply(
_FindSimilarCompanies, args=(destination,))
# Find if Company name is the same from similar source.
match_similar_companies = destination.similar_companies.isin(
source.company_name)
# Update match column if TLD or company_name matches.
destination['match'] = match_tld | match_company_name | match_similar_companies
# Extract UUID for TLD matches.
merge_tld = destination.merge(
source[['tld', 'uuid']], on='tld', how='left')
# Extract UUID for company name matches.
destination = destination.merge(
source[['company_name', 'uuid']], on='company_name', how='left')
# I insert new line here!!!
# Combine dataframes.
destination['uuid'] = destination['uuid'].combine_first(merge_tld['uuid'])
logging.info(source)
logging.info(destination)

上面的代码适用于 2 列,但是当我尝试合并新列时,我得到一个 KeyError:(我在插入新代码的地方添加了注释)

destination = destination.merge(
source[['company_name', 'uuid']], on='similar_companies', how='left')

错误:

KeyError: 'similar_companies'

最佳答案

我认为问题是 source 中没有 similar_companies 列,因此有必要 rename:

#for sample data column
_SIMILAR_COMPANIES = 'similar_companies'
destination[_SIMILAR_COMPANIES] = destination.company_name.str.extract('([fF]acebook)')

destination1 = destination.merge(
source[['company_name', 'uuid']], on='company_name', how='left')

destination2 = (destination.merge(
source[['company_name', 'uuid']].rename(columns={'company_name':'similar_companies'}),
on='similar_companies', how='left'))
# Combine dataframes.
merge_tld['uuid'] = (merge_tld['uuid'].combine_first(destination1['uuid'])
.combine_first(destination2['uuid']))
print (merge_tld)
id website company_name tld match similar_companies \
0 a www.facebook.com facebook facebook.com True facebook
1 b www.y.com Yahoo Inc y.com False NaN
2 c www.g.com Google g.com True NaN
3 d www.g.com Google Inc g.com False NaN
4 e www.facebook.com Facebook Inc facebook.com True Facebook

uuid
0 1a
1 NaN
2 3c
3 NaN
4 1a

关于python - 多次匹配后合并 Pandas 数据框,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/51259144/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com