
python - Match column names stored in another DataFrame and replace them with their IDs

Reposted. Author: 行者123. Updated: 2023-12-04 09:01:13

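I have a master DataFrame named Master that contains the IDs of all questions.
I also have several datasets that use these questions as column headers, and I want to replace those headers with their IDs.
The master table looks like this: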

Question               ID
gender                 1
sex                    1
what is your gender    1
sexual orientation     1
marital status         2
occupation             3
whats you job          3
df1 looks like this:

gender    marital status    occupation
Male      Single            Doctor
Male      Divorced          Engineer
Expected output:

1         2                 3
Male      Single            Doctor
Male      Divorced          Engineer
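In addition, if df1 contains any new variable that has no ID in the master table, it should be given a new ID, and the variable name and ID should be added to the master table.
For example, df2 looks like this: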
gender    marital status    country
Male      Single            India
Male      Divorced          UK
Desired df2:

1         2                 4
Male      Single            India
Male      Divorced          UK
The updated master table would then be:

Question               ID
gender                 1
sex                    1
what is your gender    1
sexual orientation     1
marital status         2
occupation             3
whats you job          3
country                4

Best answer

Use DataFrame.rename with a Series built from the other DataFrame to set the new column names:

# df is the master table; the Series maps each Question to its ID
df2 = df1.rename(columns=df.set_index('Question')['ID'])
print(df2)
      1         2         3
0  Male    Single    Doctor
1  Male  Divorced  Engineer
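For readers who want to run this end to end, here is a minimal self-contained sketch using the sample data from the question. The variable names `master`, `mapping`, and `renamed` are chosen for illustration and are not from the original answer.

```python
import pandas as pd

# Master table from the question: several questions can share one ID.
master = pd.DataFrame({
    "Question": ["gender", "sex", "what is your gender", "sexual orientation",
                 "marital status", "occupation", "whats you job"],
    "ID": [1, 1, 1, 1, 2, 3, 3],
})

df1 = pd.DataFrame({
    "gender": ["Male", "Male"],
    "marital status": ["Single", "Divorced"],
    "occupation": ["Doctor", "Engineer"],
})

# Series indexed by question text; rename() accepts it like a dict.
mapping = master.set_index("Question")["ID"]
renamed = df1.rename(columns=mapping)
print(renamed.columns.tolist())
```

Column names that are missing from the mapping are left unchanged by rename, which is why new variables need to be registered in the master table first.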
Edit: the Question column of df contains duplicate values, so the Question values need to be made unique first. One possible solution is to remove the duplicates with DataFrame.drop_duplicates. Here is sample data to show how it works:

print(df)
              Question  ID
0               gender  10  <- duplicate, ID changed for the test
1               gender  15  <- duplicate, ID changed for the test
2  what is your gender   1
3   sexual orientation   1
4       marital status   2
5           occupation   3
6        whats you job   3
You can check your real data for duplicates:

print(df[df.duplicated('Question', keep=False)])
  Question  ID
0   gender  10
1   gender  15
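If silently picking one of the conflicting IDs is not acceptable, the check can be turned into a guard. The sketch below is only one possible approach: `build_mapping` is a hypothetical helper, not part of pandas or of the original answer, and it assumes that exact repeats of the same (Question, ID) pair are harmless.

```python
import pandas as pd


def build_mapping(master: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: Question -> ID Series, failing on conflicts."""
    # Exact repeats of the same (Question, ID) pair carry no conflict.
    deduped = master.drop_duplicates(["Question", "ID"])
    # Whatever is still duplicated on Question has more than one ID.
    conflicts = deduped[deduped.duplicated("Question", keep=False)]
    if not conflicts.empty:
        raise ValueError(f"Questions with more than one ID:\n{conflicts}")
    return deduped.set_index("Question")["ID"]


df = pd.DataFrame({
    "Question": ["gender", "gender", "marital status"],
    "ID": [10, 15, 2],
})

try:
    build_mapping(df)
except ValueError as exc:
    print(exc)
```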

Remove duplicates and keep the first row, here ID=10:

print(df.drop_duplicates('Question').set_index('Question')['ID'])
Question
gender                 10
what is your gender     1
sexual orientation      1
marital status          2
occupation              3
whats you job           3
Name: ID, dtype: int64

df21 = df1.rename(columns=df.drop_duplicates('Question').set_index('Question')['ID'])
print(df21)
     10         2         3
0  Male    Single    Doctor
1  Male  Divorced  Engineer
Remove duplicates and keep the last row, here ID=15:

print(df.drop_duplicates('Question', keep='last').set_index('Question')['ID'])
Question
gender                 15
what is your gender     1
sexual orientation      1
marital status          2
occupation              3
whats you job           3
Name: ID, dtype: int64

df22 = df1.rename(columns=df.drop_duplicates('Question', keep='last').set_index('Question')['ID'])
print(df22)
     15         2         3
0  Male    Single    Doctor
1  Male  Divorced  Engineer


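Converting the Series to a dictionary also works. With duplicate keys the last value wins, here ID=15:

print(df.set_index('Question')['ID'].to_dict())
{'gender': 15, 'what is your gender': 1, 'sexual orientation': 1, 'marital status': 2, 'occupation': 3, 'whats you job': 3}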



df22 = df1.rename(columns=df.set_index('Question')['ID'].to_dict())
print(df22)
     15         2         3
0  Male    Single    Doctor
1  Male  Divorced  Engineer
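The dictionary route behaves like keep='last' because a Python dict can hold each key only once, so later rows overwrite earlier ones. A small check of that equivalence, on a shortened sample table assumed for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Question": ["gender", "gender", "marital status"],
    "ID": [10, 15, 2],
})

# to_dict() walks the rows in order, so the later 'gender' row overwrites the earlier one.
as_dict = df.set_index("Question")["ID"].to_dict()

# Explicitly keeping the last duplicate gives the same mapping.
keep_last = (
    df.drop_duplicates("Question", keep="last")
      .set_index("Question")["ID"]
      .to_dict()
)

print(as_dict)
print(as_dict == keep_last)
```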
EDIT1: If some column names do not exist in the master DataFrame yet and need to be appended first, use:

print(df)
              Question  ID
0               gender   1
1                  sex   1
2  what is your gender   1
3   sexual orientation   1
4       marital status   2
5           occupation   3
6        whats you job   3

print(df1)
  gender marital status country  code1  code2
0   Male         Single   India      4      7
1   Male       Divorced      UK      3      5
Get all columns that do not exist in df['Question']:

cols = df1.columns.difference(df['Question'].tolist(), sort=False)
print(cols)
Index(['country', 'code1', 'code2'], dtype='object')
Assign new IDs to them, continuing from the current maximum ID:

import numpy as np
import pandas as pd

df3 = pd.DataFrame({'Question': cols,
                    'ID': np.arange(df['ID'].max() + 1, len(cols) + df['ID'].max() + 1)})
print(df3)
  Question  ID
0  country   4
1    code1   5
2    code2   6
Append them to the original master DataFrame:

df = pd.concat([df, df3], ignore_index=True)
print(df)
              Question  ID
0               gender   1
1                  sex   1
2  what is your gender   1
3   sexual orientation   1
4       marital status   2
5           occupation   3
6        whats you job   3
7              country   4
8                code1   5
9                code2   6
Finally, apply the original solution:

df2 = df1.rename(columns=df.set_index('Question')['ID'])
print(df2)
      1         2      4  5  6
0  Male    Single  India  4  7
1  Male  Divorced     UK  3  5
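The EDIT1 steps can be wrapped in one function so the same master table is reused across several datasets. This is a sketch under stated assumptions, not the answer's own code: `rename_with_master` is a hypothetical name, the master table is assumed to be non-empty with integer IDs, and when a question appears more than once the first ID is kept.

```python
import numpy as np
import pandas as pd


def rename_with_master(data, master):
    """Hypothetical helper: return (renamed data, updated master table)."""
    known = set(master["Question"])
    # Columns the master table has not seen yet, in their original order.
    new_cols = [col for col in data.columns if col not in known]
    if new_cols:
        start = master["ID"].max() + 1
        additions = pd.DataFrame({
            "Question": new_cols,
            "ID": np.arange(start, start + len(new_cols)),
        })
        master = pd.concat([master, additions], ignore_index=True)
    # Assumption: on duplicate questions, keep the first ID.
    mapping = master.drop_duplicates("Question").set_index("Question")["ID"]
    return data.rename(columns=mapping), master


master = pd.DataFrame({
    "Question": ["gender", "sex", "what is your gender", "sexual orientation",
                 "marital status", "occupation", "whats you job"],
    "ID": [1, 1, 1, 1, 2, 3, 3],
})
df2 = pd.DataFrame({
    "gender": ["Male", "Male"],
    "marital status": ["Single", "Divorced"],
    "country": ["India", "UK"],
})

renamed, master = rename_with_master(df2, master)
print(renamed.columns.tolist())
print(master.tail(1))
```

Passing the returned master table into the next call keeps the IDs consistent from one dataset to the next.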

Regarding "python - Match column names stored in another DataFrame and replace them with their IDs", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63555381/
