python - 3 dataframes and 3 rules in play to insert data into another dataframe - no common columns - big data

Reposted. Author: 行者123. Updated: 2023-12-01 00:51:18

I have 3 different dataframes, which can be generated with the code below.

import numpy as np
import pandas as pd

data_file = pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclosed'],'ethnicity': ['Chinese','Indian','European'],'Marital_status': ['Single','Married','Widowed'],'Smoke_status':['Yes','No','No']})
map_file = pd.DataFrame({'gender': ['1.Male','2. Female','3. Not disclosed'],'ethnicity': ['1.Chinese','2. Indian','3.European'],
                         'Marital_status':['1.Single','2. Married','3 Widowed'],'Smoke_status':['1. Yes','2. No',np.nan]})
hash_file = pd.DataFrame({'keys':['gender','ethnicity','Marital_status','Smoke_status','Yes','No','Male','Female','Single','Married','Widowed','Chinese','Indian','European'],'values':[21,22,23,24,125,126,127,128,129,130,131,141,142,0]})

The empty dataframe that should be populated with the output can be generated with the code below.

columns = ['person_id','obsid','valuenum','valuestring','valueid']
obs = pd.DataFrame(columns=columns)

What I want to achieve is shown in the table below, which gives the rules and instructions for how the data should be populated.

[Image not available: table of rules for populating the output dataframe]

I did try a for-loop approach, but once I unstacked the data I lost the column names, and I am not sure how to proceed from there.

a = 1
for i in range(len(data_file)):
    df_temp = data_file[i:a]
    a = a + 1
    df_temp = df_temp.unstack()
    df_temp = df_temp.to_frame().reset_index()

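How can I get my output dataframe populated as shown below? (PS: I have shown only person_id = 1 and 4 columns.) In reality I have more than 25k persons, each with 400 columns, so any approach that is elegant and efficient, unlike my for loop, would be helpful.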

[Image not available: expected output for person_id = 1]

Best answer

After a discussion in chat, and after removing duplicate data, you can use:

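# Note: 'VARIABLE', 'concept_id' and 'studyid' appear to be column names from the asker's
# real data; in the sample frames of the question they correspond to 'keys', 'values' and 'person_id'.
s = hash_file.set_index('VARIABLE')['concept_id']
df1 = map_file.melt().dropna(subset=['value'])
# raw string so that \d and \. are passed to the regex engine unchanged
df1[['valueid','valuestring']] = df1.pop('value').str.extract(r'(\d+)\.(.+)')
df1['valuestring'] = df1['valuestring'].str.strip()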

columns = ['studyid','obsid','valuenum','valuestring','valueid']
obs = data_file.melt('studyid', value_name='valuestring').sort_values('studyid')

# merge on two columns: variable and valuestring
obs = (obs.merge(df1, on=['variable','valuestring'], how='left')
          .rename(columns={'valueid':'valuenum'}))
obs['obsid'] = obs['variable'].map(s)
obs['valueid'] = obs['valuestring'].map(s)

# map by only one column: variable
s1 = df1.drop_duplicates('variable').set_index('variable')['valueid']
obs['valuenum_new'] = obs['variable'].map(s1)

obs = obs.reindex(columns + ['valuenum_new'], axis=1)
print(obs)

# compare the number of non-missing rows
print(len(obs.dropna(subset=['valuenum'])))
print(len(obs.dropna(subset=['valuenum_new'])))
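
For readers who want to try the idea against the sample frames from the question, the following is a minimal, self-contained sketch. It is not the answerer's exact code: it assumes the sample column names (`keys`, `values`, `person_id`) in place of `VARIABLE`, `concept_id` and `studyid`, and the names `lookup`, `codes` and `parts` are introduced here purely for illustration. It follows the same steps: melt both frames to long form, split the `N. label` strings with a regex, merge on `variable` and `valuestring`, and map the ids from `hash_file`.

```python
import numpy as np
import pandas as pd

# Sample frames copied from the question.
data_file = pd.DataFrame({
    'person_id': [1, 2, 3],
    'gender': ['Male', 'Female', 'Not disclosed'],
    'ethnicity': ['Chinese', 'Indian', 'European'],
    'Marital_status': ['Single', 'Married', 'Widowed'],
    'Smoke_status': ['Yes', 'No', 'No'],
})
map_file = pd.DataFrame({
    'gender': ['1.Male', '2. Female', '3. Not disclosed'],
    'ethnicity': ['1.Chinese', '2. Indian', '3.European'],
    'Marital_status': ['1.Single', '2. Married', '3 Widowed'],
    'Smoke_status': ['1. Yes', '2. No', np.nan],
})
hash_file = pd.DataFrame({
    'keys': ['gender', 'ethnicity', 'Marital_status', 'Smoke_status', 'Yes', 'No',
             'Male', 'Female', 'Single', 'Married', 'Widowed',
             'Chinese', 'Indian', 'European'],
    'values': [21, 22, 23, 24, 125, 126, 127, 128, 129, 130, 131, 141, 142, 0],
})

# Lookup Series: key -> id (assumes the keys are unique, as in the sample).
lookup = hash_file.set_index('keys')['values']

# Long form of the map: one row per (variable, "N. label") pair.
codes = map_file.melt().dropna(subset=['value'])

# Split "N. label" into the number and the label.
# Entries without a dot, such as '3 Widowed', do not match and are dropped.
parts = codes['value'].str.extract(r'(\d+)\.(.+)')
codes = (codes.assign(valuenum=parts[0], valuestring=parts[1].str.strip())
              .drop(columns='value')
              .dropna(subset=['valuestring']))

# Long form of the data: the original column names survive in 'variable'.
obs = (data_file.melt('person_id', value_name='valuestring')
                .sort_values('person_id', kind='stable')
                .merge(codes, on=['variable', 'valuestring'], how='left'))

obs['obsid'] = obs['variable'].map(lookup)       # id of the column name
obs['valueid'] = obs['valuestring'].map(lookup)  # id of the cell value

obs = obs[['person_id', 'obsid', 'valuenum', 'valuestring', 'valueid']]
print(obs)
```

In this sketch `valuenum` stays a string column, as it does in the answer, and values with no match (for example `Widowed` in the map, or `Not disclosed` in `hash_file`) are left as NaN rather than guessed.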

Regarding "python - 3 dataframes and 3 rules in play to insert data into another dataframe - no common columns - big data", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56556191/
