
python - Assess whether duplicate samples have differing field data and copy data across?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:02:43

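I want to assess whether a sample and its duplicate (ending in _2) have data entered in their Age, Family History and Diagnosis fields. If a sample has entries and its duplicate does not (all "-" entries), I want to copy the entries from the sample into the duplicate's fields. The assessment should also work in the other direction: if the duplicate has entries and the sample does not, copy them into the sample's fields.

Basically, I want input_df to look like desired_df (shown below).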

import pandas as pd

input_df = pd.DataFrame(columns=['Sample', 'Date', 'Age', 'Family History', 'Diagnosis'],
                        data=[
                            ['HG_12_34', '12/3/12', '23', 'Y', 'Jerusalem Syndrome'],
                            ['LG_3_45', '3/4/12', '45', 'N', 'Paris Syndrome'],
                            ['HG_12_34_2', '4/5/13', '-', '-', '-'],
                            ['KD_89_9', '8/9/12', '-', '-', '-'],
                            ['KD_98_9_2', '6/1/13', '54', 'Y', 'Chronic Hiccups'],
                            ['LG_3_45_2', '4/4/10', '59', 'N', 'Dangerous Sneezing Syndrome']
                        ])

desired_df = pd.DataFrame(columns=['Sample', 'Date', 'Age', 'Family History', 'Diagnosis'],
                          data=[
                              ['HG_12_34', '12/3/12', '23', 'Y', 'Jerusalem Syndrome'],
                              ['LG_3_45', '3/4/12', '45', 'N', 'Paris Syndrome'],
                              ['HG_12_34_2', '4/5/13', '23', 'Y', 'Jerusalem Syndrome'],
                              ['KD_89_9', '8/9/12', '54', 'Y', 'Chronic Hiccups'],
                              ['KD_98_9_2', '6/1/13', '54', 'Y', 'Chronic Hiccups'],
                              ['LG_3_45_2', '4/4/10', '59', 'N', 'Dangerous Sneezing Syndrome']
                          ])

My really inefficient and incomplete attempt at this is detailed below:

def testing(duplicate, df):
    ''' Check for differences in phenotype data between duplicates and
    return the original sample name if the duplicate has no phenotype data.
    '''
    # only assess the duplicate
    if duplicate['Sample'][:-2] in list(df['Sample'].unique()):

        # get the original sample's row
        sam = df[df['Sample'] == duplicate['Sample'][:-2]]

        # store the Age, Family History and Diagnosis in a list for each sample
        # (positions 2 to 4, so the slice must end at 5)
        sam_pheno = sam.iloc[0].iloc[2:5].fillna("-").tolist()
        duplicate_pheno = duplicate.iloc[2:5].fillna("-").tolist()

        # if the duplicate sample has nothing in these fields then return the
        # original sample name
        if len(set(duplicate_pheno)) == 1 and list(set(duplicate_pheno))[0] == "-" \
                and len(set(sam_pheno)) > 1:
            return duplicate['Sample'][:-2]

# This creates a column called Pheno which holds the name of the sample that
# contains the phenotype data the pair should share. The intent is to copy the
# phenotype data over from the sample named in the Pheno field. However, I have
# no idea how to do this.
input_df['Pheno'] = input_df.apply(lambda x: testing(x, input_df), axis=1)
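
The copy step that the comment above asks about could be sketched roughly as follows. This is only an illustration: `copy_from_pheno` and the tiny `demo` frame are hypothetical names, and the sketch assumes that every value in the Sample column is unique and that the Pheno column holds either a sample name or None.

```python
import pandas as pd

def copy_from_pheno(df, cols, pheno_col="Pheno"):
    """Copy `cols` from the row named in `pheno_col` into each row that names one.

    Hypothetical helper; assumes the 'Sample' values are unique.
    """
    out = df.copy()
    # Look-up table: sample name -> phenotype fields.
    source = out.set_index("Sample")[cols]
    # Rows for which a source sample name was recorded.
    needs = out[pheno_col].notna()
    if needs.any():
        # Fetch the source rows in matching order and overwrite positionally.
        out.loc[needs, cols] = source.loc[out.loc[needs, pheno_col]].to_numpy()
    return out

# Small made-up frame with the Pheno column already filled in.
demo = pd.DataFrame({
    "Sample": ["HG_12_34", "HG_12_34_2"],
    "Age": ["23", "-"],
    "Family History": ["Y", "-"],
    "Diagnosis": ["Jerusalem Syndrome", "-"],
    "Pheno": [None, "HG_12_34"],
})
print(copy_from_pheno(demo, ["Age", "Family History", "Diagnosis"]))
```

Note that this only covers the direction handled by testing() (original into duplicate); the reverse direction would need a second pass.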

Best answer

You can use:

import numpy as np

# replace all '-' values with NaN
input_df = input_df.replace('-', np.nan)
# rows whose Sample ends with '_2' and is longer than 7 characters
mask = (input_df['Sample'].str.endswith('_2')) & (input_df['Sample'].str.len() > 7)
# create a new column 'same' holding Sample without its last 2 chars (_2)
# (.loc is used here because .ix was removed in pandas 1.0)
input_df.loc[mask, 'same'] = input_df.loc[mask, 'Sample'].str[:-2]
# replace NaN in 'same' with the Sample column
input_df['same'] = input_df['same'].combine_first(input_df['Sample'])
# sort values so that each sample sits next to its duplicate
input_df = input_df.sort_values(['same', 'Family History'], ascending=False)
# replace NaN by forward filling
input_df[['Age', 'Family History', 'Diagnosis']] = \
    input_df[['Age', 'Family History', 'Diagnosis']].ffill()
# restore the original row order by sorting on the index
input_df.sort_index(inplace=True)
# remove column 'same'
input_df.drop('same', axis=1, inplace=True)

print(input_df)
       Sample     Date  Age  Family History                    Diagnosis
0    HG_12_34  12/3/12   23               Y           Jerusalem Syndrome
1     LG_3_45   3/4/12   45               N               Paris Syndrome
2  HG_12_34_2   4/5/13   23               Y           Jerusalem Syndrome
3     KD_89_9   8/9/12   54               Y              Chronic Hiccups
4   KD_98_9_2   6/1/13   54               Y              Chronic Hiccups
5   LG_3_45_2   4/4/10   59               N  Dangerous Sneezing Syndrome

print(desired_df)
       Sample     Date  Age  Family History                    Diagnosis
0    HG_12_34  12/3/12   23               Y           Jerusalem Syndrome
1     LG_3_45   3/4/12   45               N               Paris Syndrome
2  HG_12_34_2   4/5/13   23               Y           Jerusalem Syndrome
3     KD_89_9   8/9/12   54               Y              Chronic Hiccups
4   KD_98_9_2   6/1/13   54               Y              Chronic Hiccups
5   LG_3_45_2   4/4/10   59               N  Dangerous Sneezing Syndrome
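
One thing to be aware of: a plain ffill() after sorting does not know where one sample pair ends and the next begins, so a value can spill from one pair into a neighbouring one. In the sample data, 'KD_89_9' and 'KD_98_9_2' do not actually share a base name, and 'KD_89_9' is filled only because the two rows happen to be adjacent after sorting. A possible alternative, sketched below, fills strictly within each pair by grouping on the base name. `fill_within_pairs` and `COLS` are hypothetical names, and the sketch assumes that no original sample name ends in "_2" itself.

```python
import numpy as np
import pandas as pd

COLS = ["Age", "Family History", "Diagnosis"]

def fill_within_pairs(df, cols=COLS):
    """Fill '-' entries in `cols` from the other member of each sample pair."""
    out = df.copy()
    out[cols] = out[cols].replace("-", np.nan)
    # Base name shared by a sample and its duplicate: drop a trailing "_2".
    # Assumption: no original sample name ends in "_2" itself.
    base = out["Sample"].str.replace(r"_2$", "", regex=True)
    # Forward- then back-fill inside each group, so values never cross pairs.
    out[cols] = out.groupby(base)[cols].transform(lambda s: s.ffill().bfill())
    return out

input_df = pd.DataFrame(
    columns=["Sample", "Date", "Age", "Family History", "Diagnosis"],
    data=[
        ["HG_12_34", "12/3/12", "23", "Y", "Jerusalem Syndrome"],
        ["LG_3_45", "3/4/12", "45", "N", "Paris Syndrome"],
        ["HG_12_34_2", "4/5/13", "-", "-", "-"],
        ["KD_89_9", "8/9/12", "-", "-", "-"],
        ["KD_98_9_2", "6/1/13", "54", "Y", "Chronic Hiccups"],
        ["LG_3_45_2", "4/4/10", "59", "N", "Dangerous Sneezing Syndrome"],
    ],
)

result = fill_within_pairs(input_df)
print(result)
```

With this version 'HG_12_34_2' picks up the values of 'HG_12_34', while 'KD_89_9' stays empty (NaN) because no row shares its base name. If 'KD_89_9' / 'KD_98_9_2' is simply a typo in the sample data (an assumption on my part), correcting the name makes the pair fill as expected.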

Regarding "python - Assess whether duplicate samples have differing field data and copy data across?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/40174063/
