gpt4 book ai didi

python - 检查两个 df 是否相同模式并在 pandas 中使用 groupby

转载 作者:行者123 更新时间:2023-12-02 02:39:01 24 4
gpt4 key购买 nike

你好,我有一个 df1 文件,例如:

Acc_number
ACC1.1_CP_Sp1_1
ACC2.1_CP_Sp1_1
ACC3.1_CP_Sp1_1
ACC4.1_CP_Sp1_1

和另一个 df2,例如:

Cluster_nb SeqName
Cluster1 YP_009216714
Cluster1 YP_002051918
Cluster1 JZSA01005235.1:37071-37973(-):Sp1_1
Cluster1 NW_014464344.1:68901-69716(-):Sp2_3
Cluster1 YP_001956729
Cluster1 ACC1.1_CP_Sp1_1
Cluster1 YP_009213712
Cluster2 ACC2.1_CP_Sp1_1
Cluster2 NR_014464231.1:35866-36717(-):Sp1_1
Cluster2 NR_014464232.1:35889-36788(-):Sp1_1
Cluster2 YP_009213728
Cluster3 ACC3.1_CP_Sp1_1
Cluster3 NK_014464231.1:35772-38898(-):Sp1_2
Cluster3 NZ_014464232.1:3533-78787(+):Sp1_2
Cluster3 YP_009213723
Cluster3 YP_009213739

如果 groupby Cluster_nb 包含 Acc_number[i],我想检查 df1 中的每个 Acc_number在其 (+ or -):... 部分中还包含另一个具有相同扩展名的序列(Acc_number_CP_ 之后的部分) .

例如

for ACC1.1_CP_Sp1_1 as i

我看到了:

df=df2.loc[df2['SeqName']==i]
Cluster_number=df['Cluster_nb'].iloc[0]
df3=df2.loc[df2['Cluster_nb']==Cluster_number]
print(df3)

Cluster_nb SeqName
Cluster1 YP_009216714
Cluster1 YP_002051918
Cluster1 JZSA01005235.1:37071-37973(-):Sp1_1
Cluster1 NW_014464344.1:68901-69716(-):Sp2_3
Cluster1 YP_001956729

第 3 行中的序列 JZSA01005235.1:37071-37973(-):Sp1_1 在其末尾具有相同的 Sp1_1 模式。

所以这里的答案是肯定的,ACC1.1_CP_Sp1_1 与另一个具有相同结尾的序列位于同一簇中(但名称中带有 (-or +):)

for ACC3.1_CP_Sp1_1 as i

我看到了:

df=df2.loc[df2['SeqName']==i]
Cluster_number=df['Cluster_nb'].iloc[0]
df3=df2.loc[df2['Cluster_nb']==Cluster_number]
print(df3)

Cluster3 ACC3.1_CP_Sp1_1
Cluster3 NK_014464231.1:35772-38898(-):Sp1_2
Cluster3 NZ_014464232.1:3533-78787(+):Sp1_2
Cluster3 YP_009213723
Cluster3 YP_009213739

我看到在簇中没有其他序列具有与 ACC3.1_CP_Sp1_1 相同的结尾,所以答案是否定的。

结果应该总结在df3中:

Acc_number present cluster
ACC1.1_CP_Sp1_1 Yes Cluster1
ACC2.1_CP_Sp1_1 Yes Cluster2
ACC3.1_CP_Sp1_1 No NaN
ACC4.1_CP_Sp1_1 No NaN

非常感谢你的帮助

我试过了:

for CP in df1['Acc_number']:
df=df2.loc[df2['SeqName']==CP]
try:
Cluster_number=df['Cluster_nb'].iloc[0]
df3=df2.loc[df2['Cluster_nb']==Cluster_number]
for a in df3['SeqName']:
if '(+)' in a or '(-)' in a:
if re.sub('.*_CP_','',CP) in a:
new_df=new_df.append({"Cluster":Cluster_number,"Acc_nb":CP,"present":'yes'}, ignore_index=True)
print(CP,'yes')
except:
continue

最佳答案

我在代码中做了注释;概述是为每一行获取唯一标识符,合并数据框并仅保留您感兴趣的列:

  #create an 'ending' column 
#where u split off the ends after ':'
df1['ending'] = df1.loc[df1.SeqName.str.contains(':'),'SeqName']
df1['ending'] = df1['ending'].str.split(':').str[-1]
#get the cluster number and add to the ending column
#it will serve as a unique identifier for each row
df1['ending'] = df1.Cluster_nb.str[-1].str.cat(df1['ending'],sep='_')
#get rid of null and duplicates; keep only relevant columns
df1 = df1.dropna().drop('SeqName',axis=1).drop_duplicates('ending')

#create ending column here as well
df['ending'] = df['Acc_number'].str.extract(r'((?<=ACC)\d)')
#merge acc_number with the ending to serve as unique identifier
df['ending'] = df['ending'].str.cat(df['Acc_number'].str.extract(r'((?<=P_).*)'),sep='_')

#merge both dataframes
(df
.merge(df1,on='ending',how='left')
#keep only relevant columns
.filter(['Acc_number','Cluster_nb'])
#create present column
.assign(present = lambda x: np.where(x.Cluster_nb.isna(),'no','yes'))
.rename(columns={'Cluster_nb':'cluster'})
)

Acc_number cluster present
0 ACC1.1_CP_Sp1_1 Cluster1 yes
1 ACC2.1_CP_Sp1_1 Cluster2 yes
2 ACC3.1_CP_Sp1_1 NaN no
3 ACC4.1_CP_Sp1_1 NaN no

关于python - 检查两个 df 是否相同模式并在 pandas 中使用 groupby,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/60949091/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com