
python - Conditionally deleting rows in pandas does not work as expected

Reposted. Author: 行者123. Updated: 2023-12-01 03:33:49

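I have a DataFrame with a Sample column that contains duplicate samples (their names end in _2) and a same column that records which sample is the original. The New Category column holds a mutation classification, where Pathogenic/Likely Pathogenic is the most damaging and Likely Benign is the least damaging. A simplified version of my DataFrame is shown below.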

import pandas as pd

df = pd.DataFrame(columns=['Sample', 'same', 'New Category'],
                  data=[
                      ['HG_12_34', 'HG_12_34', 'Pathogenic/Likely Pathogenic'],
                      ['HG_12_34_2', 'HG_12_34', 'Likely Benign'],
                      ['KD_89_9', 'KD_89_9', 'Likely Benign'],
                      ['KD_98_9_2', 'KD_89_9', 'Likely Benign'],
                      ['LG_3_45', 'LG_3_45', 'Likely Benign'],
                      ['LG_3_45_2', 'LG_3_45', 'VUS']
                  ])

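I want to conditionally delete either a sample or its duplicate, depending on which of the two has the least damaging mutation in New Category. For example, if a sample is Likely Benign and its duplicate has a Pathogenic/Likely Pathogenic variant, I want to delete the sample's row and keep the duplicate.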

I tried passing the DataFrame to a function that returns a list of the indices of the rows to delete, and then dropping those rows.

def get_unwanted_duplicates_ix(df):

    # filter df for samples that have a duplicate
    same_only = df.groupby("same").filter(lambda x: len(x) > 1)

    list_index_to_delete = []

    for num in range(0, same_only.shape[0] - 1):

        # positional row access (.iloc replaces the removed DataFrame.irow)
        row1 = same_only.iloc[num]
        row2 = same_only.iloc[num + 1]
        index = list(same_only.index.values)[num]

        if row1['Sample'] + "_2" == row2['Sample'] or \
                row1['Sample'] == row2['Sample'] + "_2":

            if row1['New Category'] == row2['New Category']:
                list_index_to_delete.append(index + 1)

            elif row1['New Category'] == "Pathogenic/Likely Pathogenic" \
                    and row2['New Category'] != "Pathogenic/Likely Pathogenic":
                list_index_to_delete.append(index + 1)

            elif row2['New Category'] == "Pathogenic/Likely Pathogenic" \
                    and row1['New Category'] != "Pathogenic/Likely Pathogenic":
                list_index_to_delete.append(index)

            elif row1['New Category'] == "VUS" \
                    and row2['New Category'] != "VUS":
                list_index_to_delete.append(index + 1)

            elif row2['New Category'] == "VUS" \
                    and row1['New Category'] != "VUS":
                list_index_to_delete.append(index)

            elif row1['New Category'] == 'Likely Benign' \
                    and row2['New Category'] == 'Likely Benign':
                list_index_to_delete.append(index + 1)

            else:
                list_index_to_delete.append(index + 1)

    return list_index_to_delete

unwanted = get_unwanted_duplicates_ix(df)
df = df.drop(df.index[unwanted])

The function above is a mess and, unsurprisingly, does not work the way I hoped. A pointer in the right direction would be much appreciated.

Best answer

First, replace the mutation severity with integers (a higher value means more damaging).

df['New Category code'] = df['New Category'].replace(
{'Likely Benign': 1, 'VUS': 2, 'Pathogenic/Likely Pathogenic': 3})
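The encoding step can also be sketched with Series.map, which leaves any label missing from the dictionary as NaN so that it can be detected before grouping. This is only an illustrative variant: the helper name add_severity_code and the SEVERITY constant are hypothetical, and the sample frame is copied from the question.

```python
import pandas as pd

# Severity ranking from the answer; a higher value means more damaging.
SEVERITY = {'Likely Benign': 1, 'VUS': 2, 'Pathogenic/Likely Pathogenic': 3}


def add_severity_code(frame):
    """Return a copy of frame with an integer 'New Category code' column.

    Hypothetical helper: raises ValueError if a category has no ranking.
    """
    codes = frame['New Category'].map(SEVERITY)
    unknown = frame.loc[codes.isna(), 'New Category'].unique()
    if len(unknown) > 0:
        raise ValueError(f"unmapped categories: {list(unknown)}")
    return frame.assign(**{'New Category code': codes.astype(int)})


# Sample data copied from the question.
df = pd.DataFrame(columns=['Sample', 'same', 'New Category'],
                  data=[
                      ['HG_12_34', 'HG_12_34', 'Pathogenic/Likely Pathogenic'],
                      ['HG_12_34_2', 'HG_12_34', 'Likely Benign'],
                      ['KD_89_9', 'KD_89_9', 'Likely Benign'],
                      ['KD_98_9_2', 'KD_89_9', 'Likely Benign'],
                      ['LG_3_45', 'LG_3_45', 'Likely Benign'],
                      ['LG_3_45_2', 'LG_3_45', 'VUS'],
                  ])

coded = add_severity_code(df)
print(coded['New Category code'].tolist())  # [3, 1, 1, 1, 1, 2]
```

Failing loudly on an unmapped label is a design choice made here, not part of the original answer; with replace, an unknown label would stay in the column as a string.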

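The next command depends on whether you want to keep multiple rows that share the same severity. If you do, group by the same column and select the rows whose severity code equals the group maximum: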

df[df.groupby('same')['New Category code'].transform('max') == df['New Category code']]

      Sample      same                  New Category  New Category code
0   HG_12_34  HG_12_34  Pathogenic/Likely Pathogenic                  3
2    KD_89_9   KD_89_9                 Likely Benign                  1
3  KD_98_9_2   KD_89_9                 Likely Benign                  1
5  LG_3_45_2   LG_3_45                           VUS                  2
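For readers who want to run the tie-keeping variant end to end, here is a minimal self-contained sketch. It assumes the sample data from the question and computes the severity code as a separate Series instead of adding a column; the names severity, code, group_max and kept_with_ties are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(columns=['Sample', 'same', 'New Category'],
                  data=[
                      ['HG_12_34', 'HG_12_34', 'Pathogenic/Likely Pathogenic'],
                      ['HG_12_34_2', 'HG_12_34', 'Likely Benign'],
                      ['KD_89_9', 'KD_89_9', 'Likely Benign'],
                      ['KD_98_9_2', 'KD_89_9', 'Likely Benign'],
                      ['LG_3_45', 'LG_3_45', 'Likely Benign'],
                      ['LG_3_45_2', 'LG_3_45', 'VUS'],
                  ])

# Higher value means more damaging.
severity = {'Likely Benign': 1, 'VUS': 2, 'Pathogenic/Likely Pathogenic': 3}
code = df['New Category'].map(severity)

# transform('max') broadcasts each group's maximum back onto every row,
# so the comparison below yields a boolean mask aligned with df.
group_max = code.groupby(df['same']).transform('max')
kept_with_ties = df[code == group_max]

print(kept_with_ties.index.tolist())  # [0, 2, 3, 5]
```

Both KD_89_9 rows survive because they tie at the lowest severity, which matches the output shown above.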

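If you do not (exactly one row is always kept per group), sort the values by severity in ascending order and take the last row of each group (thanks to @JonClements for the idea):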

df.sort_values('New Category code').groupby('same').last()

             Sample                  New Category  New Category code
same
HG_12_34   HG_12_34  Pathogenic/Likely Pathogenic                  3
KD_89_9   KD_98_9_2                 Likely Benign                  1
LG_3_45   LG_3_45_2                           VUS                  2
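If the original row labels and column layout should be preserved, one possible alternative (not part of the original answer) is to pick each group's most severe row with idxmax and select those labels with loc. Note one difference under this assumption: on a tie, idxmax returns the first occurrence, so for KD_89_9 this sketch keeps the original sample instead of the _2 duplicate shown above. The names severity, code, best_idx and one_per_group are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(columns=['Sample', 'same', 'New Category'],
                  data=[
                      ['HG_12_34', 'HG_12_34', 'Pathogenic/Likely Pathogenic'],
                      ['HG_12_34_2', 'HG_12_34', 'Likely Benign'],
                      ['KD_89_9', 'KD_89_9', 'Likely Benign'],
                      ['KD_98_9_2', 'KD_89_9', 'Likely Benign'],
                      ['LG_3_45', 'LG_3_45', 'Likely Benign'],
                      ['LG_3_45_2', 'LG_3_45', 'VUS'],
                  ])

# Higher value means more damaging.
severity = {'Likely Benign': 1, 'VUS': 2, 'Pathogenic/Likely Pathogenic': 3}
code = df['New Category'].map(severity)

# For each group in 'same', find the row label with the highest severity.
best_idx = code.groupby(df['same']).idxmax()

# Select exactly those rows; the original index and columns are kept.
one_per_group = df.loc[best_idx]

print(one_per_group['Sample'].tolist())  # ['HG_12_34', 'KD_89_9', 'LG_3_45_2']
```

Which row wins a tie is a policy decision; if the duplicate should win instead, the sort-and-last approach from the answer, with a stable sort, is one way to get that.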

Regarding "python - Conditionally deleting rows in pandas does not work as expected", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40553480/
