
python - Pandas groupby : get best zscore for counts() of each group

Reposted. Author: 行者123. Updated: 2023-12-01 09:20:29

I have a pandas groupby object that returns the count for each gene type, roughly as shown below (column headers formatted by hand for clarity):

counts = df.groupby(["ID", "Gene"]).size()

counts
ID       Gene     Count
1_1_1    SMARCB1      1
         smad        12
1_1_10   SMARCB1      2
         smad        17
1_1_100  SMARCB1      3

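I need to compute the z-score within each group and then return the gene with the highest z-score.

I tried the following, but it appears to compute the z-score over the entire dataset and does not return the correct values: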

zscore = lambda x: (x - x.mean()) / x.std()
counts = df.groupby(["ID", "Match"]).size().pipe(zscore)

I tried transform and got the same result.

I also tried:

counts = match_df.groupby(["ID", "Match"]).size().apply(zscore)

which gives me the following error:

'int' object has no attribute 'mean'

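No matter what I try, it does not give the correct output. The z-scores for the first two rows should be [-1, 1], in which case I would return the row for 1_1_1 SMARCB1, and so on. Thanks!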
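The expected [-1, 1] can be checked by hand, and doing so shows that the result depends on which standard deviation is used. A minimal sketch, using the counts of the first group from the sample above (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# Counts for ID 1_1_1 from the sample output: SMARCB1 = 1, smad = 12
group = pd.Series([1, 12], index=["SMARCB1", "smad"])

# Population standard deviation (ddof=0), the NumPy default
z_population = (group - group.mean()) / np.std(group.to_numpy())

# Sample standard deviation (ddof=1), the default of pandas Series.std()
z_sample = (group - group.mean()) / group.std()

print(z_population.tolist())       # [-1.0, 1.0]
print(z_sample.round(6).tolist())  # [-0.707107, 0.707107]
```

Only the population form reproduces the [-1, 1] quoted in the question.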

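Update

Thanks to @ZaxR's help, and by switching to the numpy mean and standard deviation, I was able to solve this as shown below. The solution also produces a summary DataFrame with the raw count and z-score for each gene: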

import numpy as np

# group by ID and gene match, and count the hits to each molecule
counts = df.groupby(["ID", "Match"]).size()

# calculate the z-score of the molecule counts within each feature (ID);
# np.std uses the population standard deviation (ddof=0)
zscore = lambda x: (x - np.mean(x)) / np.std(x)
# transform keeps the original (ID, Match) index; features that align to
# only one molecule give 0/0 = NaN and are assigned a score of 1
zscores = counts.groupby("ID").transform(zscore).fillna(1).to_frame("Zscore")

# put the z-scores and counts side by side, ready to
# merge with positions and save to file
zscore_df = zscores.reset_index()
zscore_df.columns = ["ID", "Match", "Zscore"]
count_df = counts.reset_index()
count_df.columns = ["ID", "Match", "Counts"]
zscore_df["Counts"] = count_df["Counts"]

# select the gene with the best z-score for each ID
max_df = zscore_df[zscore_df.groupby("ID")["Zscore"].transform("max")
                   == zscore_df["Zscore"]]
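The single-molecule case is the subtle part of the update: with np.std the division becomes 0/0, which pandas reports as NaN, and that is what fillna replaces. A small sketch with made-up IDs "A" and "B" (hypothetical data, not from the question), assuming a numeric fill value of 1:

```python
import numpy as np
import pandas as pd

# Hypothetical counts: feature "A" hits two molecules, feature "B" only one
counts = pd.Series(
    [1, 12, 3],
    index=pd.MultiIndex.from_tuples(
        [("A", "SMARCB1"), ("A", "smad"), ("B", "SMARCB1")],
        names=["ID", "Match"],
    ),
)

zscore = lambda x: (x - np.mean(x)) / np.std(x)

# Per-ID z-score; the lone row of "B" evaluates to 0 / 0 = NaN
raw = counts.groupby("ID").transform(zscore)

# Fill with the number 1 so the column stays numeric and can be compared
filled = raw.fillna(1)
print(filled.tolist())
```

Filling with the string '1' instead would turn the column into mixed types, which makes the later max comparison unreliable.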

Best answer

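The reason df.groupby(["ID", "Gene"]).size().transform(zscore) does not work is that .size() returns an ordinary Series of counts, so anything chained after it, .transform(zscore) included, operates on the dataset as a whole rather than per ID. With .pipe(zscore) the lambda receives the entire Series and computes a single mean and standard deviation. With .apply(zscore) the lambda is called on each count individually, and a single integer has no .mean() method, which is where the 'int' object has no attribute 'mean' error comes from. Two further points: the last group (1_1_100) contains only one item, so its z-score is undefined (NaN), and x.std() in pandas does not behave like np.std(x), because pandas defaults to the sample standard deviation (ddof=1) while NumPy defaults to the population standard deviation (ddof=0).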
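The difference between the three calls can be seen in a short sketch. It assumes a hand-built Series as a stand-in for the output of .size(), filled with the sample values from the question; the names whole, error and per_id are illustrative only:

```python
import pandas as pd

zscore = lambda x: (x - x.mean()) / x.std()

# Stand-in for df.groupby(["ID", "Gene"]).size(): a Series with a two-level index
index = pd.MultiIndex.from_tuples(
    [("1_1_1", "SMARCB1"), ("1_1_1", "smad"),
     ("1_1_10", "SMARCB1"), ("1_1_10", "smad"),
     ("1_1_100", "SMARCB1")],
    names=["ID", "Gene"],
)
counts = pd.Series([1, 12, 2, 17, 3], index=index)

# .pipe hands the whole Series to zscore: one mean and std for all five rows
whole = counts.pipe(zscore)

# .apply hands zscore one integer at a time, and an int has no .mean()
try:
    counts.apply(zscore)
    error = None
except AttributeError as exc:
    error = str(exc)

# Grouping the counts again by the ID level gives one mean and std per ID
per_id = counts.groupby(level="ID").transform(zscore)

print(error)
print(per_id)
```

Only the last form produces a z-score relative to the other genes of the same ID; the single-gene ID 1_1_100 comes out as NaN.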

Update

I think this should do it:

import pandas as pd

# Setup code
df = pd.DataFrame({"ID": ["1_1_1", "1_1_1", "1_1_10", "1_1_10", "1_1_100"],
                   "Gene": ["SMARCB1", "smad", "SMARCB1", "smad", "SMARCB1"],
                   "Count": [1, 12, 2, 17, 3]})
df = df.set_index(['ID', 'Gene'])

# z-score function from the question (x.std() is the sample standard deviation)
zscore = lambda x: (x - x.mean()) / x.std()

# Add the within-ID z-score for every row (stored in the column 'std_dev')
# Note: .apply(zscore) gives the same values, but newer pandas versions
# may prepend the group key to its index
df['std_dev'] = df.groupby('ID')['Count'].transform(zscore)

# Find the max z-score for each group and
# use that as a mask for the original df
df[df.groupby('ID')['std_dev'].transform('max') == df['std_dev']]

Out:
              Count   std_dev
ID     Gene
1_1_1  smad      12  0.707107
1_1_10 smad      17  0.707107
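As an alternative to the boolean mask, the best row per ID can also be picked with idxmax, which returns exactly one row per group even when two genes tie. This is a sketch of a swapped-in technique, not the answer's own method; the column name zscore and the variables valid and best are assumptions made for illustration:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["1_1_1", "1_1_1", "1_1_10", "1_1_10", "1_1_100"],
                   "Gene": ["SMARCB1", "smad", "SMARCB1", "smad", "SMARCB1"],
                   "Count": [1, 12, 2, 17, 3]})

zscore = lambda x: (x - x.mean()) / x.std()
df["zscore"] = df.groupby("ID")["Count"].transform(zscore)

# Drop IDs whose z-score is undefined (single-gene groups), then look up
# the row label of the highest z-score within each ID
valid = df.dropna(subset=["zscore"])
best = valid.loc[valid.groupby("ID")["zscore"].idxmax()]

print(best)
```

As in the output above, the single-gene ID 1_1_100 is left out because its z-score is NaN.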

Regarding python - Pandas groupby : get best zscore for counts() of each group, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50827574/
