I have a dataframe (p4p5_merge) that currently looks like this:
SampleID expr Gene Period tag \
1 HSB666 3.663308 ENSG00000147996 5 HSB666|ENSG00000147996
2 HSB666 3.663308 ENSG00000147996 5 HSB666|ENSG00000147996
3 HSB666 3.663308 ENSG00000147996 5 HSB666|ENSG00000147996
4 HSB666 3.663308 ENSG00000147996 5 HSB666|ENSG00000147996
5 HSB651 3.207474 ENSG00000174749 4 HSB651|ENSG00000174749
6 HSB651 3.207474 ENSG00000174749 4 HSB651|ENSG00000174749
7 HSB651 3.207474 ENSG00000174749 4 HSB651|ENSG00000174749
8 HSB651 3.207474 ENSG00000174749 4 HSB651|ENSG00000174749
9 HSB651 3.207474 ENSG00000174749 4 HSB651|ENSG00000174749
10 HSB195 0.214731 ENSG00000188157 4 HSB195|ENSG00000188157
11 HSB195 0.214731 ENSG00000188157 4 HSB195|ENSG00000188157
12 HSB195 0.214731 ENSG00000188157 4 HSB195|ENSG00000188157
14 HSB152 5.062444 ENSG00000188157 4 HSB152|ENSG00000188157
15 HSB627 2.062444 ENSG00000174749 4 HSB627|ENSG00000174749
16 HSB627 2.062444 ENSG00000174749 4 HSB627|ENSG00000174749
17 HSB627 2.062444 ENSG00000174749 4 HSB627|ENSG00000174749
18 HSB627 2.062444 ENSG00000174749 4 HSB627|ENSG00000174749
19 HSB627 2.062444 ENSG00000174749 4 HSB627|ENSG00000174749
20 HSB627 2.062444 ENSG00000174749 4 HSB627|ENSG00000174749
21 HSB627 2.062444 ENSG00000174749 4 HSB627|ENSG00000174749
22 HSB627 2.062444 ENSG00000174749 4 HSB627|ENSG00000174749
23 HSB627 2.062444 ENSG00000174749 4 HSB627|ENSG00000174749
Consequence
1 upstream_gene_variant
2 upstream_gene_variant
3 upstream_gene_variant
4 upstream_gene_variant
5 upstream_gene_variant
6 upstream_gene_variant
7 upstream_gene_variant
8 upstream_gene_variant
9 upstream_gene_variant
10 upstream_gene_variant
11 upstream_gene_variant
12 upstream_gene_variant
14 upstream_gene_variant
15 upstream_gene_variant
16 upstream_gene_variant
17 upstream_gene_variant
18 upstream_gene_variant
19 upstream_gene_variant
20 upstream_gene_variant
21 upstream_gene_variant
22 upstream_gene_variant
23 intron_variant
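I now want to group by Gene, sort by expr in descending order, and then filter the dataframe down to the rows whose expr value falls in the bottom 10% for each Gene group (below the 10th percentile). So I do the following: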
1) Sort by expression, descending (SUCCEEDS)
p4p5_sort= p4p5_merge.sort_values(['expr', 'Gene'],
ascending=[False, True]).reset_index(drop=True)
2) Group by gene and filter to the bottom 10% of expression per gene (FAILS)
p4p5_bottom10 = (p4p5_sort[p4p5_sort.groupby('Gene')['expr'].
apply(lambda x: x < x.quantile(0.1))])
Step 1 works as expected, but when I run step 2, all I get is the following output:
sys:1: DtypeWarning: Columns (15,16,22,36,37,38,39) have mixed types. Specify dtype option on import or set low_memory=False.
Empty DataFrame
Columns: [SampleID, expr, Gene, Period, tag, Consequence]
Index: []
If it helps, the R equivalent of what I am trying to accomplish is:
p4p5_bottom10 <- p4p5_merge %>% select(Gene, expr, SampleID, Period) %>%
group_by(Gene) %>%
arrange(Gene, desc(expr)) %>%
filter(expr < quantile(expr, 0.1))
You can apply quantile directly to the groupby, like this:
p4p5_bottom10 = pd.DataFrame(p4p5_sort.groupby(['Gene'])['expr'].quantile(0.1))
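The groupby quantile returns a Series (one 10th-percentile value per Gene), so we wrap it in pd.DataFrame() to convert it to a DataFrame.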
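Since the question asks for the rows below each gene's 10th percentile rather than the thresholds themselves, here is a minimal sketch of one way to get there, using groupby with transform so the per-gene threshold lines up with the original rows. The toy frame and its values are made up for illustration and are not the asker's data; only the column names Gene, expr and SampleID come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the question.
df = pd.DataFrame({
    "SampleID": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "Gene": ["A", "A", "A", "B", "B", "B"],
    "expr": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# transform broadcasts each gene's 10th percentile back onto that gene's rows,
# so the result shares df's index and can be compared row by row.
threshold = df.groupby("Gene")["expr"].transform(lambda s: s.quantile(0.1))

# Keep rows strictly below their own gene's threshold, then order them
# by Gene and descending expr, mirroring the R arrange() call.
bottom10 = (
    df[df["expr"] < threshold]
    .sort_values(["Gene", "expr"], ascending=[True, False])
    .reset_index(drop=True)
)

print(bottom10)
```

Note that with a strict `<` comparison, a gene whose expr values are all identical contributes no rows, because no value lies below its own quantile. If such rows should be kept, `<=` may be the better choice.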