
python - Modify a pandas DataFrame by row names

Reposted. Author: 行者123. Updated: 2023-12-01 02:52:32

First of all, I don't think the question title explains the problem well. Feel free to change it or suggest a better one.

I am reading a CSV file in the following format:

"sample","module","status","tot.seq","seq.length","pct.gc","pct.dup"
"ERR435952_cleaned_1","Basic Statistics","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Per base sequence quality","FAIL","15529112","62",47,41.66
"ERR435952_cleaned_1","Per tile sequence quality","FAIL","15529112","62",47,41.66
"ERR435952_cleaned_1","Per sequence quality scores","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Per base sequence content","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Per sequence GC content","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Per base N content","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Sequence Length Distribution","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Sequence Duplication Levels","WARN","15529112","62",47,41.66
"ERR435952_cleaned_1","Overrepresented sequences","WARN","15529112","62",47,41.66
"ERR435952_cleaned_1","Adapter Content","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Kmer Content","FAIL","15529112","62",47,41.66
"ERR435952_cleaned_2","Basic Statistics","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Per base sequence quality","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Per tile sequence quality","WARN","15529112","62",48,42.44
"ERR435952_cleaned_2","Per sequence quality scores","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Per base sequence content","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Per sequence GC content","WARN","15529112","62",48,42.44
"ERR435952_cleaned_2","Per base N content","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Sequence Length Distribution","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Sequence Duplication Levels","WARN","15529112","62",48,42.44
"ERR435952_cleaned_2","Overrepresented sequences","WARN","15529112","62",48,42.44
"ERR435952_cleaned_2","Adapter Content","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Kmer Content","FAIL","15529112","62",48,42.44

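I want to convert it into a wide layout, with one row per sample and one column per module, so that I can build a simple heatmap from the PASS/FAIL/WARN values (keeping the total number of reads, tot.seq).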

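I know I could do this by counting rows (each module/feature value recurs at a fixed interval), but that is not very tidy, and I am not sure it would be efficient for large datasets. Is there a way to map the values by name rather than by interval (i.e. i, i+n, and so on)?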

Best answer

Use set_index + unstack, then add reset_index to turn the index levels back into columns and rename_axis to remove the name of the columns axis (module):

df = (df.set_index(['sample', 'tot.seq', 'module'])['status']
        .unstack()
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
sample tot.seq Adapter Content Basic Statistics \
0 ERR435952_cleaned_1 15529112 PASS PASS
1 ERR435952_cleaned_2 15529112 PASS PASS

Kmer Content Overrepresented sequences Per base N content \
0 FAIL WARN PASS
1 FAIL WARN PASS

Per base sequence content Per base sequence quality Per sequence GC content \
0 PASS FAIL PASS
1 PASS PASS WARN

Per sequence quality scores Per tile sequence quality \
0 PASS FAIL
1 PASS WARN

Sequence Duplication Levels Sequence Length Distribution
0 WARN PASS
1 WARN PASS
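
For readers who want to try the reshape end to end, here is a minimal, self-contained sketch. It assumes a trimmed copy of the question's CSV (only two modules per sample) held in an in-memory string; the names csv_text and wide are illustrative and do not come from the original answer.

```python
import io

import pandas as pd

# Trimmed copy of the CSV from the question (two modules per sample).
csv_text = '''"sample","module","status","tot.seq","seq.length","pct.gc","pct.dup"
"ERR435952_cleaned_1","Basic Statistics","PASS","15529112","62",47,41.66
"ERR435952_cleaned_1","Kmer Content","FAIL","15529112","62",47,41.66
"ERR435952_cleaned_2","Basic Statistics","PASS","15529112","62",48,42.44
"ERR435952_cleaned_2","Kmer Content","FAIL","15529112","62",48,42.44
'''
df = pd.read_csv(io.StringIO(csv_text))

# Move the identifying columns into the index, keep only 'status',
# then pivot the innermost index level ('module') into columns.
wide = (
    df.set_index(["sample", "tot.seq", "module"])["status"]
      .unstack()
      .reset_index()              # 'sample' and 'tot.seq' become columns again
      .rename_axis(None, axis=1)  # drop the leftover 'module' label on the columns axis
)
print(wide)
```

Because the lookup is driven by the values in module, the row order in the file does not matter, which is what the question asks for when it wants mapping by name instead of by interval.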

However, if you get:

ValueError: Index contains duplicate entries, cannot reshape

then the index contains duplicate entries and the data has to be aggregated:

print (df)
sample module status tot.seq \
0 ERR435952_cleaned_1 Basic Statistics PASS 15529112
1 ERR435952_cleaned_1 Per base sequence quality FAIL 15529112
2 ERR435952_cleaned_1 Per base sequence quality FAIL 15529112
3 ERR435952_cleaned_1 Per sequence quality scores PASS 15529112

seq.length pct.gc pct.dup
0 62 47 41.66
1 62 47 41.66
2 62 47 41.66
3 62 47 41.66
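
Before aggregating, it can help to see which rows collide. The sketch below is not part of the original answer: it rebuilds the four-row sample above by hand (omitting the seq.length, pct.gc and pct.dup columns) and uses DataFrame.duplicated on the key columns. The names keys, dupes and deduped are illustrative.

```python
import pandas as pd

# Hand-built copy of the four-row sample shown above.
df = pd.DataFrame({
    "sample": ["ERR435952_cleaned_1"] * 4,
    "module": ["Basic Statistics", "Per base sequence quality",
               "Per base sequence quality", "Per sequence quality scores"],
    "status": ["PASS", "FAIL", "FAIL", "PASS"],
    "tot.seq": [15529112] * 4,
})

# The columns that would form the index before unstacking.
keys = ["sample", "tot.seq", "module"]

# keep=False marks every member of a duplicated group, not only the repeats.
dupes = df[df.duplicated(subset=keys, keep=False)]
print(dupes)

# If the repeats are exact copies, dropping them is an alternative to
# aggregating (an assumption about the data, so check dupes first).
deduped = df.drop_duplicates(subset=keys)
```

If the duplicated rows carry different statuses, dropping them would silently lose information, and an aggregation like the ones below is the safer choice.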

df = (df.pivot_table(index=['sample', 'tot.seq'], columns='module',
                     values='status', aggfunc=', '.join)
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
sample tot.seq Basic Statistics Per base sequence quality \
0 ERR435952_cleaned_1 15529112 PASS FAIL, FAIL

Per sequence quality scores
0 PASS

Alternatively, with groupby and unstack:

df = (df.groupby(['sample', 'tot.seq', 'module'])['status']
        .apply(', '.join)
        .unstack()
        .reset_index()
        .rename_axis(None, axis=1))
print(df)

sample tot.seq Basic Statistics Per base sequence quality \
0 ERR435952_cleaned_1 15529112 PASS FAIL, FAIL

Per sequence quality scores
0 PASS
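
Since the question's goal is a heatmap, the wide table of strings usually has to be turned into numbers first. The following is only a sketch of one possible encoding and is not from the original answer: the status_codes mapping (PASS=0, WARN=1, FAIL=2) is a hypothetical choice, and the small wide frame is typed in by hand from the question's data.

```python
import pandas as pd

# Hand-typed excerpt of the reshaped table (three modules only).
wide = pd.DataFrame({
    "sample": ["ERR435952_cleaned_1", "ERR435952_cleaned_2"],
    "tot.seq": [15529112, 15529112],
    "Basic Statistics": ["PASS", "PASS"],
    "Per base sequence quality": ["FAIL", "PASS"],
    "Per tile sequence quality": ["FAIL", "WARN"],
})

# Hypothetical numeric encoding; any ordering that suits the colour map works.
status_codes = {"PASS": 0, "WARN": 1, "FAIL": 2}

# Keep the identifiers in the index and map each status column by value.
matrix = (
    wide.set_index(["sample", "tot.seq"])
        .apply(lambda col: col.map(status_codes))
)
print(matrix)
```

The resulting numeric frame can then be handed to a plotting function such as seaborn.heatmap. Note that joined values like "FAIL, FAIL" from the aggregated variants are not keys in the mapping and would become NaN, so they need to be resolved to a single status beforehand.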

Regarding "python - Modify a pandas DataFrame by row names", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44591929/
