linux - How to rewrite the last column of a file using sed or Python

Reposted. Author: 太空狗. Updated: 2023-10-29 12:39:50

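I have this GFF file, which does not seem to be in the format I expect. The problem is in the last column (although some lines do seem to contain extra tabs when I split on tabs). This is what I have: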

scaffold10x_1000_pilon  AUGUSTUS        gene    12711   22079   0.67    -       .       g1
scaffold10x_1000_pilon AUGUSTUS transcript 12711 22079 0.47 - . g1.t1
scaffold10x_1000_pilon AUGUSTUS stop_codon 12711 12713 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS intron 13044 13486 0.89 - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS intron 13936 21904 0.5 - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS CDS 12711 13043 0.99 - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS CDS 13487 13935 0.64 - 2 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS CDS 21905 22079 0.67 - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS start_codon 22077 22079 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS transcript 12711 14150 0.2 - . g1.t2
scaffold10x_1000_pilon AUGUSTUS stop_codon 12711 12713 . - 0 transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS intron 13044 13486 0.91 - . transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS intron 13936 14128 0.2 - . transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS CDS 12711 13043 0.96 - 0 transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS CDS 13487 13935 0.45 - 2 transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS CDS 14129 14150 0.21 - 0 transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS start_codon 14148 14150 . - 0 transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS gene 41722 42102 0.32 + . g2
scaffold10x_1000_pilon AUGUSTUS transcript 41722 42102 0.32 + . g2.t1
scaffold10x_1000_pilon AUGUSTUS start_codon 41722 41724 . + 0 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1000_pilon AUGUSTUS CDS 41722 42102 0.32 + 0 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1000_pilon AUGUSTUS stop_codon 42100 42102 . + 0 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1000_pilon AUGUSTUS gene 106074 106640 1 + . g3
scaffold10x_1000_pilon AUGUSTUS transcript 106074 106640 1 + . g3.t1
scaffold10x_1000_pilon AUGUSTUS start_codon 106074 106076 . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold10x_1000_pilon AUGUSTUS CDS 106074 106640 1 + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold10x_1000_pilon AUGUSTUS stop_codon 106638 106640 . + 0 transcript_id "g3.t1"; gene_id "g3";

This is what I want:

scaffold10x_1000_pilon  AUGUSTUS        gene    12711   22079   0.67    -       .       ID=g1
scaffold10x_1000_pilon AUGUSTUS transcript 12711 22079 0.47 - . ID=g1.t1;Parent=g1
scaffold10x_1000_pilon AUGUSTUS stop_codon 12711 12713 . - 0 Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS intron 13044 13486 0.89 - . Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS intron 13936 21904 0.5 - . Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS CDS 12711 13043 0.99 - 0 ID=g1.t1.cds;Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS CDS 13487 13935 0.64 - 2 ID=g1.t1.cds;Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS CDS 21905 22079 0.67 - 0 ID=g1.t1.cds;Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS start_codon 22077 22079 . - 0 Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS transcript 12711 14150 0.2 - . ID=g1.t2;Parent=g1
scaffold10x_1000_pilon AUGUSTUS stop_codon 12711 12713 . - 0 Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS intron 13044 13486 0.91 - . Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS intron 13936 14128 0.2 - . Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS CDS 12711 13043 0.96 - 0 ID=g1.t2.cds;Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS CDS 13487 13935 0.45 - 2 ID=g1.t2.cds;Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS CDS 14129 14150 0.21 - 0 ID=g1.t2.cds;Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS start_codon 14148 14150 . - 0 Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS gene 41722 42102 0.32 + . ID=g2
scaffold10x_1000_pilon AUGUSTUS transcript 41722 42102 0.32 + . ID=g2.t1;Parent=g2
scaffold10x_1000_pilon AUGUSTUS start_codon 41722 41724 . + 0 Parent=g2.t1
scaffold10x_1000_pilon AUGUSTUS CDS 41722 42102 0.32 + 0 ID=g2.t1.cds;Parent=g2.t1
scaffold10x_1000_pilon AUGUSTUS stop_codon 42100 42102 . + 0 Parent=g2.t1
scaffold10x_1000_pilon AUGUSTUS gene 106074 106640 1 + . ID=g3
scaffold10x_1000_pilon AUGUSTUS transcript 106074 106640 1 + . ID=g3.t1;Parent=g3
scaffold10x_1000_pilon AUGUSTUS start_codon 106074 106076 . + 0 Parent=g3.t1
scaffold10x_1000_pilon AUGUSTUS CDS 106074 106640 1 + 0 ID=g3.t1.cds;Parent=g3.t1
scaffold10x_1000_pilon AUGUSTUS stop_codon 106638 106640 . + 0 Parent=g3.t1

I tried the grep and sed commands on Linux, which seem like they should be able to do this, and then turned to Python.

As for Python, I tried reading the file split on tabs and then using column-based indexing to work with columns 3 and 9, which I can index as data[2] and data[8].

This is what I wrote. I know it is probably not that hard; this is just my line of thinking, and I am also fairly new to Python.

data = open("my my_bad_gff", 'r')
new_file = ''
for line in data:

    columns = line.rstrip("\n").split("\t")

    scaffold = columns[0]
    source = columns[1]
    feature = columns[2]
    start = columns[3]
    end = columns[4]
    score = columns[5]
    strand = columns[6]
    frame = columns[7]
    attribute = columns[8]

    if feature == 'gene':  # I'm trying to take the row called gene and assign its content as x, which is g1 in this case
        a = str(columns[8])
        b = 'ID=' + a  # which I think should give me ID=g1
    if feature == 'transcript':
        columns[8] = a + '.t1' + ';Parent=' + a  # hoping it gives me ID=g1.t1;Parent=g1, but how can I make sure '.t1' is not fixed, since the transcript number can increase for each gene

    if feature == 'intron' and 'start_codon' and 'stop_codon':
        columns[8] = 'Parent=' + a + '.t1'  # should give me Parent=g1.t1
        d = columns[8]
    if feature == 'CDS':
        columns[8] = a + '.t1' + 'cds;' + d  # hoping this gives me ID=g1.t1.cds;Parent=g1.t1
    new_file.append(data)

Is there a command-line tool that can do this for me, or do I have to use Python? Thanks.

Best answer

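Maybe this can get you started with awk (the three-argument form of match() used here is a GNU awk extension):

$ awk 'BEGIN{FS=OFS="\t"}
       {match($9,"(g[0-9]+)",m)}
       $3=="gene"{$9="ID="m[1]}

       # add other conditions
       # using the same template
       # ...

       1' file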

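Note that in this condition you actually mean OR, not AND: feature == 'intron' and 'start_codon' and 'stop_codon'. As written, Python evaluates it as (feature == 'intron') and 'start_codon' and 'stop_codon', and non-empty strings are always true, so it only ever tests for 'intron'. Something like feature in ('intron', 'start_codon', 'stop_codon') expresses the intended check.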
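To make the AND/OR point concrete, here is a small Python sketch comparing the chained condition from the question with a membership test. The helper names chained_and and membership, and the CHILD_FEATURES tuple, are made up for this illustration and are not part of the original script.

```python
# Feature types that should all receive a plain "Parent=..." attribute.
CHILD_FEATURES = ("intron", "start_codon", "stop_codon")


def chained_and(feature):
    """The condition as written in the question.

    Python parses it as (feature == "intron") and "start_codon" and "stop_codon".
    The two string literals are always truthy, so only the first comparison matters.
    """
    return bool(feature == "intron" and "start_codon" and "stop_codon")


def membership(feature):
    """The intended check: is the feature any one of the three types?"""
    return feature in CHILD_FEATURES


for name in ("intron", "start_codon", "stop_codon", "CDS"):
    print(name, chained_and(name), membership(name))
```

With the chained version, only 'intron' rows pass the test, so start_codon and stop_codon rows would be left unchanged.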

The match() call on $9 extracts the g1, g2, etc. value from the last column into m[1]; the rest should be easy to read.
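If you would rather stay in Python, the same idea could be sketched roughly as below. This is only one possible approach, not a tested converter: the function names and the regular expression are made up for this illustration, and it assumes that gene rows carry just the gene ID, that transcript rows carry an ID such as g1.t2 (gene ID before the last dot), and that all other rows carry a transcript_id "..." entry, as in the sample above. Taking the transcript ID from each row itself avoids hard-coding '.t1'.

```python
import re

# Matches the quoted value in: transcript_id "g1.t1"; gene_id "g1";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def convert_attribute(feature, attribute):
    """Return the rewritten last column for one row (assumed layouts only)."""
    attribute = attribute.strip()
    if feature == "gene":
        return "ID=" + attribute
    if feature == "transcript":
        # "g1.t2" -> parent gene "g1" (assumes the gene ID precedes the last dot).
        gene_id = attribute.rsplit(".", 1)[0]
        return "ID={};Parent={}".format(attribute, gene_id)
    found = TRANSCRIPT_ID.search(attribute)
    if found is None:
        return attribute  # unrecognised layout: leave the column untouched
    transcript_id = found.group(1)
    if feature == "CDS":
        return "ID={0}.cds;Parent={0}".format(transcript_id)
    return "Parent=" + transcript_id  # intron, start_codon, stop_codon, ...


def convert_line(line):
    """Rewrite column 9 of one tab-separated line; pass other lines through."""
    if not line.strip() or line.startswith("#"):
        return line
    # Split at most 8 times so any extra tabs stay inside the last column.
    columns = line.rstrip("\n").split("\t", 8)
    if len(columns) < 9:
        return line
    columns[8] = convert_attribute(columns[2], columns[8])
    return "\t".join(columns) + "\n"


def convert_file(in_path, out_path):
    """Convert a whole file; both paths are supplied by the caller."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(convert_line(line))
```

Calling convert_file with your input and output paths would then write the converted file. Lines with fewer than nine tab-separated columns are copied unchanged, so it is worth checking the output for rows that were skipped.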

Regarding "linux - How to rewrite the last column of a file using sed or Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49382718/
