
awk - Add a newline after a specific pattern

Reposted. Author: 行者123. Updated: 2023-12-01 08:46:52

I have a file containing thousands of protein sequences, in the following format:

>EgrG_000615900transcript=EgrG_000615900gene=EgrG_000615900MAIRSFGRIAPARSLLIHFKLVTDAFHGEAPSGPYLLPQAARSLLCEKCDGKCVICDSYVRPCTLVRICDECNYGSYQGRCVICGGTGVSDAYYCRESPKPTSFTKGRNMDSKNDLISNKFTMHADVIISILKPGLFVIVDFFIV

Each protein is currently on its own line. The 'MAIRS...FFIV' part is the protein sequence and the text before it is the accession. I would like the protein sequence to be on a new line, i.e. I want a line break between '....EgrG_000615900' (the numbers here vary, but there are always 9 digits) and 'MAIRS....'. Ideally, the output would look like this:

>EgrG_000615900transcript=EgrG_000615900gene=EgrG_000615900
MAIRSFGRIAPARSLLIHFKLVTDAFHGEAPSGPYLLPQAARSLLCEKCDGKCVICDSYVRPCTLVRICDECNYGSYQGRCVICGGTGVSDAYYCRESPKPTSFTKGRNMDSKNDLISNKFTMHADVIISILKPGLFVIVDFFIV

Each protein in the file begins with the pattern >EgrG_.........transcript=EgrG_.........gene=EgrG_......... (dot representing any digit 0-9).

I have tried

sed  's/>EgrG_.........transcript=EgrG_.........gene=EgrG_........./&\n/g' input file > output file

but this does not work.

Update: Thanks, everyone, for your attention. In hindsight, I think I can simplify my request. Here is a larger sample from my file:

>EgrG_000615900 transcript=EgrG_000615900 gene=EgrG_000615900MAIRSFGRIAPARSLLIHFKLVTDAFHGEAPSGPYLLPQAARSLLCEKCDGKCVICDSYVRPCTLVRICDECNYGSYQGRCVICGGTGVSDAYYCRESPKPTSFTKGRNMDSKNDLISNKFTMHADVIISILKPGLFVIVDFFIV>EgrG_001057700 transcript=EgrG_001057700 gene=EgrG_001057700MEESNSEPVIFQVSKLAGRHNYTSFGHKEDLDPQNKFSIPSPADHPGKHRSVLRSLFKGMSSGGKNVALEEQQPTYRQAGSSSHHRYHIHHYPHNPSDDRRPLRGPCFPHMSSSSQSASAFSSPNSSSSPGQRVSTFHAGLREEVLEQDGTSSTTQANFSEEPLVLLVLFPASKSKEAVLPLTTVGRNDCCATASVFTLRLASTYCDVAFFINYFS>EgrG_000972800 transcript=EgrG_000972800 gene=EgrG_000972800MTSYCAVFMVPLLTLLILWGHLPACESTPLPSELIVRRGRTLQDLYRYVQQQYLMCLKCPNCPCETKFNIRRRSGGINWPQYMNASGMTAKNMEEALDDY>EgrG_000198800 transcript=EgrG_000198800 gene=EgrG_000198800MPETGKSGGTTISSKTKSTAVSSGTPVKPMKSESCRLISGESPTSVVILKPAWASFVTPFPPVQEKCCKCGQLVRFSDRIELLGKVFHESCFRCAVCNRPLSNSEAIFHSNAWNCEAHASSYPRLYAS`

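Although it may not look that way here, in my file each of these four sequences is on a single line. The digits of the accession change from protein to protein, but the characters stay the same (so the accession can be represented as >EgrG_......... transcript=EgrG_......... gene=EgrG_......... ). You may notice that the actual protein sequence begins with 'M' in every case. These are the only consistent features across all the proteins/lines in my file. At the moment, my file has the accession and the protein sequence together on a single line, but I would like the sequences above to be formatted as: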

>EgrG_000615900 transcript=EgrG_000615900 gene=EgrG_000615900
MAIRSFGRIAPARSLLIHFKLVTDAFHGEAPSGPYLLPQAARSLLCEKCDGKCVICDSYVRPCTLVRICDECNYGSYQGRCVICGGTGVSDAYYCRESPKPTSFTKGRNMDSKNDLISNKFTMHADVIISILKPGLFVIVDFFIV
>EgrG_001057700 transcript=EgrG_001057700 gene=EgrG_001057700
MEESNSEPVIFQVSKLAGRHNYTSFGHKEDLDPQNKFSIPSPADHPGKHRSVLRSLFKGMSSGGKNVALEEQQPTYRQAGSSSHHRYHIHHYPHNPSDDRRPLRGPCFPHMSSSSQSASAFSSPNSSSSPGQRVSTFHAGLREEVLEQDGTSSTTQANFSEEPLVLLVLFPASKSKEAVLPLTTVGRNDCCATASVFTLRLASTYCDVAFFINYFS
>EgrG_000972800 transcript=EgrG_000972800 gene=EgrG_000972800
MTSYCAVFMVPLLTLLILWGHLPACESTPLPSELIVRRGRTLQDLYRYVQQQYLMCLKCPNCPCETKFNIRRRSGGINWPQYMNASGMTAKNMEEALDDY
>EgrG_000198800 transcript=EgrG_000198800 gene=EgrG_000198800
MPETGKSGGTTISSKTKSTAVSSGTPVKPMKSESCRLISGESPTSVVILKPAWASFVTPFPPVQEKCCKCGQLVRFSDRIELLGKVFHESCFRCAVCNRPLSNSEAIFHSNAWNCEAHASSYPRLYAS

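i.e. the accession on one line and the protein sequence on the next. In summary, what I need is for each line to be split between

>EgrG_......... transcript=EgrG_......... gene=EgrG_.........

and the first 'M'.

Thanks again for your patience.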
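The split described in the update can be sketched directly in awk, which the question title mentions. This is only a minimal sketch under stated assumptions: `split_header` is a hypothetical helper name, and the pattern assumes every header ends with `gene=EgrG_` followed by digits, as in the sample above. It uses `[0-9]+` rather than an exact count of nine digits because interval expressions such as `{9}` are not supported by every awk.

```shell
# split_header is a hypothetical helper name, not part of the original post.
# It breaks each line right after the last header field (gene=EgrG_ plus digits),
# so whatever follows (the sequence, which starts with M) lands on its own line.
split_header() {
  awk '{
    if (match($0, /gene=EgrG_[0-9]+/)) {
      end = RSTART + RLENGTH - 1      # index of the last header character
      print substr($0, 1, end)        # accession line
      print substr($0, end + 1)       # protein sequence line
    } else {
      print                           # leave non-matching lines untouched
    }
  }' "$@"
}
```

It could be run as `split_header input.fa > output.fa`, where both file names are placeholders.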

Best answer

You can use the list of the twenty amino acids (IUPAC notation, without the stop-codon symbol) to extract the protein sequence:

alanine - A
arginine - R
asparagine - N
aspartic acid - D
cysteine - C
glutamine - Q
glutamic acid - E
glycine - G
histidine - H
isoleucine - I
leucine - L
lysine - K
methionine - M
phenylalanine - F
proline - P
serine - S
threonine - T
tryptophan - W
tyrosine - Y
valine - V
special cases:
asparagine/aspartic acid - B
glutamine/glutamic acid - Z

With gnu-sed:

sed -r 's/[ARNDCQEGHILKMFPSTWYVBZ]+$/\n&/' file

With sed:

sed 's/[ARNDCQEGHILKMFPSTWYVBZ]*$/\'$'\n&/g' file

You get output that conforms to the FASTA format:

>EgrG_000615900transcript=EgrG_000615900gene=EgrG_000615900
MAIRSFGRIAPARSLLIHFKLVTDAFHGEAPSGPYLLPQAARSLLCEKCDGKCVICDSYVRPCTLVRICDECNYGSYQGRCVICGGTGVSDAYYCRESPKPTSFTKGRNMDSKNDLISNKFTMHADVIISILKPGLFVIVDFFIV
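With thousands of records, it may be worth checking the result mechanically. The sketch below is one possible check, not part of the original answer: `check_fasta` is a hypothetical helper name, and it assumes the output strictly alternates between a header line starting with `>` and a single sequence line made up only of the letters in the character class above. Sequences containing other symbols (for example X or *) would be counted as problems.

```shell
# check_fasta is a hypothetical helper name, not part of the original answer.
# It prints how many lines break the expected alternation:
#   odd-numbered lines  must start with ">"
#   even-numbered lines must contain only the amino-acid letters from the answer
check_fasta() {
  awk '
    NR % 2 == 1 && !/^>/                          { bad++ }
    NR % 2 == 0 && !/^[ARNDCQEGHILKMFPSTWYVBZ]+$/ { bad++ }
    END { print bad + 0 }                         # prints 0 when every line fits
  ' "$@"
}
```

Running `check_fasta output.fa` (a placeholder file name) and getting a count other than 0 would suggest that some lines were not split as intended.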

Regarding awk - adding a newline after a specific pattern, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42462217/
