gpt4 book ai didi

bash - 如何从基因列表中挑选多个fasta序列

转载 作者:行者123 更新时间:2023-12-01 09:01:53 26 4
gpt4 key购买 nike

我有两个文件

基因列表文件是这样的

LOC_Os06g12230.1
Pavir.Ab03005
Pavir.J14065
ChrUn.fgenesh
Sevir.1G325700
LOC_Os02g51280.1
Bradi3g59320
Brast04G017400

Fasta 序列文件是这样的

>LOC_Os03g57190.1 pacid=33130570 polypeptide=LOC_Os03g57190.1 locus=LOC_Os03g57190 ID=LOC_Os03g57190.1.MSUv7.0 annot-version=v7.0
ATGGAGGCGGCGGTGGGGGACGGGGAAGGCGGTGGCGGCGGCGGCGGGCGGGGGAAGCGTGGGCGGGGAGGAGGAGGAGG
GGAGATGGTGGAGGCGGTGTGGGGGCAGACGGGGAGTACGGCGTCGCGGATCTACAGGGTGAGGGCGACGGGGGGGAAGG
ACAGGCACAGCAAGGTGTACACGGCGAAGGGAATCCGCGACCGCCGCGTCCGCCTCTCCGTCGCCACCGCCATCCAGTTC
TACGACCTCCAGGACCGCCTCGGCTTCGACCAGCCGAGCAAGGCCATCGAGTGG
>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA
>LOC_Os05g43760.1 pacid=33158388 polypeptide=LOC_Os05g43760.1 locus=LOC_Os05g43760 ID=LOC_Os05g43760.1.MSUv7.0 annot-version=v7.0
ATGACAAGCAATAACAGCACGAATGAGGAGCTCGGCGGCGGCGGCAGGAAGGCGGCCGACAAGCCGAGCGGCGGCGGCGG
CGCCGCCGCCGCCGTGGCGAGCTCGCGGCACTGGTCGGCGTCGACGGAGTCGCGGATCGTGCGCGTGTCGAGGGTGTTCG
GCGGCAAGGACCGTCACAGCAAGGTGAGGACGGTGAAGGGGCTCCGCGACCGGCGGGTGCGGCTGTCGGTGCCGACGGCG
ATCCAGCTCTACGACCTGCAGGACCGGCTGGGGCTCAGCCAGCCGAGCAAGGTGGTCGACT

如果基因名标题行匹配,则序列必须被拉出到新文件中

新文件应该包含

>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA

我试过这样

grep -f genelist.txt -A3 fastafile.txt >> newfasta.txt

但不同的fasta序列长度不同,

模式匹配后,我想选择直到下一个'>'符号出现

最佳答案

能否请您尝试以下。

awk '
FNR==NR{
a[$0]
next
}
/^>/{
found=""
}
($2 in a){
found=1
}
found
' Input_file_gene FS="[> ]" Input_file

输出如下。

>LOC_Os02g51280.1 pacid=33134358 polypeptide=LOC_Os02g51280.1 locus=LOC_Os02g51280 ID=LOC_Os02g51280.1.MSUv7.0 annot-version=v7.0
ATGACCATGGACGTCGCCGGAGACGCCGGAGGTGGCCGCCGCCCAAACTTCCCCTTGCAGCTTCTTGAGAAGAAGGAGGA
CGGGCGGTGCCGGAGGGGAGATGCAGCTGCGGAAGGCGGCGCCGAAGCGGAGCTCCACCAAGGACCGGCACACCAAGGTG
GAAGGGAGGGGGCGGCGCATCCGGATGCCGGCGCTGTGCGCGGCGAGGGTGTTCCAGCTGACGCGGGAGCTGG
>LOC_Os06g12230.1 pacid=33145596 polypeptide=LOC_Os06g12230.1 locus=LOC_Os06g12230 ID=LOC_Os06g12230.1.MSUv7.0 annot-version=v7.0
ATGGATGTCACCGGAGACGGCGGAGGAGGAGGGCAACGGCCCAATTTCCCCCTGCAGCTCCTCGGGAAGAAGGAGGAGCA
GACGTGCTCGACGTCGCAGACTGCCGGGGCGGGCGGCGGCGGCGTCGTGGGCGCGAATGGGTCGGCGGCGGCGGCGCCGC
CGAAGCGGACGTCGACGAAGGACCGGCACACGAAGGTGGACGGGCGGGGGCGGCGCATCCGGATGCCGGCGATCTGCGCC
GCGCGGGTGTTCCAGCTGACGCGGGAGCTCGGGCACAAGACCGACGGCGA

关于bash - 如何从基因列表中挑选多个fasta序列,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/60769591/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com