
linux - How to retrieve sequences from a FASTA file by gene ID

Reposted. Author: 行者123. Updated: 2023-12-05 02:44:58

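I know this question has been asked a hundred times, but I have been working on it all day and cannot seem to solve it. I have a FASTA file that looks like this...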

>BGI_novel_T016697 Solyc03g033550.3.1
CTGACGTATACAATTAAGCCGCGAAAAATCTACTTTTTTTTTAATAGATATGAATTTCTTTTGTTTCGTATAATGAAGTATTTGTTCCAACAATGTTTAATTATTAGGCATTTGGAATGTGATGGGGCAACTAACAAAGAAGCCAATATCAACATCAATTAACAAACATATGATATAATTCTAGTGAAGTGAAAGCCAAGATATGAAACTCTCCACCCACACTATCTTAAATGATCTTTTTTAAAACATTCTAATTAGGTGATAACTAAAAGCAATAATTCTACCAATTTTGAAACAAACAATATGGTCCC
>BGI_novel_T016313 Solyc03g025570.2.1
TTCAAGTGTTAGTTTCACATCATCACGTTTGGACCTACGTTTCTATATTAGAACATATTCTAACTGATCTCTAGCTGTTATTCATGGGATTGTAAGAAATTTGTATCCCTCTCCGGATTTTACTTTGATCGCCACAAAATGAACATATGCTTTCAATTTTCTATGATGAAAAATCAGCCTCTCTCAATATTGGGTTTAAA
>BGI_novel_T018109 Solyc03g080075.1.1
GCAAGGGAAAGAAGTATTACTAGAGGAGATTTTCCCAACAGTTTTCATTTACACACATGGGTTAAGTATTCATAAATAAAAGAGAAAAATCTGTTTATAAGTTGGAGAGTAGTATAAATACAGGAGATTTTCCCAACAGTTTTCATTTATACACATGGGTTAAGTATTCTTAAATAGAAAATCGGAAGTATTATAAATTCTCACTCAAAGAAACCACGTTTGCTCATTTTCGTTATTCCCTTAAAAACATGGGAAGATGAAAGAAAAAAACTAACACATAAAAAGATTGTGAGTTTACTTATTCATGGAGAATTCCCCATTTAAGTTGACAATATTTTTCTATGGTCTTGAACGGCCAGAAAAGTTAATATCCACAACTATTTTCCACTCAATAAGTGTTCCGATACCGTTGAACTTTTTAATATTTTGCACGCCCTTCGTGAAATGTTTTACTCCGTTACTGTCGCGATAATGATGTTTAAAAT
>BGI_novel_T016817 BGI_novel_G001220
GCCCAAGTCATAGGTAGTGCCTGTGCGGGTTGACACTCAACATGTGACCGCCACCACATTTTGGCATTTCCCTGAAACTGATAGGTTACAAACTCAATGCCAAATCATTCCACTATGCCCATTTTATGTAGTAACTCATGACAATCAACCAGAAAATCGTAGGCATCCTCAGATTCAGCACCCTTGAAGACTGGAGGTTTCAATTTCAAGAACTTACTGAAAAATTCATGCTGATCACTTGTCATTATAGGCCCTGTAGTCAAACGAGGAAACGTGCCTATTTCCAATGAGGCATCCATG
>BGI_novel_T016141 Solyc03g007600.3.1

I want to retrieve the sequences that match the gene IDs listed in a .txt file:

Solyc00g256710.2.1
Solyc01g010890.3.1
Solyc01g056990.3.1
Solyc01g060050.2.1
Solyc01g081120.2.1
Solyc01g097740.3.1
Solyc01g098180.3.1
Solyc01g102320.1.1
Solyc01g106420.3.1
Solyc01g111580.3.1
Solyc01g111970.3.1
Solyc02g005530.2.1
Solyc02g031780.1.1
Solyc02g064595.1.1
Solyc02g081920.3.1
Solyc02g084220.3.1

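I have already tried samtools and FaSomeRecords, but neither produced any output. I suspect this is because each header also contains the transcript ID (which I can ignore). Do you have any suggestions? Let me know if you need more information. Cheers!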

Best answer

Use a Perl one-liner, grep, and seqtk subseq to extract the desired FASTA sequences:

# Create test input:

cat > in.fasta <<EOF
>BGI_novel_T016697 Solyc03g033550.3.1
CTGACGTATACAATTAAGCCGCG
>BGI_novel_T016313 Solyc03g025570.2.1
TTCAAGTGTTAGTTTCACATCAT
>BGI_novel_T018109 Solyc03g080075.1.1
GCAAGGGAAAGAAGTATTACTAG
>BGI_novel_T016817 BGI_novel_G001220
GCCCAAGTCATAGGTAGTGCCTG
>BGI_novel_T016141 Solyc03g007600.3.1
ACGTACGTACGTACGTACGTACG
EOF

cat > gene_ids.txt <<EOF
Solyc03g033550.3.1
Solyc03g080075.1.1
Solyc00g256710.2.1
Solyc01g010890.3.1
EOF

# Extract ids and gene ids into a tsv file:
perl -lne '@f = /^>(\S+)\s+(\S+)/ and print join "\t", @f;' in.fasta > ids_gene_ids.tsv

# Select ids that correspond to the desired gene ids
# (-F: match the IDs literally, so "." is not a regex wildcard; -w: whole-word matches only):
grep -F -w -f gene_ids.txt ids_gene_ids.tsv | cut -f1 > ids.selected.txt

# Extract the FASTA sequences that correspond to the desired gene ids:
seqtk subseq in.fasta ids.selected.txt > out.fasta

cat out.fasta

Output:

>BGI_novel_T016697 Solyc03g033550.3.1
CTGACGTATACAATTAAGCCGCG
>BGI_novel_T018109 Solyc03g080075.1.1
GCAAGGGAAAGAAGTATTACTAG
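
The output contains only two records because the other two requested gene IDs do not occur in in.fasta. When the ID list is long, it can help to list the IDs that matched nothing. The following is a minimal sketch of such a check, assuming the gene ID is always the second whitespace-separated field of the header, as in the sample. The file names (check.fasta, check_ids.txt, fasta_gene_ids.txt, missing_ids.txt) are hypothetical and chosen only for this illustration.

```shell
# Reduced, illustrative input (hypothetical file names).
cat > check.fasta <<EOF
>BGI_novel_T016697 Solyc03g033550.3.1
CTGACGTATACAATTAAGCCGCG
>BGI_novel_T018109 Solyc03g080075.1.1
GCAAGGGAAAGAAGTATTACTAG
EOF

cat > check_ids.txt <<EOF
Solyc03g033550.3.1
Solyc00g256710.2.1
EOF

# Collect the gene ID (second header field) of every record.
awk '/^>/ { print $2 }' check.fasta > fasta_gene_ids.txt

# Print the requested IDs that are absent from the FASTA headers:
# -F literal strings, -x whole-line match, -v invert, -f patterns from file.
grep -F -x -v -f fasta_gene_ids.txt check_ids.txt > missing_ids.txt

cat missing_ids.txt
```

Note that grep exits with a non-zero status when every requested ID is found (nothing is printed), which matters if the script runs under `set -e`.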

Note that seqtk can be installed, for example, with conda.
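
If seqtk is not available, the same selection can probably be done with awk alone. This is a sketch of an alternative technique, not part of the original answer. It assumes the gene ID is the second whitespace-separated field of each header, and it keeps all lines of a record, so wrapped (multi-line) sequences should also work. The demo file names are hypothetical.

```shell
# Hypothetical demo input: two records, the first wrapped over two lines.
cat > demo.fasta <<EOF
>BGI_novel_T016697 Solyc03g033550.3.1
CTGACGTATACA
ATTAAGCCGCG
>BGI_novel_T016313 Solyc03g025570.2.1
TTCAAGTGTTAGTTTCACATCAT
EOF

cat > demo_ids.txt <<EOF
Solyc03g033550.3.1
EOF

# First file (NR == FNR): remember each wanted gene ID, skipping blank lines.
# Second file: on a header line, decide whether field 2 (the gene ID) is wanted;
# then print every line (header and sequence) while "keep" is true.
awk 'NR == FNR { if ($1 != "") want[$1] = 1; next }
     /^>/      { keep = ($2 in want) }
     keep' demo_ids.txt demo.fasta > demo_out.fasta

cat demo_out.fasta
```

Because the IDs are looked up as exact array keys rather than as patterns, partial matches such as a shared prefix are not selected. One caveat: the `NR == FNR` idiom assumes the ID file is not empty.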

Regarding "linux - How to retrieve sequences from a FASTA file by gene ID", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/66197008/
