
awk script to cut and paste a string within a file

Reposted. Author: 行者123. Updated: 2023-12-03 20:23:27

I have a file in the following format (each space below stands for a tab separator):

NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA blabla LASTTAG
I want to cut the :AATGT+GTGTA part and paste it at the end of the line, separated by a tab, to get:
NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA
Important clarification: the string to move is whatever follows the last ':' in the first column (the ':' is cut along with it), whatever its length (it can be AAAA, AAAA+GGGG, etc.).
I used the following awk script:
awk '/^@/ {print;next} {N=split($1,n,":"); print $0 "\tRX:Z:" n[N] ; sub("[:]"n[N],"") ; print $0}'
My problem is that the original line is still present, so I get this result:
NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA blabla LASTTAG
NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA
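Basically, I don't know how to redirect the awk result to a new file (or overwrite the original file). A bash script would also be a fine solution for me. Thanks for your help.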
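For reference, the script above contains two print statements, which is why each record comes out twice. It also builds a regex from the tag itself, so a "+" inside the tag is read as a quantifier rather than a literal character. The following is only a sketch of the same idea with a single print, removing the tag by position instead of by regex and redirecting the result to a new file with the shell's > operator. The file names input.txt and output.txt are hypothetical, and the sketch assumes every record name contains at least one ':'.

```shell
# Hypothetical sample input with one tab-separated record.
workdir=$(mktemp -d)
printf 'NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA\tblabla\tLASTTAG\n' > "$workdir/input.txt"

awk -F'\t' '/^@/ { print; next }       # header lines pass through unchanged
{
    N = split($1, n, ":")              # n[N] is the text after the last ":"
    tag = n[N]
    # Remove ":" plus the tag by position, not by regex,
    # so a "+" in the tag is not treated as a quantifier.
    pos = length($1) - length(tag)     # position of the last ":" in field 1
    line = substr($0, 1, pos - 1) substr($0, length($1) + 1)
    print line "\tRX:Z:" tag           # exactly one print per record
}' "$workdir/input.txt" > "$workdir/output.txt"   # > writes to a new file

cat "$workdir/output.txt"
```

awk itself only writes to standard output here; the redirection is done by the shell, so the input file is left untouched.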
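Edit: I forgot to mention that lines starting with @ must be excluded: the script should not be applied to them. (This is a BAM file of NGS data, and the header lines must not be changed.)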
The file looks like this:
@SQ     SN:chrY LN:59373566
@RG ID:1 PL:ILLUMINA PU:PU LB:001 SM:TeCoriell
@PG ID:MarkDuplicates VN:2.23.7 CL:MarkDuplicates BARCODE_TAG=RX DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam PN:MarkDuplicates
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa mem -C -M -t 4 -R @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz TeCoriell.R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.11
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.11
@PG ID:GATK PrintReads VN:3.8-1-0-gf15c1c3ef CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.11 CL:samtools sort -o TeCoriell.bwamem.bam -l 5 -T TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX -@ 4 TeCoriell.bwamem.compress.bam
@PG ID:samtools.3 PN:samtools PP:samtools.2 VN:1.11 CL:samtools view -h TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507:CACTC-CCGTC 371 chr1 10257 0 2H48M59H chr7 128036692 0 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4> SA:Z:chr7,128036692,+,76M33S,60,0; BC:Z:TGCCACCA+GAGCAGCC MC:Z:76M33H BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM MD:Z:36A5A5 PG:Z:MarkDuplicates RG:
Z:1 BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ NM:i:2 OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E AS:i:38 XS:i:38
NB551027:724:HTWHHAFXY:2:11110:2230:8695:AGTCT-AAAGT 163 chr1 15596 0 113M = 15596 113 CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/ BC:Z:TGGCACCA+GAGCAGCA MC:Z:113M BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ MD:Z:113 PG:Z:MarkDuplicates RG:Z:1 BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN NM:i:0 OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/ AS:i:113 XS:i:113
I should get this result:
@SQ     SN:chrY LN:59373566
@RG ID:1 PL:ILLUMINA PU:PU LB:001 SM:TeCoriell
@PG ID:MarkDuplicates VN:2.23.7 CL:MarkDuplicates BARCODE_TAG=RX DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam PN:MarkDuplicates
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa mem -C -M -t 4 -R @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz TeCoriell.R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.11
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.11
@PG ID:GATK PrintReads VN:3.8-1-0-gf15c1c3ef CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.11 CL:samtools sort -o TeCoriell.bwamem.bam -l 5 -T TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX -@ 4 TeCoriell.bwamem.compress.bam
@PG ID:samtools.3 PN:samtools PP:samtools.2 VN:1.11 CL:samtools view -h TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507 371 chr1 10257 0 2H48M59H chr7 128036692 0 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4> SA:Z:chr7,128036692,+,76M33S,60,0; BC:Z:TGCCACCA+GAGCAGCC MC:Z:76M33H BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM MD:Z:36A5A5 PG:Z:MarkDuplicates RG:
Z:1 BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ NM:i:2 OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E AS:i:38 XS:i:38 RX:Z:CACTC-CCGTC
NB551027:724:HTWHHAFXY:2:11110:2230:8695 163 chr1 15596 0 113M = 15596 113 CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/ BC:Z:TGGCACCA+GAGCAGCA MC:Z:113M BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ MD:Z:113 PG:Z:MarkDuplicates RG:Z:1 BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN NM:i:0 OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/ AS:i:113 XS:i:113 RX:Z:AGTCT-AAAGT

Best answer

$ awk '{x=$1; sub(/.*:/,"",x); sub(/:[^:\t]*\t/,"\t"); print $0 "\tRX:Z:" x}' file
NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA
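As written, the one-liner is applied to every line, so header lines starting with @ would be modified and would also receive an RX:Z: tag, which the question asks to avoid. The sketch below keeps the same two sub() calls, adds a rule that passes @ lines through unchanged, and replaces the original file through a temporary file. The file name reads.sam and the shortened sample records are assumptions made for illustration only.

```shell
# Hypothetical sample: one header line and one shortened read record.
work=$(mktemp -d)
file="$work/reads.sam"
printf '@SQ\tSN:chrY\tLN:59373566\n' > "$file"
printf 'NB551027:724:HTWHHAFXY:3:21602:20054:7507:CACTC-CCGTC\t371\tchr1\n' >> "$file"

# Header lines are printed as-is; other lines get the tag moved to the end.
# Output goes to a temporary file that replaces the original only if awk succeeds.
awk '/^@/ { print; next }
{
    x = $1                        # copy of the first field
    sub(/.*:/, "", x)             # keep only the text after the last ":"
    sub(/:[^:\t]*\t/, "\t")       # drop ":<tag>" just before the first tab
    print $0 "\tRX:Z:" x
}' "$file" > "$file.tmp" && mv "$file.tmp" "$file"

cat "$file"
```

Writing to the same file that awk is reading (awk ... file > file) would truncate the input before it is read, which is why the temporary file is used.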

This content comes from a similar question about an awk script for cutting/pasting strings in a file, found on Stack Overflow: https://stackoverflow.com/questions/66066371/
