
r - Finding partial overlap between sequences in two matrices

Reposted. Author: 行者123. Updated: 2023-12-01 13:48:14

I'll start with a reproducible example, which is a subset of my real data:

Data file 1:

> dput(exp_data)
structure(c("ACLVDGSYHDVDSSVLAFQLAAR", "AELNQVVR", "AFEPGLLAK",
"AFSVFLFNSK", "AFYEFQQR", "AGEPLYVLLCCWVAAVGAGLLK", "AIKDFPHR",
"AIRIPVVR", "AIVWSGEELGAK", "ALAALQGR", "ALEGIYACCFR", "ANLSSVQIDR",
"ANLSSVQIDRELK", "ASYTMQLAK", "ATRVEEGGEEENVMAK", "AVELVILPR",
"AVPLKDYR", "CLAAIEGR", "DIVSEHPER", "DLVDFAEFR", "DLVDFAEFRK",
"DMIVTNLGAKPLVLQIPIGAEDVFK", "DQSDREVDVTQNR", "DQVSIIPFR", "DQVSIIPFRGDAAEVLLPPSR",
"DQVTAEDVGIVIPNCLR", "DRVTPDDVATVIPNCLR", "DSILQSIHEPELISAFDTGGAELLYEIR",
"DSLVQSGAKPELIAAFDTNGAELLYEIR", "DTITGETLSDPENPVVLER", "EDGVMTAELLQR",
"EGISISHPAR", "EIGGIAISGR", "EILVQHLLVK", "ELHGESEEERVKEEEIK",
"23", " 8", " 9", "10", " 8", "22", " 8", " 8", "12", " 8", "11",
"10", "13", " 9", "16", " 9", " 8", " 8", " 9", " 9", "10", "25",
"13", " 9", "21", "17", "17", "28", "28", "19", "12", "10", "10",
"10", "17"), .Dim = c(35L, 2L), .Dimnames = list(c("14037", "24071",
"27989", "31522", "32851", "35458", "49646", "52332", "54727",
"57052", "61034", "82744", "82797", "104573", "110271", "115602",
"121061", "133577", "163666", "175488", "175522", "177867", "183262",
"183690", "183703", "183742", "184949", "186146", "186828", "193019",
"213233", "222624", "232405", "233822", "244244"), c("Sequence",
"Length")))

Data file 2:

> dput(exp_sel)
structure(c(" 49", " 80", " 45", " 61", " 40", " 45", "107",
" 75", " 40", " 60", " 43", " 57", " 80", " 51", " 55", " 39",
"MAMTPVASSSPVSTCRLFRCNLLPDLLPKPLFLSLPKRNRIASCRFTVR", "MAADALRISSSSSGSLVCNLNGSQRRPVLLPLSHRATFLGLPPRASSSSISSSIPQFLGTSRIGLGSSKLSQKKKQFSVF",
"MSASSLFNLPLIRLRSLALSSSFSSFRFAHRPLSSISPRKLPNFR", "MFSLKSLISSPFTQSTTHGLFTNPITRPVNPLPRTVSFTVTASMIPKRSSANMIPKNPPAR",
"MQICQTKLNFTFPNPTNPNFCKPKALQWSPPRRISLLPCR", "MVVVTHISTSFHQISPSFFHLRLRNPSTTSSSRPKLDGGFALSIR",
"MASSSSMQMVHTSRSIAQIGFGVKSQLVSANRTTQSVCFGARSSGIALSSRLHYASPIKQFSGVYATTKHQRTACVKSM",
"MELSLLRPTTQSLLPSFSKPNLRLAELNQVVRLRC", "MASSSLPLSLPFPLRSLTSTTRSLPFQCSPLFFSIPSSIV",
"MASLLGTSSSAIWASPSLSSPSSKPSSSPICFRPGKLFGSKLNAGIQIRPKKNRSRYHVS",
"MALQAADLVDFAEFRRKDAKLNASSSSFKDSSLFGASITDQIKSEHGSSSLRFKREQSLRNLAIRA",
"MELSLSTSSASPAVLRRQASPLLHKQQVLGVSFASALKPASYTMQLAKSRRPLPRPITC",
"MFRVTGTLSAASSPAVAAASFSAALRLSITPTLAIASPPHLRWFSKFSRQFLGGRISSLRPRIPSPCPIRLSGFPALKMRA",
"MLSLTATTLSSSIFTQSKTHGFFNTRPVYRKPFTTITSALIPASNRQAPPK", "MASLLGRSPSSILTCPRISSPSSTSSMSHLCFGPEKLSGRIQFNPKKNRSRYHVS",
"MAVSPHISPTLSRYKFFSTSVVENPNFSPYRIYSRRRVT"), .Dim = c(16L, 2L), .Dimnames = list(
c("2", "6", "10", "11", "14", "15", "16", "17", "20", "21",
"22", "23", "24", "25", "26", "27"), c("Length", "Sequence"
)))

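For each row of data file 1 (exp_data), I want to take the value in the Sequence column and check whether that exact string can be found in any row of the Sequence column of data file 2 (exp_sel). The catch is that the sequences are not identical: a sequence from data file 1 is only expected to appear as a partial overlap, that is, as a substring of a longer sequence in data file 2.

Example output:

Sequence from data file 1: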

AFYEFQQR

Sequence from data file 2:

MAMTPVASSSPV AFYEFQQR NLLPDLLPKPLFLSLPKRNRIASCRFTVR

There is a match, so this row should be kept in exp_data. If a sequence has no match, the row should be removed.
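The criterion described above amounts to a plain substring test. As a minimal sketch for a single pair, using one sequence from each of the dput outputs above (the variable names short_seq, long_seq, is_contained and not_contained are hypothetical, chosen only for illustration):

```r
# One sequence from exp_data (row "24071") and one from exp_sel (row "17").
short_seq <- "AELNQVVR"
long_seq  <- "MELSLLRPTTQSLLPSFSKPNLRLAELNQVVRLRC"

# fixed = TRUE treats short_seq as a literal string rather than a regular expression.
is_contained <- grepl(short_seq, long_seq, fixed = TRUE)

# A sequence that does not occur inside long_seq gives FALSE.
not_contained <- grepl("AFEPGLLAK", long_seq, fixed = TRUE)

print(c(is_contained, not_contained))
```

Rows of exp_data whose sequence behaves like short_seq for at least one exp_sel sequence would be kept; the rest would be dropped.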

Best answer

You could do it like this...

exp_data[sapply(exp_data[,1], function(x) any(grepl(x, exp_sel[,2]))), ]

       Sequence    Length
24071  "AELNQVVR"  " 8"
104573 "ASYTMQLAK" " 9"
175488 "DLVDFAEFR" " 9"

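The sapply call produces a logical vector with one entry per row of exp_data; an entry is TRUE if any of the exp_sel sequences contains the corresponding exp_data sequence. Indexing exp_data with that vector keeps only the matching rows.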
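The same idea can be tried on a small, self-contained example. The sketch below is a variation on the answer, not the answer's exact code: it assumes the sequences should be matched as literal strings (hence fixed = TRUE), uses vapply instead of sapply so the result is guaranteed to be logical, and adds drop = FALSE so a single surviving row stays a matrix. The objects exp_data_small, exp_sel_small, keep and filtered are hypothetical names; their values are copied from the dput outputs above.

```r
# Small matrices with the same layout as exp_data and exp_sel.
exp_data_small <- matrix(
  c("AELNQVVR", "AFEPGLLAK", "DLVDFAEFR", "DLVDFAEFRK",
    " 8", " 9", " 9", "10"),
  ncol = 2,
  dimnames = list(c("24071", "27989", "175488", "175522"),
                  c("Sequence", "Length"))
)
exp_sel_small <- matrix(
  c(" 75", " 43",
    "MELSLLRPTTQSLLPSFSKPNLRLAELNQVVRLRC",
    "MALQAADLVDFAEFRRKDAKLNASSSSFKDSSLFGASITDQIKSEHGSSSLRFKREQSLRNLAIRA"),
  ncol = 2,
  dimnames = list(c("17", "22"), c("Length", "Sequence"))
)

# For each short sequence, test whether it occurs inside any long sequence.
keep <- vapply(
  exp_data_small[, "Sequence"],
  function(x) any(grepl(x, exp_sel_small[, "Sequence"], fixed = TRUE)),
  logical(1)
)

# Keep only the rows with at least one match; drop = FALSE preserves the matrix shape.
filtered <- exp_data_small[keep, , drop = FALSE]
print(filtered)
```

Here only rows "24071" and "175488" remain: "AFEPGLLAK" occurs in neither long sequence, and "DLVDFAEFRK" fails because the long sequence continues with "RK" after "DLVDFAEFR".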

Regarding r - Finding partial overlap between sequences in two matrices, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50238749/
