
regex - R: matching between two comma-separated strings

Reposted. Author: 行者123. Updated: 2023-12-04 13:48:55

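I am trying to find an elegant way to find matches between the following two character columns in a data frame. The complication is that either string can contain a comma-separated list, and if any member of one list matches any member of the other list, the whole entry counts as a match. I am not sure how well I have explained this, so here are sample data and the expected output: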

Alt1:

  • AT
  • A
  • G
  • CGTCC,AT
  • CGC

Alt2:

  • AA
  • A
  • GG
  • AT,GGT
  • CG

Expected match per row:

  • Row 1 = none
  • Row 2 = A
  • Row 3 = none
  • Row 4 = AT
  • Row 5 = none
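
The rule described above (two entries match when their comma-separated lists share at least one member) can be sketched in base R as a split followed by a set intersection. This is only an illustration, not the approach used later in this post: `shared_alleles` is a hypothetical helper name, and the two vectors are five-row example data chosen to be consistent with the expected matches.

```r
# Hypothetical helper: return the members shared by two comma-separated strings.
shared_alleles <- function(a, b) {
  a_parts <- strsplit(a, ",", fixed = TRUE)[[1]]
  b_parts <- strsplit(b, ",", fixed = TRUE)[[1]]
  intersect(a_parts, b_parts)
}

# Example data (assumed to mirror the two Alt columns).
alt1 <- c("AT", "A", "G", "CGTCC,AT", "CGC")
alt2 <- c("AA", "A", "GG", "AT,GGT", "CG")

# Compare the columns row by row; the result is a list with one element per row.
# unname() drops the names that Map() derives from the first character vector.
matches <- unname(Map(shared_alleles, alt1, alt2))
```

An empty element (`character(0)`) in `matches` corresponds to "none" in the expected output.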

Non-working solutions:

    First attempt: merge the whole data frames on the required columns, then match the Alt columns shown above:
    match1 = data.frame(merge(vcf.df, ref.df, by=c("chr", "start", "end",  "ref")))
    matches = unique(match1[unlist(sapply(match1$Alt1, grep, match1$Alt2, fixed=TRUE)),])
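
One pitfall with a grep-style comparison on this kind of data, whether or not it is what broke the attempt above, is that `grep`/`grepl` with `fixed = TRUE` search for a substring of the whole entry rather than comparing list members. The sketch below uses made-up strings to show both a false positive and a false negative, and how comparing the split members avoids them.

```r
# Substring search treats "A" as present in "AA", although the two values differ.
false_positive <- grepl("A", "AA", fixed = TRUE)

# The whole string "CGTCC,AT" never occurs inside "AT,GGT", so the shared "AT" is missed.
false_negative <- grepl("CGTCC,AT", "AT,GGT", fixed = TRUE)

# Comparing the split members instead gives the intended answers.
exact_a  <- "A" %in% strsplit("AA", ",", fixed = TRUE)[[1]]
exact_at <- any(strsplit("CGTCC,AT", ",", fixed = TRUE)[[1]] %in%
                strsplit("AT,GGT", ",", fixed = TRUE)[[1]])
```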

    Second approach, using the findOverlaps function from VariantAnnotation/GRanges:
    findOverlaps(ranges(vcf1), ranges(vcf2))

    Any suggestions would be greatly appreciated! Thanks!

    Solution
    Thanks to @Marat Talipov's answer below, the following solution compares two comma-separated strings:
    > ##read in edited kaviar vcf and human ref
    > ref <- readVcfAsVRanges("ref.vcf.gz", humie_ref)
    Warning message:
    In .vcf_usertag(map, tag, ...) :
    ScanVcfParam ‘geno’ fields not present: ‘AD’

    > ##rename chromosomes to match with vcf files
    > ref <- renameSeqlevels(ref, c("1"="chr1"))

    > ##################################
    > ## Gather VCF files to process ##
    > ##################################
    > ##data frame *.vcf.gz files in directory path
    > vcf_path <- data.frame(path=list.files(vcf_dir, pattern="*.vcf.gz$", full=TRUE))

    > ##read in everything but sample data for speediness
    > vcf_param = ScanVcfParam(samples=NA)
    > vcf <- readVcfAsVRanges("test.vcf.gz", humie_ref, param=vcf_param)

    > #################
    > ## Match SNP's ##
    > #################
    > ##create data frames of info to match on
    > vcf.df = data.frame(chr =as.character(seqnames(vcf)), start = start(vcf), end = end(vcf), ref = as.character(ref(vcf)),
    + alt=alt(vcf), stringsAsFactors=FALSE)
    > ref.df = data.frame(chr =as.character(seqnames(ref)), start = start(ref), end = end(ref),
    + ref = as.character(ref(ref)), alt=alt(ref), stringsAsFactors=FALSE)
    >
    > ##merge based on all positional fields except vcf
    > col_match = data.frame(merge(vcf.df, ref.df, by=c("chr", "start", "end", "ref")))

    > library(stringi)
    > ##split each alt column by comma and bind together
    > M1 <- stri_list2matrix(sapply(col_match$alt.x,strsplit,','))
    > M2 <- stri_list2matrix(sapply(col_match$alt.y,strsplit,','))
    > M <- rbind(M1,M2)

    > ##compare results
    > result <- apply(M,2,function(z) unique(na.omit(z[duplicated(z)])))

    > ##add results column to col_match df for checking/subsetting
    > col_match$match = result
    > head(col_match)
       chr    start      end ref alt.x alt.y match
    1 chr1 39998059 39998059   A     G     G     G
    2 chr1 39998059 39998059   A     G     G     G
    3 chr1 39998084 39998084   C     A     A     A
    4 chr1 39998084 39998084   C     A     A     A
    5 chr1 39998085 39998085   G     A     A     A
    6 chr1 39998085 39998085   G     A     A     A
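
In the output above every row happens to share exactly one member, so the match column prints as a plain vector. When rows differ in how many members they share, one way to keep the column easy to print and subset is to collapse each row's matches into a single string. The following is a minimal base-R sketch under that assumption; `vcf_demo`, `ref_demo`, `col_match_demo`, and their values are made up for illustration and are not the data from this post.

```r
# Hypothetical stand-ins for vcf.df and ref.df; all values are invented.
vcf_demo <- data.frame(chr = "chr1", start = c(100L, 200L, 300L),
                       end = c(100L, 200L, 300L), ref = c("A", "C", "G"),
                       alt = c("G", "CGTCC,AT", "AT"), stringsAsFactors = FALSE)
ref_demo <- data.frame(chr = "chr1", start = c(100L, 200L, 300L),
                       end = c(100L, 200L, 300L), ref = c("A", "C", "G"),
                       alt = c("G", "AT,GGT", "AA"), stringsAsFactors = FALSE)

# Merging on the positional columns yields alt.x (first table) and alt.y (second table).
col_match_demo <- merge(vcf_demo, ref_demo, by = c("chr", "start", "end", "ref"))

# Split both alt columns into lists of members.
x_parts <- strsplit(col_match_demo$alt.x, ",", fixed = TRUE)
y_parts <- strsplit(col_match_demo$alt.y, ",", fixed = TRUE)

# Collapse the shared members of each row into one string ("" when nothing is shared).
col_match_demo$match <- vapply(
  seq_along(x_parts),
  function(i) paste(intersect(x_parts[[i]], y_parts[[i]]), collapse = ","),
  character(1)
)

# Keep only the rows with at least one shared member.
matched_rows <- col_match_demo[nzchar(col_match_demo$match), ]
```

Using `vapply` with `character(1)` guarantees one string per row, so the new column is always an ordinary character vector.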

    Best answer

    If the input lists have equal length and you want to compare their elements pairwise, you can use the following solution:

    library(stringi)

    M1 <- stri_list2matrix(sapply(Alt1,strsplit,','))
    M2 <- stri_list2matrix(sapply(Alt2,strsplit,','))
    M <- rbind(M1,M2)

    result <- apply(M,2,function(z) unique(na.omit(z[duplicated(z)])))

    Sample input:
    Alt1 <- list('AT','A','G','CGTCC,AT','CGC','GG,CC')
    Alt2 <- list('AA','A','GG','AT,GGT','CG','GG,CC')

    Output:
    # [[1]]
    # character(0)
    #
    # [[2]]
    # [1] "A"
    #
    # [[3]]
    # character(0)
    #
    # [[4]]
    # [1] "AT"
    #
    # [[5]]
    # character(0)
    #
    # [[6]]
    # [1] "GG" "CC"
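
A note on the `duplicated()` step: it flags any value that occurs more than once in the stacked column, so a member repeated within a single input string would also be reported even if the other string does not contain it. Whether that can occur depends on the data. The sketch below uses a made-up pair, reproduces the stacking logic with plain vectors instead of `stri_list2matrix`, and contrasts it with a set intersection.

```r
# Made-up inputs: "A" is repeated within the first string and absent from the second.
a_parts <- strsplit("A,A", ",", fixed = TRUE)[[1]]
b_parts <- strsplit("G", ",", fixed = TRUE)[[1]]

# Stack both sides and look for duplicates, as the answer's apply() step does per column.
z <- c(a_parts, b_parts)
via_duplicated <- unique(z[duplicated(z)])

# A set intersection only reports members present on both sides.
via_intersect <- intersect(a_parts, b_parts)
```

Here `via_duplicated` contains "A" while `via_intersect` is empty, so the two approaches agree only when no string repeats one of its own members.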

    For more on regex - R: matching between two comma-separated strings, see the similar question on Stack Overflow: https://stackoverflow.com/questions/28590469/
