linux - For each unique value in field 1, collapse the non-unique entries in another field

Reposted. Author: 塔克拉玛干. Updated: 2023-11-02 23:29:24

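I have a dataset that is the result of a left outer join of two datasets. Each entry from the first dataset now appears several times, once for every entry in the second dataset that it overlaps. Note that Assembly.1000 is repeated three times; I would like to collapse it into a single line.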

Assembly.1000 chrX 560000 575000 ABC1   20
Assembly.1000 chrX 560000 575000 IL15RA 3.2
Assembly.1000 chrX 560000 575000 BRCA1 20
Assembly.1038 chrX 780000 829000 . .
Assembly.1338 chrX 960000 999000 ACTIN 3800
Assembly.1338 chrX 960000 999000 ACTIN 4000

As you can see, the file 1 entry for Assembly.1000 is repeated three times, once for each matching file 2 entry (ABC1, IL15RA, BRCA1).

This is the output I would like to end up with:

Assembly.1000 chrX 560000 575000 ABC1;IL15RA;BRCA1   20;3.2;20
Assembly.1038 chrX 780000 829000 . .
Assembly.1338 chrX 960000 999000 ACTIN,ACTIN 3800;4000

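I could do this with a shell `while read` loop that compares each line with the previous entry, but for large files (~1e6 entries) that is simply not efficient enough. Does anyone have a suggestion for an efficient way to do this?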

Best answer

Assuming your data.frame is called "mydf" and is defined as follows:

mydf <- structure(list(V1 = c("Assembly.1000", "Assembly.1000", 
"Assembly.1000", "Assembly.1038", "Assembly.1338", "Assembly.1338"),
V2 = c("chrX", "chrX", "chrX", "chrX", "chrX", "chrX"),
V3 = c(560000L, 560000L, 560000L, 780000L, 960000L, 960000L),
V4 = c(575000L, 575000L, 575000L, 829000L, 999000L, 999000L),
V5 = c("ABC1", "IL15RA", "BRCA1", ".", "ACTIN", "ACTIN"),
V6 = c("20", "3.2", "20", ".", "3800", "4000")),
.Names = c("V1", "V2", "V3", "V4", "V5", "V6"),
class = "data.frame", row.names = c(NA, -6L))
mydf
#              V1   V2     V3     V4     V5   V6
# 1 Assembly.1000 chrX 560000 575000   ABC1   20
# 2 Assembly.1000 chrX 560000 575000 IL15RA  3.2
# 3 Assembly.1000 chrX 560000 575000  BRCA1   20
# 4 Assembly.1038 chrX 780000 829000      .    .
# 5 Assembly.1338 chrX 960000 999000  ACTIN 3800
# 6 Assembly.1338 chrX 960000 999000  ACTIN 4000

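Here is an approach using aggregate(), which groups on every column except V5 and V6 and pastes the values of those two columns together within each group: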

aggregate(cbind(V5, V6) ~ ., mydf, paste, collapse = "; ")
#              V1   V2     V3     V4                  V5          V6
# 1 Assembly.1000 chrX 560000 575000 ABC1; IL15RA; BRCA1 20; 3.2; 20
# 2 Assembly.1038 chrX 780000 829000                   .           .
# 3 Assembly.1338 chrX 960000 999000        ACTIN; ACTIN  3800; 4000
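
Because the question starts from a whitespace-delimited text file rather than from a data.frame, it may help to see the same idea wrapped in a read and write step. The following is only a minimal sketch under some assumptions: the file names `infile` and `outfile` are hypothetical (temporary files are used here so the example runs on its own), every column is read as character so that "3.2" and "." are kept verbatim, and the separator is ";" without a space to match the desired output. It also uses the data-frame method of aggregate() with an explicit `by` list instead of the formula interface.

```r
# Hypothetical input/output files; the sample rows from the question are
# written out first so the sketch is self-contained.
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(c(
  "Assembly.1000 chrX 560000 575000 ABC1 20",
  "Assembly.1000 chrX 560000 575000 IL15RA 3.2",
  "Assembly.1000 chrX 560000 575000 BRCA1 20",
  "Assembly.1038 chrX 780000 829000 . .",
  "Assembly.1338 chrX 960000 999000 ACTIN 3800",
  "Assembly.1338 chrX 960000 999000 ACTIN 4000"
), infile)

# Read all six columns as character; read.table names them V1..V6.
mydf <- read.table(infile, header = FALSE, colClasses = "character")

# Group on V1..V4 and paste the V5 and V6 values of each group together.
collapsed <- aggregate(mydf[c("V5", "V6")],
                       by = mydf[c("V1", "V2", "V3", "V4")],
                       FUN = paste, collapse = ";")

# Write a tab-separated file without quotes, row names, or a header line.
write.table(collapsed, outfile, quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = FALSE)
```

Reading V5 and V6 as character matters here: if they arrived as factors, pasting could operate on the integer codes rather than on the labels.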

Here is the "data.table" approach, using the same "mydf" as the starting point:

library(data.table)
DT <- data.table(mydf)
DT[, lapply(.SD, paste, collapse = "; "), by = c("V1", "V2", "V3", "V4")]
#               V1   V2     V3     V4                  V5          V6
# 1: Assembly.1000 chrX 560000 575000 ABC1; IL15RA; BRCA1 20; 3.2; 20
# 2: Assembly.1038 chrX 780000 829000                   .           .
# 3: Assembly.1338 chrX 960000 999000        ACTIN; ACTIN  3800; 4000
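
For a file of roughly 1e6 lines, as mentioned in the question, the data.table approach can also cover reading and writing through the package's fread() and fwrite() functions. The following is a hedged sketch rather than a tuned solution: the file names `infile` and `outfile` are hypothetical stand-ins (temporary files here), the input is assumed to be separated by single spaces, and all columns are read as character so the values are pasted exactly as they appear in the file.

```r
library(data.table)

# Hypothetical input/output files; the sample rows from the question are
# written out first so the sketch is self-contained.
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(c(
  "Assembly.1000 chrX 560000 575000 ABC1 20",
  "Assembly.1000 chrX 560000 575000 IL15RA 3.2",
  "Assembly.1000 chrX 560000 575000 BRCA1 20",
  "Assembly.1038 chrX 780000 829000 . .",
  "Assembly.1338 chrX 960000 999000 ACTIN 3800",
  "Assembly.1338 chrX 960000 999000 ACTIN 4000"
), infile)

# Read the file without a header; fread names the columns V1..V6.
DT <- fread(infile, sep = " ", header = FALSE, colClasses = "character")

# .SD holds the non-grouping columns (V5 and V6) of each group.
res <- DT[, lapply(.SD, paste, collapse = ";"), by = .(V1, V2, V3, V4)]

# Write a tab-separated file without quotes or a header line.
fwrite(res, outfile, sep = "\t", col.names = FALSE, quote = FALSE)
```

Grouping with `by` keeps the groups in order of first appearance, so the output lines follow the order of the input file.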

Regarding "linux - For each unique value in field 1, collapse the non-unique entries in another field", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19131166/
