
SQL query to extract values from one column where another column's value is the same

Reposted. Author: 行者123. Updated: 2023-11-29 14:21:46

I have a dataset

paper_author:

paper_id author_id
1 521630
1 972575
1 1528710
1 1611750
2 1682088
2 1589667
2 972575
3 521630
3 1589667

I need to get the co-authors of every author, i.e. the authors who wrote the same paper as a given author, grouped by author:

author_id  co_authors
521630 972575,1528710,1611750,1589667
972575 521630,1528710,1611750,1589667
.......

In the first row, for author_id=521630, author_ids 972575, 1528710 and 1611750 are co-authors on paper 1, and author_id=1589667 is a co-author on paper 3. I wrote this R code:

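co_author_id <- character(nrow(paper_author))
# loop over rows; length(paper_author) would only count the columns
for (i in seq_len(nrow(paper_author)))
{
  # author on the current row
  author_id_data <- paper_author[i, 2]
  # rows for that author, and the papers those rows belong to
  index1 <- which(paper_author$author_id %in% author_id_data)
  paper_ids <- paper_author$paper_id[index1]
  # every author listed on those papers (this still includes the author)
  index2 <- which(paper_author$paper_id %in% paper_ids)
  co_authors <- paper_author$author_id[index2]
  co_author_id[i] <- paste(co_authors, collapse = " ")
}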

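But this is very inefficient, because the data has 12 million rows, so doing it in SQL would be easier and better. How can this be done in SQL?

Thanks

Best answer

The question has an R tag, so I assume an R solution is wanted: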

sqldf

library(sqldf)
nr <- nrow(paper_author)
paper_author$seqno <- ave(1:nr, paper_author$paper_id, FUN = seq_along)


sqldf(c("create index i2 on paper_author(paper_id, seqno)",
"select author_id, group_concat(coauthor) co_authors
from (
select distinct A.author_id, C.author_id coauthor
from ( select * from main.paper_author where seqno = 1) A
left join (select * from main.paper_author where seqno > 1) C
using (paper_id)
) group by author_id"))
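
The query above starts from the first-listed author of each paper (seqno = 1). If co-authors are wanted for every author, a plain self-join on paper_id may be enough. This is only a sketch, assuming sqldf with its default SQLite backend; the aliases a and c and the result name co are illustrative and not part of the original answer:

```r
library(sqldf)

# Sample data from the question.
paper_author <- data.frame(
  paper_id  = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  author_id = c(521630L, 972575L, 1528710L, 1611750L, 1682088L,
                1589667L, 972575L, 521630L, 1589667L)
)

# Join the table to itself on paper_id, skip pairs of an author with
# themselves, de-duplicate, then collapse the co-authors per author.
co <- sqldf("
  select author_id, group_concat(coauthor) as co_authors
  from (
    select distinct a.author_id, c.author_id as coauthor
    from paper_author a
    join paper_author c
      on a.paper_id = c.paper_id and a.author_id <> c.author_id
  )
  group by author_id")
```

SQLite does not guarantee the order of values inside group_concat, so the order within each co_authors string should not be relied on. On 12 million rows the self-join can become large when papers have many authors, so it is worth testing on a subset first.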

data.table

library(data.table)
dt <- data.table(paper_author, key = "paper_id")
dt[, seqno:=1:.N, by = paper_id]
m <- merge(dt[seqno == 1], dt[seqno > 1], all.x = TRUE, by = "paper_id")
unique(m[, list(author_id.x, author_id.y)])[,
list(co_authors = toString(author_id.y)), by = author_id.x]

dplyr

library(dplyr)
# %>% replaces the old %.% operator, which current dplyr no longer provides
gp <- paper_author %>% group_by(paper_id)
gp %>%
  filter(row_number() == 1) %>%
  left_join(gp %>% filter(row_number() > 1), by = "paper_id") %>%
  ungroup() %>%
  select(author_id.x, author_id.y) %>%
  unique() %>%
  group_by(author_id.x) %>%
  summarise(co_authors = toString(author_id.y))

Base R

nr <- nrow(paper_author)
seqno <- ave(1:nr, paper_author$paper_id, FUN = seq_along)
m <- merge(paper_author[seqno == 1, ],
paper_author[seqno > 1, ], all.x = TRUE, by = "paper_id")
u <- unique(m[c("author_id.x", "author_id.y")])
aggregate(list(co_authors = u$author_id.y), list(author = u$author_id.x), toString)
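
For comparison, here is a base R sketch that lists co-authors for every author rather than only the first-listed one on each paper. It merges the table with itself on paper_id; the names pairs and co are illustrative, and sorting the ids is just a choice to make the output deterministic:

```r
# Sample data from the question.
paper_author <- data.frame(
  paper_id  = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  author_id = c(521630L, 972575L, 1528710L, 1611750L, 1682088L,
                1589667L, 972575L, 521630L, 1589667L)
)

# Self-join on paper_id: one row per pair of authors sharing a paper.
pairs <- merge(paper_author, paper_author, by = "paper_id")

# Drop pairs of an author with themselves, then duplicate pairs that
# arise when two authors share more than one paper.
pairs <- pairs[pairs$author_id.x != pairs$author_id.y, ]
pairs <- unique(pairs[c("author_id.x", "author_id.y")])

# One row per author with a comma-separated, sorted co-author list.
co <- aggregate(list(co_authors = pairs$author_id.y),
                list(author_id = pairs$author_id.x),
                function(x) paste(sort(x), collapse = ","))
```

With the sample data this gives six rows, one per distinct author. The intermediate pairs table grows with the square of the number of authors per paper, so memory use should be checked before running it on the full data.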

To try the above, use this data:

paper_author <-
structure(list(paper_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L
), author_id = c(521630L, 972575L, 1528710L, 1611750L, 1682088L,
1589667L, 972575L, 521630L, 1589667L), seqno = c(1L, 2L, 3L,
4L, 1L, 2L, 3L, 1L, 2L)), .Names = c("paper_id", "author_id",
"seqno"), row.names = c(NA, -9L), class = "data.frame")

Revised so that the authors are unique in the output.

This question and its answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/22780883/
