
r - Subsetting matching values (complete rows) using intervals between two or more CSV files

Reposted. Author: 行者123. Updated: 2023-12-02 04:20:03

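I have two matrices (A and B). I am trying to subset the rows of B that match rows of A to within given interval values. For example:

Matrix A contains (I have more than 200 compounds):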

Name  Mass     RT    Area  ID
Asa   234.032  1.56  6755  Sd323
bda   164.041  4.48  5353  SD424
dsf   353.953  6.53  2535  SD422
fed   535.535  5.14  4542  SD424

Matrix B contains (it resembles the raw matrix, or a CSV with 5000 compounds):

Name  mass     RT    Area  chemID  pubID  score
csa   234.031  1.56  4354  frsg    gss    90
bda   164.041  4.78  4346  gsdg    gsf    80
dwf   432.035  9.84  4245  grhr    hfg    99
fsf   535.042  7.01  5353  heth    gww    90

Now I want to subset the matching compounds in matrix B using the intervals Mass ± 0.001 and RT ± 0.5, so that the final matrix looks like this:

Name  mass     RT    Area  chemID  pubID  score
csa   234.031  1.56  4354  frsg    gss    90
bda   164.041  4.78  4346  gsdg    gsf    80

I tried the following commands in R, but they did not work well. Any help is greatly appreciated.

# read_excel() comes from the readxl package
library(readxl)

# Read in the first table
fname = "A.csv"
df1 = read.csv(fname)
# Read in the second table
fname = "B.xlsx"
df2 = read_excel(fname, skip = 4)
# Create an empty data frame with the same columns as df2
new_df = setNames(data.frame(matrix(ncol = ncol(df2), nrow = 0)), colnames(df2))
# Set the thresholds for the mass and the retention time
m_ths = 1.e-3  # Mass threshold
rt_ths = 0.5   # Retention time threshold
# Loop over the row indices of the first data frame
for (i in seq_len(nrow(df1))) {
  # Get the mass and retention time of the current row
  m = df1$Mass[i]
  rt = df1$RT[i]
  # Get boolean vectors of rows within the second table that are within the
  # given tolerance of the current mass (m) and retention time (rt)
  m_cond = df2$Mass >= m - m_ths & df2$Mass <= m + m_ths
  rt_cond = df2$RT >= rt - rt_ths & df2$RT <= rt + rt_ths
  # Get the subset of rows in the second table that meet both conditions
  tmp_df = subset(df2, m_cond & rt_cond)
  if (nrow(tmp_df) > 0) {
    # If the subset is not empty, tag it with the df1 row index and append it
    tmp_df$mb_data_index = i
    new_df = rbind(new_df, tmp_df)
  }
}
write.csv(new_df, "commoncompounds.csv")

Best answer

Code:

library('data.table')
# join two data tables and get only the matching rows by Name
df3 <- setDT(df2)[df1, on = 'Name', nomatch = 0]
# subset based on conditions of Mass and RT
df3 <- df3[(round(abs(Mass - i.Mass), 3) <= 0.001) &
           (round(abs(RT - i.RT), 1) <= 0.5), ]
# remove columns of df1
df3[, `:=` (i.Mass = NULL, i.RT = NULL, i.Area = NULL, ID = NULL)]
df3
#    Name    Mass   RT Area chemID pubID score
# 1:  Asa 234.031 1.56 4354   frsg   gss    90
# 2:  bda 164.041 4.78 4346   gsdg   gsf    80
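The answer rounds the absolute differences before comparing them with the tolerances. A minimal base R sketch of why this may matter is given below: a difference that is mathematically equal to the tolerance can land on either side of it in binary floating point, so the raw comparison is printed rather than assumed. The variable names and the `1e-9` slack are illustrative assumptions and do not come from the original post.

```r
m_tol <- 0.001
d <- abs(234.031 - 234.032)  # mathematically equal to the tolerance

# The raw comparison depends on binary rounding, so its result is not guaranteed.
print(sprintf("%.20f", d))
print(d <= m_tol)

# Rounding to the precision of the data, as the answer does, gives a stable result.
rounded_ok <- round(d, 3) <= m_tol

# Alternative (assumption): keep the raw difference but allow a tiny slack.
slack_ok <- d <= m_tol + 1e-9

# Caveat: rounding RT differences to 1 decimal place slightly widens the window,
# because a difference of 0.54 rounds to 0.5 and therefore passes.
widened <- round(0.54, 1) <= 0.5
```

If the widened RT window is not wanted, the slack-based comparison keeps the interval closer to the stated ± 0.5.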

Data:

df1 <- read.table(text = 
'Name Mass RT Area ID
Asa 234.032 1.56 6755 Sd323
bda 164.041 4.48 5353 SD424
dsf 353.953 6.53 2535 SD422
fed 535.535 5.14 4542 SD424', header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = 'Name Mass RT Area chemID pubID score
Asa 234.031 1.56 4354 frsg gss 90
bda 164.041 4.78 4346 gsdg gsf 80
dwf 432.035 9.84 4245 grhr hfg 99
fsf 535.042 7.01 5353 heth gww 90', header = TRUE, stringsAsFactors = FALSE)
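The answer joins on Name first, whereas in the question's own example the first row of B is named "csa" and is still expected to match "Asa" by Mass and RT. As an alternative sketch, assuming names should be ignored altogether, the base R code below compares every row of B with every row of A using `outer()`. The data is the question's example; `m_tol`, `rt_tol`, `eps`, `hit`, `pairs` and `matched` are illustrative names, and `eps` is an assumed slack for floating-point error on boundary cases.

```r
# Example data from the question; the first row of B is named "csa" there.
df1 <- read.table(text = "Name Mass RT Area ID
Asa 234.032 1.56 6755 Sd323
bda 164.041 4.48 5353 SD424
dsf 353.953 6.53 2535 SD422
fed 535.535 5.14 4542 SD424", header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "Name Mass RT Area chemID pubID score
csa 234.031 1.56 4354 frsg gss 90
bda 164.041 4.78 4346 gsdg gsf 80
dwf 432.035 9.84 4245 grhr hfg 99
fsf 535.042 7.01 5353 heth gww 90", header = TRUE, stringsAsFactors = FALSE)

m_tol  <- 0.001  # mass tolerance from the question
rt_tol <- 0.5    # retention time tolerance from the question
eps    <- 1e-9   # assumed slack so boundary cases survive floating-point error

# Pairwise absolute differences: rows index df2, columns index df1.
mass_ok <- abs(outer(df2$Mass, df1$Mass, "-")) <= m_tol + eps
rt_ok   <- abs(outer(df2$RT,   df1$RT,   "-")) <= rt_tol + eps
hit     <- mass_ok & rt_ok

# (row, col) pairs: which df2 row matched which df1 row.
pairs <- which(hit, arr.ind = TRUE)

# Complete rows of df2 that match at least one row of df1.
matched <- df2[rowSums(hit) > 0, ]
print(matched)
```

With about 5000 rows in B and 200 in A, each logical matrix holds roughly a million entries, which should fit comfortably in memory; for much larger tables a join-based approach would likely scale better.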

Regarding r - Subsetting matching values (complete rows) using intervals between two or more CSV files, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/60936722/
