I have a data frame in R with three columns: Gene.ID, source, and value. I need to filter the rows based on multiple conditions, but I'm having trouble achieving the desired result. Here's a sample of my data:
My goal is to:
Keep rows with the same Gene.ID and source.
For rows with the same Gene.ID but different source, I want to keep them only if the value is different from the previous row.
I've tried various approaches using dplyr and custom loops, but I haven't been able to achieve the desired filtering logic.
Can someone provide a solution or suggest an efficient way to filter this data frame based on these conditions?
Thank you for your assistance!
df <- data.frame(
Gene.ID = c(
"NZ_JAHWGH010000001.1_15",
"NZ_JAHWGH010000001.1_17",
"NZ_JAHWGH010000001.1_68",
"NZ_JAHWGH010000001.1_7"
),
HMMER = c(
"SLH",
"GT2",
"GT2",
"GH13+CBM41+CBM41+GH13"
),
dbCAN_sub = c(
"",
"GT2",
"GT2",
"CBM41+GH13+CBM41+CBM41+CBM48+GH13"
),
DIAMOND = c(
"",
"",
"GT2",
"CBM41+CBM48+GH13+GH13+GH11"
),
stringsAsFactors = FALSE
)
My desired output is as follows:
df_output <- data.frame(
Gene.ID = c(
"NZ_JAHWGH010000001.1_15",
"NZ_JAHWGH010000001.1_17",
"NZ_JAHWGH010000001.1_68",
"NZ_JAHWGH010000001.1_7",
"NZ_JAHWGH010000001.1_7",
"NZ_JAHWGH010000001.1_7",
"NZ_JAHWGH010000001.1_7",
"NZ_JAHWGH010000001.1_7",
"NZ_JAHWGH010000001.1_7",
"NZ_JAHWGH010000001.1_7"
),
combined = c(
"SLH",
"GT2",
"GT2",
"CBM41",
"GH13",
"CBM41",
"CBM41",
"CBM48",
"GH13",
"GH11"
),
stringsAsFactors = FALSE
)
I tried the following command, but I didn't get the desired output:
df_output <- df %>%
separate_rows(., sep = "\\+") %>%
gather(key = "source", value = " ", -Gene.ID) %>%
filter(combined != "") %>%
distinct(Gene.ID, combined)
(1) Is separate_rows actually doing anything for you? For me, it's doing nothing, and will likely have problems if you use it correctly since the fields to be separated have different lengths of values to separate. (2) tidyr::gather was superseded years ago, I strongly suggest you learn to use tidyr::pivot_longer, it is far more powerful. (3) Perhaps you should pivot_longer (or gather) and then separate. (4) Why create a column with a name of " "? That seems like making future operations on it rather difficult.
(5) "previous row" is really fragile, since I don't know that reshaping is going to guarantee everything is in the order you expect.
Lastly, it might clear up a lot of questions if you included what the output should be here. (Providing it as a frame is incredibly helpful, even if you have to create it manually.)
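On the ordering concern in point (5), one way to avoid relying on implicit row order is to record each value's position explicitly while reshaping. The following is only a sketch: it assumes tidyr >= 1.3.0 (for separate_longer_delim), and the column names source, value and position are chosen here for illustration, not taken from the question's code.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  Gene.ID = c("NZ_JAHWGH010000001.1_15", "NZ_JAHWGH010000001.1_17",
              "NZ_JAHWGH010000001.1_68", "NZ_JAHWGH010000001.1_7"),
  HMMER = c("SLH", "GT2", "GT2", "GH13+CBM41+CBM41+GH13"),
  dbCAN_sub = c("", "GT2", "GT2", "CBM41+GH13+CBM41+CBM41+CBM48+GH13"),
  DIAMOND = c("", "", "GT2", "CBM41+CBM48+GH13+GH13+GH11"),
  stringsAsFactors = FALSE
)

long <- df %>%
  # One row per Gene.ID / source pair
  pivot_longer(-Gene.ID, names_to = "source", values_to = "value") %>%
  # Drop sources that reported nothing
  filter(value != "") %>%
  # One row per "+"-separated element
  separate_longer_delim(value, delim = "+") %>%
  # Record the element's position within its Gene.ID / source string
  group_by(Gene.ID, source) %>%
  mutate(position = row_number()) %>%
  ungroup()

long
```

With position stored as a column, "previous row" can be defined explicitly afterwards, for example with dplyr::lag(value, order_by = position) inside a grouped mutate or filter, instead of depending on whatever order the reshaping happens to produce.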
A working version following your approach, replacing some of the tidyr functions with more recent versions:
library(dplyr)
library(tidyr)

df %>%
  tidyr::separate_longer_delim(HMMER, delim = "+") %>%
  tidyr::separate_longer_delim(dbCAN_sub, delim = "+") %>%
  tidyr::separate_longer_delim(DIAMOND, delim = "+") %>%
  tidyr::pivot_longer(-Gene.ID, values_to = "combined") %>%
  dplyr::select(Gene.ID, combined) %>%
  dplyr::filter(combined != "") %>%
  dplyr::distinct()
# A tibble: 7 x 2
  Gene.ID                 combined
  <chr>                   <chr>
1 NZ_JAHWGH010000001.1_15 SLH
2 NZ_JAHWGH010000001.1_17 GT2
3 NZ_JAHWGH010000001.1_68 GT2
4 NZ_JAHWGH010000001.1_7  GH13
5 NZ_JAHWGH010000001.1_7  CBM41
6 NZ_JAHWGH010000001.1_7  CBM48
7 NZ_JAHWGH010000001.1_7  GH11
Note that this might create very long intermediate data frames if run on a larger initial data frame, as the separate_longer steps create many rows with redundant content which are only dropped again at the end (in this example 4 -> 359 -> 7 rows). There is probably a more efficient way to do this for large datasets.
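One possibly cheaper ordering, sketched here rather than benchmarked, is to pivot to long format first and split afterwards, so that each source string is split on its own and the three columns are never crossed against each other. This assumes tidyr >= 1.3.0 (for separate_longer_delim); the column name source is an illustrative choice.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  Gene.ID = c("NZ_JAHWGH010000001.1_15", "NZ_JAHWGH010000001.1_17",
              "NZ_JAHWGH010000001.1_68", "NZ_JAHWGH010000001.1_7"),
  HMMER = c("SLH", "GT2", "GT2", "GH13+CBM41+CBM41+GH13"),
  dbCAN_sub = c("", "GT2", "GT2", "CBM41+GH13+CBM41+CBM41+CBM48+GH13"),
  DIAMOND = c("", "", "GT2", "CBM41+CBM48+GH13+GH13+GH11"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Stack the three source columns into one "combined" column
  pivot_longer(-Gene.ID, names_to = "source", values_to = "combined") %>%
  # Drop empty entries before splitting
  filter(combined != "") %>%
  # Split each "+"-separated string into one row per element
  separate_longer_delim(combined, delim = "+") %>%
  # Keep each Gene.ID / value pair once
  distinct(Gene.ID, combined)

result
```

On the sample data this yields the same seven Gene.ID / combined pairs as the version above, while the largest intermediate table has 21 rows. Like that version, it uses distinct(), so repeated values within one gene are collapsed; it does not reproduce the duplicated entries shown in the question's desired output.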
@Umar I answered based on the code example and expected output you provided, but I've realized that the title and first paragraph of the question seem to describe a different data structure / question. If the answer is off topic I can delete it.
Thank you, but it didn't give the output I needed. I solved it using a loop. I was busy with my manuscript after this, so I couldn't see your reply.