
Rearranging a mixed input data frame

Reposted. Author: 行者123. Updated: 2023-12-04 10:07:41

I have an inconsistent input data frame. Here it is:

df <- structure(list(Gene = c("k141_1305_1", "k141_1406_2", "k141_1406_3", 
"k141_6669_1", "k141_9215_1", "k141_13242_1", "k141_13333_5",
"k141_17708_1", "k141_19670_1", "k141_19670_6"), Phylum = c("p__Actinobacteria",
"p__Firmicutes", "p__Firmicutes", "p__Cyanobacteria", "p__Actinobacteria",
"p__Actinobacteria", "p__Firmicutes", "p__Firmicutes", "p__Actinobacteria",
"p__Proteobacteria"), Class = c("c__Actinobacteria", "c__Clostridia",
"c__Clostridia", "o__Nostocales", "c__Actinobacteria", "c__Actinobacteria",
"c__Clostridia", "c__Bacilli", "c__Actinobacteria", "c__Gammaproteobacteria"
), Order = c("o__Pseudonocardiales", "o__Clostridiales", "o__Clostridiales",
"f__Hapalosiphonaceae", "o__Pseudonocardiales", "o__Pseudonocardiales",
"o__Clostridiales", "o__Bacillales", "o__Pseudonocardiales",
"o__Pseudomonadales"), Family = c("f__Pseudonocardiaceae", "f__Lachnospiraceae",
"f__Lachnospiraceae", "g__Fischerella", "f__Pseudonocardiaceae",
"f__Pseudonocardiaceae", "f__Clostridiales Family XIII. Incertae Sedis",
"g__Exiguobacterium", "f__Pseudonocardiaceae", "f__Pseudomonadaceae"
), Genus = c("g__Pseudonocardia", "s__Lachnospiraceae bacterium 10-1",
"s__Lachnospiraceae bacterium 10-1", "s__Fischerella muscicola",
"g__Pseudonocardia", "g__Pseudonocardia", "s__[Eubacterium] infirmum",
"s__Exiguobacterium enclense", "g__Pseudonocardia", "g__Pseudomonas"
), Species = c("s__Pseudonocardia sp. Ae331_Ps2", "unknown",
"unknown", "unknown", "s__Pseudonocardia sp. Ae331_Ps2", "s__Pseudonocardia sp. Ae331_Ps2",
"unknown", "unknown", "s__Pseudonocardia ammonioxydans", "s__Pseudomonas aeruginosa group"
)), .Names = c("Gene", "Phylum", "Class", "Order", "Family",
"Genus", "Species"), row.names = c(3212L, 3853L, 3854L, 17967L,
24006L, 34126L, 34325L, 43722L, 49328L, 49332L), class = "data.frame")

The data frame looks like this:

       Gene          Phylum             Class              Order                 Family
3212   k141_1305_1   p__Actinobacteria  c__Actinobacteria  o__Pseudonocardiales  f__Pseudonocardiaceae
3853   k141_1406_2   p__Firmicutes      c__Clostridia      o__Clostridiales      f__Lachnospiraceae
3854   k141_1406_3   p__Firmicutes      c__Clostridia      o__Clostridiales      f__Lachnospiraceae
17967  k141_6669_1   p__Cyanobacteria   o__Nostocales      f__Hapalosiphonaceae  g__Fischerella
24006  k141_9215_1   p__Actinobacteria  c__Actinobacteria  o__Pseudonocardiales  f__Pseudonocardiaceae
34126  k141_13242_1  p__Actinobacteria  c__Actinobacteria  o__Pseudonocardiales  f__Pseudonocardiaceae
       Genus                              Species
3212   g__Pseudonocardia                  s__Pseudonocardia sp. Ae331_Ps2
3853   s__Lachnospiraceae bacterium 10-1  unknown
3854   s__Lachnospiraceae bacterium 10-1  unknown
17967  s__Fischerella muscicola           unknown
24006  g__Pseudonocardia                  s__Pseudonocardia sp. Ae331_Ps2
34126  g__Pseudonocardia                  s__Pseudonocardia sp. Ae331_Ps2

As you can see, the data frame is not structured the way it should be. It was generated this way, so I have no control over it.

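The problem is that each microorganism should be annotated with a different taxonomic rank in each column (from Phylum to Species, one rank per column). As you can see, in some cases a rank is missing. For example, the row named 17967 (row 4) has no class rank (no "c__" annotation). What happens is that the Class column holds this taxon's order ("o__Nostocales") instead of an empty "c__" annotation, and every later rank is shifted one column to the left. The same happens elsewhere: row 2 has no genus ("g__") annotation, so the species ended up in the Genus column.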
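To see which rows are affected, one option is to compare each cell's prefix with the prefix its column should carry. The following is only a sketch: the names `expected`, `demo`, `ok` and `first_bad` are illustrative and not part of the question, and `demo` holds three rows copied from the data above so that the snippet runs on its own.

```r
# Expected one-letter rank prefix for each taxonomy column (as described in the question).
expected <- c(Phylum = "p", Class = "c", Order = "o",
              Family = "f", Genus = "g", Species = "s")

# Three rows copied from the question's data, so the sketch is self-contained.
demo <- data.frame(
  Gene    = c("k141_1305_1", "k141_1406_2", "k141_6669_1"),
  Phylum  = c("p__Actinobacteria", "p__Firmicutes", "p__Cyanobacteria"),
  Class   = c("c__Actinobacteria", "c__Clostridia", "o__Nostocales"),
  Order   = c("o__Pseudonocardiales", "o__Clostridiales", "f__Hapalosiphonaceae"),
  Family  = c("f__Pseudonocardiaceae", "f__Lachnospiraceae", "g__Fischerella"),
  Genus   = c("g__Pseudonocardia", "s__Lachnospiraceae bacterium 10-1",
              "s__Fischerella muscicola"),
  Species = c("s__Pseudonocardia sp. Ae331_Ps2", "unknown", "unknown"),
  row.names = c("3212", "3853", "17967"),
  stringsAsFactors = FALSE
)

# TRUE where a cell starts with the prefix its column expects, e.g. "c__" in Class.
ok <- sapply(names(expected), function(col) {
  startsWith(demo[[col]], paste0(expected[[col]], "__"))
})
rownames(ok) <- rownames(demo)

# For each row, the first column that holds the wrong rank (NA if the row is consistent).
first_bad <- apply(ok, 1, function(r) names(expected)[which(!r)[1]])
print(first_bad)
```

The first mismatching column marks the rank that is missing from that row: Genus for row 3853 and Class for row 17967, while row 3212 is consistent.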

The first row and the last two rows are examples of how it should look.

Is there a quick way to correct these rows so that every column holds its corresponding taxonomic rank?

For example, taking the second row, the correct output would be:

     Gene         Phylum         Class          Order             Family
3853 k141_1406_2  p__Firmicutes  c__Clostridia  o__Clostridiales  f__Lachnospiraceae
     Genus  Species
3853 g__    s__Lachnospiraceae bacterium 10-1

Or the missing rank could carry an explicit g__unknown label:

     Gene         Phylum         Class          Order             Family
3853 k141_1406_2  p__Firmicutes  c__Clostridia  o__Clostridiales  f__Lachnospiraceae
     Genus       Species
3853 g__unknown  s__Lachnospiraceae bacterium 10-1

Best answer

Try this code:

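adds <- function(x) {
  nam <- c("k", "p", "c", "o", "f", "g", "s")     # expected prefix per column ("k" = Gene ID)
  l <- which(is.na(match(nam, substr(x, 1, 1))))  # rank whose prefix is absent from the row
  if (length(l) == 0) return(x)
  # Insert the empty label (e.g. "g__") at its position and drop the trailing "unknown";
  # append() takes one position, so this assumes at most one missing rank per row.
  setNames(head(append(x, paste0(nam[l], "__"), l - 1), -1), names(x))
}

# Apply row by row; apply() returns the rows as columns, so transpose back.
data.frame(t(apply(df, 1, adds)))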

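This should insert the missing rank prefix (for example "g__") into each shifted row, which gives the expected result. Let us know if it helps. Thanks.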
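The answer's `adds` relies on a single call to `append`, so it appears to assume at most one missing rank per row. As a hedged alternative, the sketch below places every value in the column that matches its prefix, which should also cope with rows that lack several ranks. `fix_ranks` and its `fill` argument are hypothetical names introduced here, not part of the original answer. The sketch assumes that each prefix occurs at most once per row and that placeholder cells such as "unknown" can be discarded. `demo` again holds rows copied from the question.

```r
# Hypothetical helper (not from the original answer): put each value into the
# column that matches its rank prefix, and fill ranks that are absent from a row.
fix_ranks <- function(df, fill = "") {
  ranks <- c(Phylum = "p", Class = "c", Order = "o",
             Family = "f", Genus = "g", Species = "s")
  out <- df
  # Start every rank column with its default label:
  # "g__" when fill = "", "g__unknown" when fill = "unknown".
  for (col in names(ranks)) {
    out[[col]] <- paste0(ranks[[col]], "__", fill)
  }
  # Scan the original columns and move each value to the column of its prefix.
  for (src in names(ranks)) {
    values <- as.character(df[[src]])
    for (col in names(ranks)) {
      hit <- which(startsWith(values, paste0(ranks[[col]], "__")))  # which() drops NAs
      out[[col]][hit] <- values[hit]
    }
  }
  out
}

# Three shifted rows copied from the question's data.
demo <- data.frame(
  Gene    = c("k141_1406_2", "k141_6669_1", "k141_17708_1"),
  Phylum  = c("p__Firmicutes", "p__Cyanobacteria", "p__Firmicutes"),
  Class   = c("c__Clostridia", "o__Nostocales", "c__Bacilli"),
  Order   = c("o__Clostridiales", "f__Hapalosiphonaceae", "o__Bacillales"),
  Family  = c("f__Lachnospiraceae", "g__Fischerella", "g__Exiguobacterium"),
  Genus   = c("s__Lachnospiraceae bacterium 10-1", "s__Fischerella muscicola",
              "s__Exiguobacterium enclense"),
  Species = c("unknown", "unknown", "unknown"),
  row.names = c("3853", "17967", "43722"),
  stringsAsFactors = FALSE
)

fixed    <- fix_ranks(demo)                    # empty labels such as "g__"
labelled <- fix_ranks(demo, fill = "unknown")  # labels such as "g__unknown"
print(fixed)
```

Because values are matched by prefix rather than by position, the Gene column and the row names are left untouched, and calling `fix_ranks(df)` on the full data frame should leave the already consistent rows unchanged.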

Regarding rearranging a mixed input data frame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46032777/
