r - Opening a txt file in record format in R

Reposted. Author: 行者123. Updated: 2023-12-02 03:36:01

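I would like to read a record-format txt file into R as a data frame in which each row corresponds to one record. The records vary in length. Any idea how I can do this?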

Here is the first record:

# C. elegans orthologs      
# WormBase version: WS241
# Generated:
# File is in record format with records separated by "=\n"
# Sample Record
# WBGeneID \t PublicName \n
# Species \t Ortholog \t MethodsUsedToAssignOrtholog \n
# BEGIN CONTENTS
=
WBGene00000001 aap-1
Ascaris suum GS_11030 WormBase-Compara
Brugia malayi WBGene00227541 WormBase-Compara
Bursephelenchus xylophilus BUX.s00055.227 WormBase-Compara
Caenorhabditis angaria Cang_2012_03_13_00205.g6964.t3 WormBase-Compara
Caenorhabditis brenneri WBGene00194098 TreeFam; WormBase-Compara
Caenorhabditis briggsae WBGene00032086 Hillier-set; OrthoMCL; Inparanoid_7; OMA; WormBase-Compara
Caenorhabditis japonica WBGene00207613 WormBase-Compara
Caenorhabditis remanei WBGene00069407 Inparanoid_7; OMA; TreeFam; WormBase-Compara
Caenorhabditis sp.11 Csp11.Scaffold542.g3421.t1 WormBase-Compara
Caenorhabditis sp.5 Csp5_scaffold_00676.g14307.t1 WormBase-Compara
Danio rerio ENSEMBL:ENSDARP00000056212 TreeFam
Dirofilaria immitis nDi.2.2.2.t01810 WormBase-Compara
Drosophila melanogaster ENSEMBL:FBpp0303635 EnsEMBL-Compara; TreeFam
Haemonchus contortus HCOI02027400.t1 WormBase-Compara
Heterorhabditis bacteriophora Hba_15363 WormBase-Compara
Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam
Loa loa EFO26046.2 WormBase-Compara
Meloidogyne hapla MhA1_Contig1573.frz3.gene15 WormBase-Compara
Mus musculus ENSEMBL:ENSMUSP00000034296 EnsEMBL-Compara; TreeFam
Onchocerca volvulus WBGene00241206 WormBase-Compara
Panagrellus redivivus Pan_g2405.t1 WormBase-Compara
Pristionchus pacificus WBGene00117228 Inparanoid_7; OMA; WormBase-Compara
Trichinella spiralis EFV56516 WormBase-Compara
=
WBGene00000002 aat-1
Ascaris suum GS_20881 WormBase-Compara

Edit: What I actually need is the entry corresponding to "Homo sapiens" in each record. So, ideally, my df in R would be:

WBGene00000001 aap-1 Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam 
WBGene00000002 aat-1 etc etc

Best answer

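I suggest using readLines to read the data into R. Since you gave us the file path in the comments, first open a connection to the file with file, then call readLines on it. Once the data has been read and stored in R, it is always good practice to close the connection.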

> con <- file("../Input/c_elegans.PRJNA13758.current.best_blastp_hits.txt", 
+             open = "r")
> XX <- readLines(con)
> close(con)

> record <- grep("^WBGene", XX, value = TRUE)
> sapien <- grep("Homo sapiens", XX, value = TRUE, fixed = TRUE)
> gsub("\\s+", " ", paste0(record[1], sapien))
## [1] "WBGene00000001 aap-1 Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam"
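
To try the steps above without the original file, here is a minimal sketch that reads a few lines of the sample record from an in-memory connection. The sample_text vector and the use of textConnection in place of file are assumptions made for illustration only. It also joins the pieces with paste rather than paste0, since paste0 only yields a readable result when the record line ends in whitespace, as it does in the OP's file.

```r
# Hypothetical in-memory copy of a few lines of the sample file; the real
# file would be opened with file(path, open = "r") instead.
sample_text <- c(
  "# BEGIN CONTENTS",
  "=",
  "WBGene00000001 aap-1",
  "Danio rerio ENSEMBL:ENSDARP00000056212 TreeFam",
  "Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam",
  "="
)

con <- textConnection(sample_text)  # stands in for file(...)
XX <- readLines(con)
close(con)

# Record headers start with a WBGene ID; species lines never do
record <- grep("^WBGene", XX, value = TRUE)
sapien <- grep("Homo sapiens", XX, value = TRUE, fixed = TRUE)

# Join with an explicit space, then collapse any runs of whitespace
combined <- gsub("\\s+", " ", paste(record, sapien))
print(combined)
```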

The entire record vector for the sample data is

> record
## [1] "WBGene00000001 aap-1 " "WBGene00000002 aat-1 "

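So when we find the Homo sapiens line for record 2, it is pasted onto record 2, the third Homo sapiens line onto record 3, and so on. The pairing is purely positional, so it assumes that every record contains exactly one Homo sapiens line.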

paste0(record, sapien)
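
If some records might have no Homo sapiens line, or more than one, the positional pairing would drift out of step. The following is a rough sketch of a record-aware alternative, not the answer's method: pair_sapiens is a hypothetical helper that splits the lines on the "=" separators described in the file header and looks for Homo sapiens inside each record separately.

```r
# Hypothetical helper: pair each record header with its own Homo sapiens
# line(s), using NA when a record has none.
pair_sapiens <- function(lines) {
  lines <- trimws(lines)
  # Drop blank lines and the "#" header comments
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  # Every "=" line opens a new record, so cumsum yields a record id per line
  is_sep <- lines == "="
  record_id <- cumsum(is_sep)
  blocks <- split(lines[!is_sep], record_id[!is_sep])
  rows <- lapply(blocks, function(block) {
    gene <- grep("^WBGene", block, value = TRUE)[1]
    hs <- grep("^Homo sapiens", block, value = TRUE)
    if (length(hs) == 0) hs <- NA_character_
    data.frame(record = gene, sapiens = hs, stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

# Illustrative input: the second record has no Homo sapiens line
lines <- c(
  "# BEGIN CONTENTS",
  "=",
  "WBGene00000001 aap-1",
  "Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam",
  "=",
  "WBGene00000002 aat-1",
  "Ascaris suum GS_20881 WormBase-Compara"
)
paired <- pair_sapiens(lines)
print(paired)
```

With this input the second row keeps its record header and gets NA in the sapiens column instead of borrowing a line from a later record.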

It is worth noting that the OP's data frame was ultimately created with

do.call(rbind, strsplit(paste0(record, sapien), split = "\\s+"))
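
One thing to keep in mind with the call above is that splitting on whitespace also splits "Homo sapiens" and the method list (for example "Inparanoid_7; TreeFam") into separate pieces, so rows with different numbers of methods end up with different lengths and rbind recycles the shorter ones with a warning. As a hedged alternative sketch, a regular expression can capture the five fields named in the file header. The pattern and the combined vector below are assumptions for illustration, and lines are assumed to have already been joined into one string per record.

```r
# Illustrative input: one combined string per record
combined <- c(
  "WBGene00000001 aap-1 Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam"
)

# Capture groups: gene ID, public name, species, ortholog ID, method list
pattern <- "^(\\S+)\\s+(\\S+)\\s+(Homo sapiens)\\s+(\\S+)\\s+(.*)$"
parts <- regmatches(combined, regexec(pattern, combined))

# Drop the full match (first element) and keep the five captured fields
fields <- lapply(parts, function(p) p[-1])
df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(df) <- c("WBGeneID", "PublicName", "Species", "Ortholog",
               "MethodsUsedToAssignOrtholog")
print(df)
```

The column names are taken from the sample record description in the file header; the method list stays in a single column and can be split on "; " later if needed.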

Regarding r - Opening a txt file in record format in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23415618/
