r - How to read a text file into R when the data is not in a table

Reposted. Author: 行者123. Updated: 2023-12-04 12:34:09

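I have a long phone log as a text file, and I have been trying to read it into R without much success. The text has structure, but it is definitely not a table. The structure is as follows: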

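  • Each record spans several lines, so readLines alone is not a good fit
  • Each line of a record is a separate field
  • Some records have an extra field after the second field
  • Each new record is marked by an empty line. If it were possible to specify that records are separated by "\n\n" and fields (or columns) by "\n", then readLines or scan would work

Here is an example: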
    TheInstitute 5467
    telephone line 4125526987 x 4567
    datetime 2011110516 12:56
    blay blay blah who knows what, but anyway it may have a comma

    TheInstitute 5467
    telephone line 4125526987 x 4567
    datetime 2011110516 12:58
    blay blay blah who knows what

    TheInstitute 5467
    telephone line 412552999 x 4999
    bump phone line 4125527777
    datetime 2011110516 12:59
    blay blay blah who knows what

    TheInstitute 5467
    telephone line 4125526987 x 4567
    bump phone line 4125527777
    datetime 2011110516 13:51
    blay blay blah who knows what, but anyway it may have a comma

    TheInstitute 5467
    telephone line 4125526987 x 4567
    datetime 2011110516 14:56
    blay blay blah who knows what

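How can I do this in R? I have tried tricks with scan, paste, strsplit and so on, but I keep going around in circles. I probably have to put it into a list, since a list can handle an unequal number of elements. I would like all records to have the same number of fields, and for records that lack a field (here, the one called bump phone) I would like that field to simply hold NA. Even just a starting point would be much appreciated; I can experiment from there.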
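The list idea mentioned in the question can be sketched as follows. This is only one possible starting point, not the accepted approach: it reads all lines, treats each blank line as a record boundary, and splits the remaining lines into a list with one character vector per record. The names `txt`, `record_id`, and `records` are hypothetical, and `txt` is a shortened stand-in for the real log file.

```r
# Hypothetical stand-in for the log; with a real file, use readLines("your_log.txt")
txt <- c(
  "TheInstitute 5467",
  "telephone line 4125526987 x 4567",
  "datetime 2011110516 12:56",
  "blay blay blah who knows what, but anyway it may have a comma",
  "",
  "TheInstitute 5467",
  "telephone line 412552999 x 4999",
  "bump phone line 4125527777",
  "datetime 2011110516 12:59",
  "blay blay blah who knows what"
)

con <- textConnection(txt)
lines <- readLines(con)
close(con)

# A line is a separator if it is empty or contains only whitespace
is_blank <- !nzchar(trimws(lines))

# The record number increases by one after every blank line
record_id <- cumsum(is_blank) + 1L

# Drop the separators and group the remaining lines by record
records <- unname(split(lines[!is_blank], record_id[!is_blank]))

# Number of fields per record (4 without a bump phone line, 5 with one)
lengths(records)
```

Because each list element keeps only the lines that are actually present, records with and without the extra field can sit side by side until they are padded to a common length.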

Best answer

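With multi.line = TRUE, scan can read records that span several lines and end with two end-of-lines. Because sep = "\n" makes every line a field, a four-line record picks up the blank separator line as an empty fifth field. I do this here with textConnection around your text, but you would use a valid file name instead: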

    > # txt holds the sample text from the question
    > inp <- scan(textConnection(txt), multi.line = TRUE,
    +             what = list(place = "character", tline1 = "character",
    +                         cline1 = "character", cline2 = "character", cline3 = "character"),
    +             sep = "\n")
    Read 5 records
    > str(as.data.frame(inp))
    'data.frame': 5 obs. of 5 variables:
    $ place : Factor w/ 1 level "TheInstitute 5467": 1 1 1 1 1
    $ tline1: Factor w/ 2 levels " telephone line 4125526987 x 4567",..: 1 1 2 1 1
    $ cline1: Factor w/ 4 levels " bump phone line 4125527777",..: 2 3 1 1 4
    $ cline2: Factor w/ 4 levels " blay blay blah who knows what",..: 2 1 3 4 1
    $ cline3: Factor w/ 3 levels ""," blay blay blah who knows what",..: 1 1 2 3 1
    > as.data.frame(inp)
                  place                           tline1
    1 TheInstitute 5467 telephone line 4125526987 x 4567
    2 TheInstitute 5467 telephone line 4125526987 x 4567
    3 TheInstitute 5467  telephone line 412552999 x 4999
    4 TheInstitute 5467 telephone line 4125526987 x 4567
    5 TheInstitute 5467 telephone line 4125526987 x 4567
                          cline1
    1  datetime 2011110516 12:56
    2  datetime 2011110516 12:58
    3 bump phone line 4125527777
    4 bump phone line 4125527777
    5  datetime 2011110516 14:56
                                                             cline2
    1 blay blay blah who knows what, but anyway it may have a comma
    2                                 blay blay blah who knows what
    3                                     datetime 2011110516 12:59
    4                                     datetime 2011110516 13:51
    5                                 blay blay blah who knows what
                                                             cline3
    1
    2
    3                                 blay blay blah who knows what
    4 blay blay blah who knows what, but anyway it may have a comma
    5
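In the output above, the datetime and free-text fields shift one column to the right whenever a bump phone line is present, so the columns are not yet aligned the way the question asks. The following is a minimal sketch of one way to realign them and put NA where the bump phone field is missing. It assumes the extra field always starts with "bump phone" and always follows the telephone line; `has_bump`, `aligned`, and the output column names are hypothetical, and `txt` is a shortened stand-in for the log. `stringsAsFactors = FALSE` is passed explicitly because the Factor columns shown by str() above come from the pre-4.0.0 default.

```r
# Hypothetical stand-in for the log; the trailing "" ends the last record with a blank line
txt <- c(
  "TheInstitute 5467",
  "telephone line 4125526987 x 4567",
  "datetime 2011110516 12:56",
  "blay blay blah who knows what, but anyway it may have a comma",
  "",
  "TheInstitute 5467",
  "telephone line 412552999 x 4999",
  "bump phone line 4125527777",
  "datetime 2011110516 12:59",
  "blay blay blah who knows what",
  "",
  "TheInstitute 5467",
  "telephone line 4125526987 x 4567",
  "datetime 2011110516 14:56",
  "blay blay blah who knows what",
  ""
)

# Same scan() call as in the answer; quiet = TRUE only hides the "Read n records" message
con <- textConnection(txt)
inp <- scan(con, multi.line = TRUE, sep = "\n", quiet = TRUE,
            what = list(place = "", tline1 = "", cline1 = "", cline2 = "", cline3 = ""))
close(con)
df <- as.data.frame(inp, stringsAsFactors = FALSE)

# Assumption: the optional field is recognisable by its "bump phone" prefix
has_bump <- grepl("^[[:space:]]*bump phone", df$cline1)

# Shift the columns back so every field has a fixed position
aligned <- data.frame(
  place    = df$place,
  tline    = df$tline1,
  bump     = ifelse(has_bump, df$cline1, NA_character_),
  datetime = ifelse(has_bump, df$cline2, df$cline1),
  note     = ifelse(has_bump, df$cline3, df$cline2),
  stringsAsFactors = FALSE
)

aligned[, c("bump", "datetime")]
```

Matching on the field's prefix rather than on the number of lines keeps the sketch usable if the free-text line happens to be empty, but it would need adjusting for logs where the optional field is labelled differently.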

Regarding "r - How to read a text file into R when the data is not in a table", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8422949/
