r - Parse a colon-separated list into a data.frame

Reposted. Author: 行者123. Updated: 2023-12-01 12:13:12

This question is a follow-up to this question.

The metadata.txt below was generated with: pdftk sample.pdf dump_data > metadata.txt

metadata.txt:

InfoBegin
InfoKey: ModDate
InfoValue: D:20170817080316Z00'00'
InfoBegin
InfoKey: CreationDate
InfoValue: D:20170817080316Z00'00'
InfoBegin
InfoKey: Creator
InfoValue: Adobe Acrobat 7.0
InfoBegin
InfoKey: Producer
InfoValue: Mac OS X 10.9.5 Quartz PDFContext
PdfID0: 76cf9fd41f0778314abfec8b34d8388d
PdfID1: 76cf9fd41f0778314abfec8b34d8388d
NumberOfPages: 612
BookmarkBegin
BookmarkTitle: Contents
BookmarkLevel: 1
BookmarkPageNumber: 11
BookmarkBegin
BookmarkTitle: Preface
BookmarkLevel: 1
BookmarkPageNumber: 5
BookmarkBegin
BookmarkTitle: Explanatory Note and Abbreviations Used
BookmarkLevel: 1
BookmarkPageNumber: 7
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 405 616
PageMediaDimensions: 405 616

I want R to read the table of contents (TOC) information from metadata.txt into a data.frame, starting at the first BookmarkBegin and ending at the BookmarkPageNumber that comes immediately before PageMediaBegin.

The region of interest can be extracted with the following code:

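library(stringi)

# Read the pdftk dump; given a path, readLines opens and closes the file itself
metadata <- readLines('metadata.txt')

# Keep the lines from the first BookmarkBegin to the last BookmarkPageNumber
existing_toc <- c(min(grep('BookmarkBegin', metadata)), max(grep('BookmarkPageNumber', metadata)))
metadata_toc <- metadata[existing_toc[1]:existing_toc[2]]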
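
As a self-contained illustration of this filtering step, here is a minimal sketch that runs without metadata.txt. The shortened `metadata` vector is a hypothetical stand-in for the result of readLines, and the anchored patterns (`^...$`) are an assumption added here, not part of the question's code; they only guard against a bookmark title that happens to contain one of the key names.

```r
# Hypothetical stand-in for readLines('metadata.txt'): a shortened pdftk dump
metadata <- c(
  "NumberOfPages: 612",
  "BookmarkBegin",
  "BookmarkTitle: Contents",
  "BookmarkLevel: 1",
  "BookmarkPageNumber: 11",
  "BookmarkBegin",
  "BookmarkTitle: Preface",
  "BookmarkLevel: 1",
  "BookmarkPageNumber: 5",
  "PageMediaBegin",
  "PageMediaNumber: 1"
)

# Index of the first record marker and of the last page-number line
first <- min(grep("^BookmarkBegin$", metadata))
last  <- max(grep("^BookmarkPageNumber: ", metadata))

# Everything between the two indices is the TOC block
metadata_toc <- metadata[first:last]
print(metadata_toc)
```

If a PDF has no bookmarks at all, grep returns an empty vector and min/max fall back to Inf/-Inf with a warning, so a real script would probably want to check for that case before subsetting.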

Removing the BookmarkBegin lines and splitting each line at the first occurrence of ": ", with:

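# Drop the BookmarkBegin marker lines (grepl stays safe even when nothing matches)
toc_data <- metadata_toc[!grepl('BookmarkBegin', metadata_toc)]
# Split each line into key and value at the first ": " only
toc_data_split <- stri_split_fixed(toc_data, ": ", n = 2)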

gives me the following list:

[[1]]
[1] "BookmarkTitle" "Contents"

[[2]]
[1] "BookmarkLevel" "1"

[[3]]
[1] "BookmarkPageNumber" "11"

[[4]]
[1] "BookmarkTitle" "Preface "

[[5]]
[1] "BookmarkLevel" "1"

[[6]]
[1] "BookmarkPageNumber" "5"

[[7]]
[1] "BookmarkTitle"
[2] "Explanatory Note and Abbreviations Used "

[[8]]
[1] "BookmarkLevel" "1"

[[9]]
[1] "BookmarkPageNumber" "7"

How should I proceed from here to get a data.frame like this:

structure(list(BookmarkTitle = structure(c(1L, 3L, 2L), .Label = c("Contents", 
"Explanatory Note and Abbreviations Used", "Preface"), class = "factor"),
BookmarkLevel = c(1, 1, 1), BookMarkPageNumber = c(11, 5,
7)), .Names = c("BookmarkTitle", "BookmarkLevel", "BookMarkPageNumber"
), row.names = c(NA, -3L), class = "data.frame")

                            BookmarkTitle BookmarkLevel BookMarkPageNumber
1                                Contents             1                 11
2                                 Preface             1                  5
3 Explanatory Note and Abbreviations Used             1                  7

Best answer

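This base R solution converts metadata_toc into a data frame. First, replace every line that contains no colon with an empty line. The text is then in Debian control file (DCF) format, where blank lines separate records, so it can be read with read.dcf. Finally, convert the resulting matrix m into a data frame DF and convert the column types to character and numeric.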

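# Blank out the lines without a colon (the BookmarkBegin markers)
metadata_toc[grep(":", metadata_toc, invert = TRUE)] <- ""
# Blank lines now separate the records, which is what read.dcf expects
m <- read.dcf(textConnection(metadata_toc))
DF <- as.data.frame(m, stringsAsFactors = FALSE)
# Convert each column to its natural type (character stays character)
DF[] <- lapply(DF, type.convert, as.is = TRUE)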

giving:

> DF
                            BookmarkTitle BookmarkLevel BookmarkPageNumber
1                                Contents             1                 11
2                                 Preface             1                  5
3 Explanatory Note and Abbreviations Used             1                  7
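
To make the DCF idea more concrete, here is a small sketch that feeds read.dcf two records written out by hand, with the blank separator line already in place. The `dcf_lines` vector and the explicit `con` variable are illustrative assumptions; closing the connection explicitly is just a tidy-up choice, not something the answer requires.

```r
# Two hand-written records; the empty string is the blank line
# that DCF uses to separate one record from the next
dcf_lines <- c(
  "BookmarkTitle: Contents",
  "BookmarkLevel: 1",
  "BookmarkPageNumber: 11",
  "",
  "BookmarkTitle: Preface",
  "BookmarkLevel: 1",
  "BookmarkPageNumber: 5"
)

# read.dcf returns a character matrix: one row per record,
# one column per field name
con <- textConnection(dcf_lines)
m <- read.dcf(con)
close(con)

# Same conversion as in the answer: matrix -> data frame -> typed columns
DF <- as.data.frame(m, stringsAsFactors = FALSE)
DF[] <- lapply(DF, type.convert, as.is = TRUE)
str(DF)
```

Since every value in the matrix is a string, the type.convert step is what turns BookmarkLevel and BookmarkPageNumber into integers while leaving BookmarkTitle as character.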

Note: the input metadata_toc used above, in reproducible form:

metadata_toc <- c("BookmarkBegin", "BookmarkTitle: Contents", "BookmarkLevel: 1", 
"BookmarkPageNumber: 11", "BookmarkBegin", "BookmarkTitle: Preface ",
"BookmarkLevel: 1", "BookmarkPageNumber: 5", "BookmarkBegin",
"BookmarkTitle: Explanatory Note and Abbreviations Used ", "BookmarkLevel: 1",
"BookmarkPageNumber: 7")
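
For comparison, here is a sketch of a different route that skips read.dcf and continues from the key/value split in the question, using base R only. It is not the answer's method. It assumes every bookmark carries the same fields in the same order, and the third title is a hypothetical one containing a colon, chosen to show why the split must happen at the first ": " only.

```r
# Input in the same shape as above; the third title is hypothetical
metadata_toc <- c("BookmarkBegin", "BookmarkTitle: Contents", "BookmarkLevel: 1",
  "BookmarkPageNumber: 11", "BookmarkBegin", "BookmarkTitle: Preface ",
  "BookmarkLevel: 1", "BookmarkPageNumber: 5", "BookmarkBegin",
  "BookmarkTitle: Note: Abbreviations ", "BookmarkLevel: 1",
  "BookmarkPageNumber: 7")

# Drop the record markers; every remaining line is "key: value"
toc_data <- metadata_toc[!grepl("^BookmarkBegin$", metadata_toc)]

# Split at the first ": " only, so colons inside a title survive
keys   <- sub(": .*$", "", toc_data)
values <- trimws(sub("^[^:]*: ", "", toc_data))

# Assumption: fields repeat in a fixed order, so the values can be
# laid out row by row with one column per distinct key
fields <- unique(keys)
m <- matrix(values, ncol = length(fields), byrow = TRUE,
            dimnames = list(NULL, fields))

DF <- as.data.frame(m, stringsAsFactors = FALSE)
DF[] <- lapply(DF, type.convert, as.is = TRUE)
print(DF)
```

The fixed-order assumption is the weak point of this variant: if a record lacked a field, the matrix would silently misalign, whereas read.dcf matches values to fields by name.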

Regarding r - Parse a colon-separated list into a data.frame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50284647/
