
html - Reading a fixed-width-format text table from an HTML page

Reposted. Author: 行者123. Updated: 2023-11-28 01:57:38

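I am trying to read data from a table like the one at http://www.fec.gov/pubrec/fe1996/hraz.htm using R, but have not been able to make progress. I realize I need XML and RCurl for this, but despite the many examples on the web dealing with similar problems, I have not been able to solve this one.

The first problem is that the table only looks like a table when viewed; it is not coded as one. Treating it as an xml document, I can get at the "data" in the table, but because I want to fetch several tables, I don't think that is the most elegant solution.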

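Treating it as an html document might work better, but I am relatively unfamiliar with xpathApply and don't know how to get at the actual "data" in the table, since it is not enclosed in anything (i.e. <i></i> or <b></b>).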

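I have had some success with xml files in the past, but this is my first attempt at doing something similar with html files. These files in particular seem to have less structure than the other examples I have seen.

Any help is much appreciated.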

Best answer

Assuming you can read the html output into a text file (the equivalent of copy+pasting from your web browser), this should get you most of the way there:

# x is the output from the website, held as a single character string


library(stringr)
library(data.table)

# First, remove commas from numbers (easiest to do at beginning)
x <- gsub(",([0-9])", "\\1", x)

# split the data by District
districts <- strsplit(x, "DISTRICT *")[[1]]

# separate out the header info
headerInfo <- districts[[1]]
districts <- tail(districts, -1)


# grab the straggling district number, use it as a name and remove it

# end of first line
eofl <- str_locate(districts, "\n")[,2]

# trim white space and assign as name
names(districts) <- str_trim(substr(districts, 1, eofl))

# remove first line
districts <- substr(districts, eofl+1, nchar(districts))

# remove the trailing '-------' rule and trim white space
districts <- str_trim(str_replace_all(districts, "---*", ""))

# Adjust delimiter (this is the tricky part)

## two or more spaces are a separator (single spaces inside names are kept)
districts <- str_replace_all(districts, " {2,}", "\t")

## lines that are total tallies are missing two columns.
## thus, need to add two extra delims. After the first and third columns

# this function pads the missing delimiters in one district's text
padDelims <- function(section, splton) {
  # split into lines
  section <- strsplit(section, splton)[[1]]
  # identify lines starting with totals
  LinesToFix <- str_detect(section, "^Total")
  # pad appropriate columns
  section[LinesToFix] <- sub("(.+)\t(.+)\t(.*)?", "\\1\t\t\\2\t\t\\3", section[LinesToFix])

  # any rows missing delims, pad at end
  counts <- str_count(section, "\t")
  toadd <- max(counts) - counts
  section[] <- mapply(function(s, p) if (p == 0) s else paste0(s, paste0(rep("\t", p), collapse = "")), section, toadd)

  # paste it back together and return
  paste(section, collapse = splton)
}

districts <- lapply(districts, padDelims, splton="\n")

# reading the table and simultaneously adding the district column
districtTables <-
  lapply(names(districts), function(d)
    data.table(read.table(text = districts[[d]], sep = "\t"), district = d))
# ... or without adding district number:
## lapply(districts, function(d) data.table(read.table(text = d, sep = "\t")))

# flatten it
votes <- do.call(rbind, districtTables)
setnames(votes, c("Candidate", "Party", "PrimVotes.Abs", "PrimVotes.Perc", "GeneralVotes.Abs", "GeneralVotes.Perc", "District") )
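The delimiter adjustment is the step the script calls tricky, so it may help to see it on its own. The following is a minimal base-R sketch on two made-up lines written in the same fixed-width style (they are illustrative, not copied from the page). It turns runs of two or more spaces into tabs, counts the tabs per line, and pads the short line at the end so that read.table sees the same number of fields on every line. The names lines, tabbed, n_tabs, padded and parsed are hypothetical.

```r
# Made-up lines in the same fixed-width style as the page (not real page text)
lines <- c(
  "Salmon, Matt          R     33672   100.00    135634    60.18",
  "Total Party Votes:          33672"
)

# Runs of two or more spaces become one tab; the single space inside
# "Salmon, Matt" is left alone
tabbed <- gsub(" {2,}", "\t", lines)

# Count the tabs on each line, then pad short lines at the end so that
# every line has the same number of fields
n_tabs <- lengths(regmatches(tabbed, gregexpr("\t", tabbed, fixed = TRUE)))
padded <- paste0(tabbed, strrep("\t", max(n_tabs) - n_tabs))

# quote = "" keeps apostrophes in names from being read as quote characters
parsed <- read.table(text = padded, sep = "\t", quote = "",
                     stringsAsFactors = FALSE)
print(parsed)
```

With end-padding alone, the total on the second line lands in the second field. That is why padDelims also inserts empty fields after the first and third columns on lines that start with Total.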

Sample table:

 votes

                Candidate  Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
 1:          Salmon, Matt      R         33672         100.00        135634.00             60.18        1
 2:    Total Party Votes:                33672             NA               NA                NA        1
 3:                                         NA             NA               NA                NA        1
 4:             Cox, John W(D)/D          1942         100.00         89738.00             39.82        1
 5:    Total Party Votes:                 1942             NA               NA                NA        1
 6:                                         NA             NA               NA                NA        1
 7: Total District Votes:                35614             NA        225372.00                NA        1
 8:            Pastor, Ed      D         29969         100.00         81982.00             65.01        2
 9:    Total Party Votes:                29969             NA               NA                NA        2
10:                                         NA             NA               NA                NA        2
...
51:        Hayworth, J.D.      R         32554         100.00        121431.00             47.57        6
52:    Total Party Votes:                32554             NA               NA                NA        6
53:                                         NA             NA               NA                NA        6
54:          Owens, Steve      D         35137         100.00        118957.00             46.60        6
55:    Total Party Votes:                35137             NA               NA                NA        6
56:                                         NA             NA               NA                NA        6
57:      Anderson, Robert    LBT           148         100.00         14899.00              5.84        6
58:                                         NA             NA               NA                NA        6
59: Total District Votes:                67839             NA        255287.00                NA        6
60:                                         NA             NA               NA                NA        6
61:    Total State Votes:               368185             NA       1356446.00                NA        6
                Candidate  Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
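The answer starts from x already holding the page text. One possible way to get there without copy+pasting is sketched below in base R. It assumes the fixed-width table sits inside a single <pre> block, which is common for text tables on older pages but is not confirmed here for the FEC page. The html string is a made-up stand-in so the sketch runs offline; for a real page it would come from something like paste(readLines(url, warn = FALSE), collapse = "\n").

```r
# Hypothetical stand-in for the downloaded page source (not the real page)
html <- paste(
  "<html><body><h2>Example results page</h2>",
  "<pre>",
  "DISTRICT 1",
  "<b>Salmon, Matt</b>          R    33,672  100.00   135,634   60.18",
  "</pre></body></html>",
  sep = "\n"
)

# Take everything between <pre> and </pre>.
# Assumption: the page has exactly one such block.
# (?i) ignores tag case, (?s) lets "." match newlines.
body <- regmatches(
  html,
  regexpr("(?is)(?<=<pre>).*?(?=</pre>)", html, perl = TRUE)
)

# Drop any inline tags left inside the block, such as <b> or <a href=...>
x <- gsub("<[^>]+>", "", body)
cat(x)
```

The result is a length-one character vector, which is the shape the strsplit(x, "DISTRICT *")[[1]] call in the answer expects. HTML entities such as &amp; are not decoded by this sketch.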

Regarding html - Reading a fixed-width-format text table from an HTML page, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16051292/
