html - Reading a fixed-width-format text table from an HTML page


Reposted by: 行者123. Updated: 2023-11-28 01:57:38


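I am trying to use R to read data from tables like the one at http://www.fec.gov/pubrec/fe1996/hraz.htm, but I have not been able to make any progress. I realize I will need the XML and RCurl packages for this, but despite the many examples on the web that deal with similar problems, I have not been able to solve this one.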

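The first problem is that the table only looks like a table when viewed; it is not marked up as one. Treating the page as an XML document, I can access the "data" in the table, but because I want to fetch several tables, I do not think this is the most elegant solution.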

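Treating it as an HTML document might work better, but I am relatively unfamiliar with xpathApply and do not know how to get at the actual "data" in the table, since it is not enclosed in anything (i.e. an i-/i or b-/b tag pair).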

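I have had some success with XML files in the past, but this is my first attempt at doing something similar with HTML files. These files in particular seem to have less structure than the other examples I have seen.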

Any help is much appreciated.

Best answer

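Assuming you can read the HTML output into a text file (the equivalent of copy + paste from your web browser), this should get you a good chunk of the way there: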

# x is the output from the website, held as a single character string


library(stringr)
library(data.table)

# First, remove commas from numbers (easiest to do at beginning)
x <- gsub(",([0-9])", "\\1", x)

# split the data by District
districts <- strsplit(x, "DISTRICT *")[[1]]

# separate out the header info
headerInfo <- districts[[1]]
districts <- tail(districts, -1)


# grab the straggling district number, use it as a name and remove it 

    # end of first line
    eofl <- str_locate(districts, "\n")[,2]

    # trim white space and assign as name
    names(districts) <- str_trim(substr(districts, 1, eofl))

    # remove first line
    districts <- substr(districts, eofl+1, nchar(districts))

# replace the ending '-------' and trim white space
    districts <- str_trim(str_replace_all(districts, "---*", ""))

# Adjust delimiter (this is the tricky part)

    ## two or more spaces are a separator
    districts <- str_replace_all(districts, "  +", "\t")

    ## lines that are total tallies are missing two columns,
    ##   so add two extra delims: one after the first and one after the third column

        # this function pads the "Total" lines, then right-pads any short rows
        padDelims <- function(section, splton) {
          # split into lines
          section <- strsplit(section, splton)[[1]]
          # identify lines starting with totals
          LinesToFix <- str_detect(section, "^Total")
          # pad appropriate columns
          section[LinesToFix] <- sub("(.+)\t(.+)\t(.*)?", "\\1\t\t\\2\t\t\\3", section[LinesToFix])

          # any rows missing delims, pad at end
          counts <- str_count(section, "\t")
          toadd  <- max(counts) - counts
          section[ ] <- mapply(function(s, p) if (p==0) return (s) else paste0(s, paste0(rep("\t", p), collapse="")), section, toadd) 

          # paste it back together and return
          paste(section, collapse=splton)
        }

    districts <- lapply(districts, padDelims, splton="\n")

    # read each table and simultaneously add the district column
    districtTables <- 
       lapply(names(districts), function(d) 
         data.table(read.table(text=districts[[d]], sep="\t"), district=d) )
    # ... or without adding district number: 
    ##       lapply(districts, function(d) data.table(read.table(text=d, sep="\t")))

    # flatten it 
    votes <- do.call(rbind, districtTables)
    setnames(votes, c("Candidate", "Party", "PrimVotes.Abs", "PrimVotes.Perc", "GeneralVotes.Abs", "GeneralVotes.Perc", "District") )
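
The code above takes `x` as given. As a rough sketch of one way to obtain it, the snippet below pulls the text of the first `<pre>` block out of raw HTML using base R only. Both the helper name `extract_pre_text` and the idea that the table sits inside a `<pre>` element are assumptions made for illustration; they are not confirmed by the question or the answer, so check the page source first. The toy `page` vector stands in for the result of `readLines()` on the real URL.

```r
# Hypothetical helper (not part of the original answer): return the text of
# the first <pre>...</pre> block in a page as one newline-separated string.
extract_pre_text <- function(html_lines) {
  html <- paste(html_lines, collapse = "\n")
  # (?s) lets "." match newlines, (?i) ignores tag case
  m <- regmatches(html, regexpr("(?si)<pre[^>]*>.*?</pre>", html, perl = TRUE))
  if (length(m) == 0) return(html)  # assumption failed: fall back to the whole page
  body <- sub("(?si)^<pre[^>]*>(.*)</pre>$", "\\1", m, perl = TRUE)
  # drop any inline tags left inside the block and decode "&amp;"
  body <- gsub("<[^>]+>", "", body)
  gsub("&amp;", "&", body, fixed = TRUE)
}

# Toy input standing in for readLines("http://www.fec.gov/pubrec/fe1996/hraz.htm")
page <- c("<html><body><PRE>",
          "DISTRICT 1",
          "Salmon, Matt   R   33,672",
          "</PRE></body></html>")
x <- extract_pre_text(page)
cat(x)
```

If the real page turns out not to use `<pre>`, pasting the lines together with `paste(readLines(...), collapse = "\n")` still yields the single string that the `strsplit()` call expects.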

Sample table:

 votes

                        Candidate      Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
 1:                  Salmon, Matt          R         33672         100.00        135634.00             60.18        1
 2:            Total Party Votes:                    33672             NA               NA                NA        1
 3:                                                     NA             NA               NA                NA        1
 4:                     Cox, John     W(D)/D          1942         100.00         89738.00             39.82        1
 5:            Total Party Votes:                     1942             NA               NA                NA        1
 6:                                                     NA             NA               NA                NA        1
 7:         Total District Votes:                    35614             NA        225372.00                NA        1
 8:                    Pastor, Ed          D         29969         100.00         81982.00             65.01        2
 9:            Total Party Votes:                    29969             NA               NA                NA        2
10:                                                     NA             NA               NA                NA        2
...
51:                Hayworth, J.D.          R         32554         100.00        121431.00             47.57        6
52:            Total Party Votes:                    32554             NA               NA                NA        6
53:                                                     NA             NA               NA                NA        6
54:                  Owens, Steve          D         35137         100.00        118957.00             46.60        6
55:            Total Party Votes:                    35137             NA               NA                NA        6
56:                                                     NA             NA               NA                NA        6
57:              Anderson, Robert        LBT           148         100.00         14899.00              5.84        6
58:                                                     NA             NA               NA                NA        6
59:         Total District Votes:                    67839             NA        255287.00                NA        6
60:                                                     NA             NA               NA                NA        6
61:            Total State Votes:                   368185             NA       1356446.00                NA        6
                        Candidate      Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
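
The result still mixes candidate rows with blank filler rows and "Total ..." tallies. A minimal sketch of separating them follows, using a small base R data frame whose values are copied from the first district of the sample table. The name `votes_demo` and the reduced set of columns are illustrative only.

```r
# Toy stand-in for the first rows of `votes` (values from the sample table)
votes_demo <- data.frame(
  Candidate        = c("Salmon, Matt", "Total Party Votes:", "", "Cox, John",
                       "Total Party Votes:", "", "Total District Votes:"),
  Party            = c("R", "", "", "W(D)/D", "", "", ""),
  PrimVotes.Abs    = c(33672, 33672, NA, 1942, 1942, NA, 35614),
  GeneralVotes.Abs = c(135634, NA, NA, 89738, NA, NA, 225372),
  District         = "1",
  stringsAsFactors = FALSE
)

# as.character() guards against the column having been read in as a factor
cand <- trimws(as.character(votes_demo$Candidate))

is_blank <- is.na(cand) | cand == ""   # filler rows left by blank lines
is_total <- grepl("^Total", cand)      # party, district and state tallies

candidates <- votes_demo[!is_blank & !is_total, ]
totals     <- votes_demo[is_total, ]
print(candidates)
```

The same logical vectors can be used to index the `votes` data.table, since `votes[keep, ]` works there as well.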

Regarding html - Reading a fixed-width-format text table from an HTML page, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/16051292/
