python - Fixing broken HTML in Python - BeautifulSoup not working

Reposted by: 太空宇宙. Updated: 2023-11-03 12:47:57

I am interested in scraping the text from this table: https://ows.doleta.gov/unemploy/trigger/2011/trig_100211.html and from others like it.

I wrote a quick Python script that works for other tables formatted in a similar way:

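    # `url`, `soup` (the parsed page) and `f` (presumably a csv.writer)
    # are assumed to be defined earlier in the script.
    state = ""
    weeks = ""
    edate = ""
    pdate = url[-11:]   # e.g. "100211.html"
    pdate = pdate[:-5]  # drop ".html", leaving the date stamp

    table = soup.find("table")

    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 13:  # only complete data rows
            state = row.find("th").find(text=True)
            weeks = cells[11].find(text=True)
            edate = cells[12].find(text=True)
            try:
                print(pdate, state, weeks, edate)
                f.writerow([pdate, state, weeks, edate])
            except Exception:
                print(state[1] + " error")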

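However, the script does not work on this table because the tags for half of the rows are broken. Half of the rows have no tag marking the start of the row: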

</tr> #end of last row, on State0  
<td headers = "State1 no info", attributes> <FONT attributes> text </FONT> </td>
<td headers = "State1 no info", attributes> <FONT attributes> text </FONT> </td>
<td headers = "State1 no info", attributes> <FONT attributes> text </FONT> </td>
<td headers = "State1 no info", attributes> <FONT attributes> text </FONT> </td>
</tr> #theoretically, end of row about State1

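Because half of the rows are malformed, BeautifulSoup ignores them. I tried fixing this with tidy, but BeautifulSoup had trouble reading the markup it produced. I have considered solving the problem by generating a new string with the tags in the right places, but I am not sure how to do that.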

Any suggestions?
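For reference, the string-repair idea mentioned above could look roughly like the sketch below. It uses only the standard library and assumes that every broken row follows the pattern shown in the question: a closing `</tr>` followed directly by a `<td>` or `<th>` instead of a new `<tr>`. The name `insert_missing_tr` and the sample markup are hypothetical, and a regular expression like this is fragile compared with a tolerant parser.

```python
import re

# Matches a closing </tr> that is followed directly by a cell (<td> or <th>)
# rather than by a new <tr>, i.e. a row whose opening tag is missing.
MISSING_TR = re.compile(r"(</tr\s*>)(\s*)(?=<t[dh]\b)", re.IGNORECASE)


def insert_missing_tr(html):
    """Return html with a <tr> inserted before each orphaned run of cells."""
    return MISSING_TR.sub(r"\1\2<tr>", html)


# Hypothetical sample mimicking the broken markup from the question.
broken = (
    "<table>\n"
    "<tr><th>State0</th><td>a</td></tr>\n"
    "<td>b</td>\n"
    "<td>c</td>\n"
    "</tr>\n"
    "</table>"
)

repaired = insert_missing_tr(broken)
print(repaired)
```

The repaired string can then be passed to BeautifulSoup as usual. Running the function on already well-formed rows leaves them unchanged, because a `</tr>` followed by `<tr>` does not match the pattern.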

Best answer

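Since different parsers are free to handle broken HTML however they see fit, it is often useful to explore how each of them handles a given case before trying to repair the markup yourself.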

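In this case, you may be interested in how html5lib handles the page: as far as I can tell, it inserts the missing <tr> elements instead of discarding all the orphaned <td> elements the way lxml (the default) does.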

soup = BeautifulSoup(text) #default parser - lxml

soup.table.find_all('tr')[9]
Out[31]: 
<tr bgcolor="#C0C0C0">
<td align="center" headers="Arizona noinfo" width="25"><font size="-2"> </font></td>
<td align="center" headers="Arizona noinfo" width="25"><font size="-2"> </font></td>
<td align="center" headers="Arizona noinfo" width="25"><font size="-2"> </font></td>
<th align="left" id="Arizona " width="100"><font size="-2">Arizona </font></th>
<td align="center" headers="Arizona noinfo" width="50"><font size="-2">2</font></td>
<td align="center" headers="Arizona noinfo" width="50"><font size="-2">2</font></td>
<td align="center" headers="Arizona 13_week_IUR indicators" width="50"><font size="-2">3.03</font></td>
<td align="center" headers="Arizona pct_of_prior_2years indicators" width="50"><font size="-2">79</font></td>
<td align="center" headers="Arizona 3_mo_satur indicators" width="50"><font size="-2">9.3</font></td>
<td align="center" headers="Arizona year pct_of_prior indicators" width="50"><font size="-2">94</font></td>
<td align="center" headers="Arizona 2nd_year pct_of_prior indicators" width="50"><font size="-2">93</font></td>
<td align="center" headers="Arizona 2nd_year pct_of_prior indicators" width="50"><font size="-2">155</font></td>
<td align="center" headers="Arizona avail_wks pct_of_prior indicators noinfo" width="50"><font size="-2"> </font></td>
<td align="center" headers="Arizona dates periods status" width="100"><font size="-2">E 06-11-2011</font></td>
</tr>

soup = BeautifulSoup(text, 'html5lib')

soup.table.find_all('tr')[9] #same path, different result!
Out[33]: 
<tr><td align="center" headers="Alaska noinfo" width="25"><font size="-2"> </font></td>
<td align="center" headers="Alaska noinfo" width="25"><font size="-2"> </font></td>
<td align="center" headers="Alaska noinfo" width="25"><font size="-2"> </font></td>
<th align="left" id="Alaska " width="100"><font size="-2">Alaska </font></th>
<td align="center" headers="Alaska noinfo" width="50"><font size="-2">2</font></td>
<td align="center" headers="Alaska noinfo" width="50"><font size="-2">2</font></td>
<td align="center" headers="Alaska 13_week_IUR indicators" width="50"><font size="-2">3.82</font></td>
<td align="center" headers="Alaska pct_of_prior_2years indicators" width="50"><font size="-2">90</font></td>
<td align="center" headers="Alaska 3_mo_satur indicators" width="50"><font size="-2">7.6</font></td>
<td align="center" headers="Alaska year pct_of_prior indicators" width="50"><font size="-2">96</font></td>
<td align="center" headers="Alaska 2nd_year pct_of_prior indicators" width="50"><font size="-2">95</font></td>
<td align="center" headers="Alaska 2nd_year pct_of_prior indicators" width="50"><font size="-2">117</font></td>
<td align="center" headers="Alaska avail_wks pct_of_prior indicators noinfo" width="50"><font size="-2"> </font></td>
<td align="center" headers="Alaska dates periods status" width="100"><font size="-2">E 06-11-2011</font></td>
</tr>

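More information is available in the bs4 documentation: Differences Between Parsers. Since this table displays fine when rendered in a browser, and html5lib tries to parse pages the same way a browser does, it is a safe bet that this is what you want.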
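Putting this together with the loop from the question, a minimal sketch might look like the following. It assumes `beautifulsoup4` and `html5lib` are installed. The `SAMPLE` markup is a hypothetical two-row miniature of the page (the real rows have 13 `<td>` cells, and the values here are made up), and `extract_rows` is an illustrative helper name, not something from the original script.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4 html5lib

# Hypothetical miniature of the page: the second row lacks its opening <tr>.
SAMPLE = """
<table>
<tr><th id="Alabama">Alabama</th><td>20</td><td>E 06-11-2011</td></tr>
<th id="Alaska">Alaska</th><td>13</td><td>E 06-11-2011</td></tr>
</table>
"""


def extract_rows(html, parser="html5lib", n_cells=2):
    """Collect (state, weeks, end date) from every row with n_cells <td> cells."""
    soup = BeautifulSoup(html, parser)
    rows = []
    for row in soup.find("table").find_all("tr"):
        cells = row.find_all("td")
        header = row.find("th")
        if header is None or len(cells) != n_cells:
            continue  # skip header rows and anything incomplete
        rows.append((
            header.get_text(strip=True),
            cells[-2].get_text(strip=True),
            cells[-1].get_text(strip=True),
        ))
    return rows


rows = extract_rows(SAMPLE)
for state, weeks, edate in rows:
    print(state, weeks, edate)
```

For the real page, `n_cells=13` would correspond to the `len(cells) == 13` check in the question. Naming the parser explicitly also avoids depending on whichever parser BeautifulSoup happens to pick by default on a given machine.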

This question, "python - Fixing broken HTML in Python - BeautifulSoup not working", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/25269145/
