python - Getting pandas.read_html() to work when an HTML table contains multiple <tbody> tags

Reposted by: 太空宇宙 | Updated: 2023-11-03 17:31:38


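I am trying to parse the tables found at http://www.swiftcodesbic.com, and I am using pandas to scrape them automatically. In most cases this works fine, but one table has two <tbody> tags, which I think is causing the problem. The failing table can be found here.

The code I use to parse the HTML into a pandas.DataFrame is: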

pandas.read_html(countryPage.text, attrs={"id":"t2"}, skiprows=1)[0]

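where countryPage is the object returned by requests.get(). Is there anything I can add to the pandas call to tell it to pick up the second <tbody> tag? Or, if that is not the problem, can someone explain what might be causing it to return a "No tables found" error? Thanks in advance.

Edit

This is the workaround I am currently using, but I would still like to know a more "Pythonic" way.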
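For context on the error itself: pandas.read_html raises a ValueError when no table satisfies the attrs filter, so a mistyped or missing id yields the same kind of failure as a table the parser cannot see. The following is a minimal sketch on made-up inline HTML. SAMPLE_HTML and the helper read_table_by_id are hypothetical names, not part of the original post, and flavor="lxml" is pinned here only so the example depends on a single parser (it assumes lxml is installed).

```python
from io import StringIO

import pandas as pd

# Hypothetical sample markup standing in for a scraped page.
SAMPLE_HTML = """
<table id="t2">
  <tr><th>Bank</th><th>SWIFT</th></tr>
  <tr><td>Example Bank</td><td>EXAMFRPP</td></tr>
</table>
"""


def read_table_by_id(html, table_id):
    """Return the first table with the given id, or None if none is found."""
    try:
        # read_html returns a list of DataFrames; take the first match.
        return pd.read_html(StringIO(html), attrs={"id": table_id}, flavor="lxml")[0]
    except ValueError:
        # Raised when no table matches the given attributes.
        return None


found = read_table_by_id(SAMPLE_HTML, "t2")
missing = read_table_by_id(SAMPLE_HTML, "no-such-id")
```

Here found is a DataFrame while missing is None, which separates "the id is wrong" from "the table parsed but looks odd".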

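import pandas as pd
from bs4 import BeautifulSoup

try:
    tempDataFrame = pd.read_html(countryPage.text, attrs={"id": "t2"}, skiprows=1)[0]
except ValueError:  # read_html raises ValueError when it finds no tables
    if "france" in url:  # only this page needed the workaround
        soup = BeautifulSoup(countryPage.text, "html.parser")
        # the indices will vary based on your situation
        tbody = soup.find_all("table")[2].find_all("tbody")[1]
        # pandas needs the <table> tag to recognize a table
        table = "<table>" + str(tbody) + "</table>"
        tempDataFrame = pd.read_html(table)[0]

Again, I am interested in knowing how to do this in a more efficient way.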
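The workaround can be tried without the live site. Below is a minimal sketch on made-up inline HTML: it selects one <tbody> with BeautifulSoup, wraps it in a fresh <table>, and hands it to pandas. SAMPLE_HTML and the helper read_tbody are hypothetical names for illustration, and the sketch assumes bs4 plus a parser that pandas can use (such as lxml) are installed.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup: one table with two <tbody> sections.
SAMPLE_HTML = """
<table id="t2">
  <tbody><tr><td>Summary</td><td>ignored</td></tr></tbody>
  <tbody>
    <tr><td>Bank A</td><td>AAAAFRPP</td></tr>
    <tr><td>Bank B</td><td>BBBBFRPP</td></tr>
  </tbody>
</table>
"""


def read_tbody(html, table_id, tbody_index):
    """Parse only the tbody_index-th <tbody> of the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError(f"no table with id {table_id!r}")
    bodies = table.find_all("tbody")
    # pandas needs an enclosing <table> tag to recognize a table.
    wrapped = "<table>" + str(bodies[tbody_index]) + "</table>"
    return pd.read_html(StringIO(wrapped))[0]


df = read_tbody(SAMPLE_HTML, "t2", 1)
```

Looking the table up by id instead of by position (soup.find_all("table")[2]) is a design choice in this sketch: it keeps working if the page gains or loses unrelated tables.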

Best answer

Using the match parameter should solve the problem. From the pandas.read_html documentation:

match : str or compiled regular expression, optional

    The set of tables containing text matching this regex or string will be returned. Unless the HTML is extremely simple you will probably need to pass a non-empty string here. Defaults to ‘.+’ (match any non-empty string). The default value will return all tables contained on a page. This value is converted to a regular expression so that there is consistent behavior between Beautiful Soup and lxml.

Try something like this:

tempDataFrame = pd.read_html(countryPage.text, match='foo', skiprows=1)[0]

where foo is a string contained in the table.
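To make the effect of match concrete, here is a minimal sketch on made-up inline HTML with two tables, only one of which contains the text being matched. SAMPLE_HTML, the table contents, and the helper tables_matching are hypothetical. The sketch assumes a parser that pandas can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample markup with two unrelated tables.
SAMPLE_HTML = """
<table id="t1">
  <tr><th>Country</th><th>Code</th></tr>
  <tr><td>Spain</td><td>ES</td></tr>
</table>
<table id="t2">
  <tr><th>Bank</th><th>SWIFT</th></tr>
  <tr><td>Example Bank</td><td>EXAMFRPP</td></tr>
</table>
"""


def tables_matching(html, text):
    """Return the list of tables whose content matches the given text or regex."""
    return pd.read_html(StringIO(html), match=text)


tables = tables_matching(SAMPLE_HTML, "SWIFT")
df = tables[0]  # read_html always returns a list, even for a single match
```

Because match is treated as a regular expression, a string that appears only in the wanted table (a distinctive column header, for instance) is usually the safest choice.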

Regarding "python - Getting pandas.read_html() to work when an HTML table contains multiple <tbody> tags", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/31756508/
