Python and BS4 | Get all table data with specific text content

Reposted. Author: 太空宇宙. Updated: 2023-11-03 19:55:56

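I am new to Python and web scraping, hence the following question.

I only want to get the table data that contains specific content.

The HTML looks as follows. It is not the first table in the page, so I want to select it by its content: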

    </TABLE></TD></TR>
<TR>
<TD COLSPAN=7 class='x2'>
&nbsp;</TD>
</TR>
<TR>
<TD style="vertical-align:bottom" class='x3'>
EingangsdatumDMYY</TD>
<TD style="vertical-align:bottom" class='x4'>
Techniker</TD>
<TD style="vertical-align:bottom" class='x5'>
Techn.</TD>
<TD style="vertical-align:bottom" class='x6'>
Kunde</TD>
<TD style="vertical-align:bottom" class='x7'>
OffAuftrag</TD>
<TD style="vertical-align:bottom" class='x8'>
Planungsdatum</TD>
<TD style="vertical-align:bottom" class='x8'>
Herstellerreferenz</TD>
</TR>
<TR>
<TD class='x9_0'>
DATE </TD>
<TD class='x10_0'>
default</TD>
<TD class='x11_0'>
00000001</TD>
<TD class='x12_0'>
Company Name</TD>
<TD class='x17_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product A**</TD>
<TD class='x14_0'>
&nbsp;</TD>
<TD class='x15_0'>
&nbsp;</TD>
</TR>
<TR>
<TD class='x9_0'>
DATE </TD>
<TD class='x10_0'>
default</TD>
<TD class='x11_0'>
00000001</TD>
<TD class='x12_0'>
Company Name</TD>
<TD class='x18_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product B**</TD>
<TD class='x14_0'>
&nbsp;</TD>
<TD class='x15_0'>
&nbsp;</TD>
</TR>
<TR>
<TD class='x9_0'>
DATE </TD>
<TD class='x10_0'>
default</TD>
<TD class='x11_0'>
00000001</TD>
<TD class='x12_0'>
Company Name</TD>
<TD class='x19_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product A**</TD>
<TD class='x14_0'>
&nbsp;</TD>
<TD class='x15_0'>
&nbsp;</TD>
</TR>

I know the classes used in this markup are odd, but it is generated, so I cannot change it.

This is the code I currently use to fetch the HTML with BS4:

import urllib2
from bs4 import BeautifulSoup

# specify the url
quote_page = 'Website.html'

# query the website and return the html to the variable page
page = urllib2.urlopen(quote_page)


# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
tables = soup.findChildren('table')

my_table = tables[1]
rows = my_table.findChildren(['th', 'tr'])

print my_table

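Now the problem:

I do get the first row, but I want to search the whole page for every table entry that contains the text "Product A" and store its parent in an array.

For example, once the code has run, the output would be: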

<TD class='x17_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product A**</TD>

<TD class='x19_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product A**</TD>

So the code must:
1) Search the HTML for the text "Product A".
2) Get the parent tag and store it in a variable.
3) Repeat for the whole HTML.
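The three steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: `SAMPLE_HTML` is a reduced, hypothetical version of the table in the question, and `find_parents_of_text` is a helper name made up for this example. It relies on `find_all(string=...)`, which returns the matching text nodes, and on each node's `.parent`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, reduced from the table shown in the question.
SAMPLE_HTML = """
<table>
<tr><td class='x17_0'>1000000 ,STATUS, TECH, DATE TIME, Product A</td></tr>
<tr><td class='x18_0'>1000000 ,STATUS, TECH, DATE TIME, Product B</td></tr>
<tr><td class='x19_0'>1000000 ,STATUS, TECH, DATE TIME, Product A</td></tr>
</table>
"""

def find_parents_of_text(html, needle):
    """Return the parent tag of every text node that contains `needle`."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(re.escape(needle))
    # 1) find the matching text nodes, 2) take each node's parent tag,
    # 3) do this for the whole document and collect the results in a list.
    return [text_node.parent for text_node in soup.find_all(string=pattern)]

matches = find_parents_of_text(SAMPLE_HTML, "Product A")
for cell in matches:
    print(cell)
```

Because the search starts from the text nodes, the result is whatever tag directly encloses the text; here that is the `td`, but with nested markup it could be an inner tag instead.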

I gratefully accept any hints.

Thanks and best regards,
Yannick L.

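Best answer

You can use a regular expression in BS4 to find elements that contain specific text.

If you want to find every td that contains a specific string, you need this: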

import re
from bs4 import BeautifulSoup
page = '''

<TR>
<TD COLSPAN=7 class='x2'>
&nbsp;</TD>
</TR>
<TR>
<TD style="vertical-align:bottom" class='x3'>
EingangsdatumDMYY</TD>
<TD style="vertical-align:bottom" class='x4'>
Techniker</TD>
<TD style="vertical-align:bottom" class='x5'>
Techn.</TD>
<TD style="vertical-align:bottom" class='x6'>
Kunde</TD>
<TD style="vertical-align:bottom" class='x7'>
OffAuftrag</TD>
<TD style="vertical-align:bottom" class='x8'>
Planungsdatum</TD>
<TD style="vertical-align:bottom" class='x8'>
Herstellerreferenz</TD>
</TR>
<TR>
<TD class='x9_0'>
DATE </TD>
<TD class='x10_0'>
default</TD>
<TD class='x11_0'>
00000001</TD>
<TD class='x12_0'>
Company Name</TD>
<TD class='x17_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product A**</TD>
<TD class='x14_0'>
&nbsp;</TD>
<TD class='x15_0'>
&nbsp;</TD>
</TR>
<TR>
<TD class='x9_0'>
DATE </TD>
<TD class='x10_0'>
default</TD>
<TD class='x11_0'>
00000001</TD>
<TD class='x12_0'>
Company Name</TD>
<TD class='x18_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product B**</TD>
<TD class='x14_0'>
&nbsp;</TD>
<TD class='x15_0'>
&nbsp;</TD>
</TR>
<TR>
<TD class='x9_0'>
DATE </TD>
<TD class='x10_0'>
default</TD>
<TD class='x11_0'>
00000001</TD>
<TD class='x12_0'>
Company Name</TD>
<TD class='x19_0'>
1000000 ,STATUS, TECH, DATE TIME, **Product A**</TD>
<TD class='x14_0'>
&nbsp;</TD>
<TD class='x15_0'>
&nbsp;</TD>
</TR>
'''
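soup = BeautifulSoup(page, 'html.parser')
# Collect every <td> whose text matches the pattern.
# html.parser lowercases tag names, so 'td' also matches <TD>.
# `string=` is the current name of the older `text=` argument.
cells = soup.find_all('td', string=re.compile(r'Product A'))
print(cells)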
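One caveat worth illustrating: the regex filter above is compared against a cell's `.string`, which is `None` when the cell holds more than one child, for example text plus a nested `<b>` tag. A minimal sketch for that case, assuming hypothetical markup (`NESTED_HTML`) and a made-up helper name (`cells_containing`), matches on the cell's combined text instead and then climbs to the enclosing row with `find_parent`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the product name sits inside a nested <b> tag.
NESTED_HTML = """
<table>
<tr><td class='x12_0'>Company Name</td><td class='x17_0'>1000000, <b>Product A</b></td></tr>
<tr><td class='x12_0'>Company Name</td><td class='x18_0'>1000000, <b>Product B</b></td></tr>
</table>
"""

def cells_containing(soup, needle):
    """Return every <td> whose combined text (including nested tags) contains `needle`."""
    return soup.find_all(lambda tag: tag.name == "td" and needle in tag.get_text())

soup = BeautifulSoup(NESTED_HTML, "html.parser")
cells = cells_containing(soup, "Product A")

# If the whole record is needed, step up from each cell to its row.
rows = [cell.find_parent("tr") for cell in cells]
for row in rows:
    print([td.get_text(strip=True) for td in row.find_all("td")])
```

Note that with nested tables, an outer `td` that wraps an inner table would also match, since `get_text()` includes the text of all descendants.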

Regarding "Python and BS4 | Get all table data with specific text content", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59532221/
