python - Web Scraping with BeautifulSoup4 - How to filter tags containing a specific string?

Reposted. Author: 行者123. Updated: 2023-12-04 08:56:45
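How can I filter the following HTML snippet so that the span tags whose id contains "Codigo" are appended to list A, the span tags whose id contains "Acao" to list B, and so on?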

Expected output:

List A: ['ABEV3', 'AZUL4']
List B: ['AMBEV S/A', 'AZUL']
List C: ['ON', 'PN']
List D: [4355174839, 326903173]
List E: [2.948, 0.432]
[...]
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.355.174.839</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">2,948</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblCodigo">AZUL4</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblAcao">AZUL</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblTipo">PN N2</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblQtdeTeorica_Formatada">326.903.173</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblPart_Formatada">0,432</span>
[...]

Best answer

To build the various lists, you can use the CSS attribute selector [id$="..."], which matches tags whose id attribute ends with the given string. For example:

from bs4 import BeautifulSoup


html_data = '''
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblCodigo">ABEV3</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblAcao">AMBEV S/A</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblTipo">ON</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblQtdeTeorica_Formatada">4.355.174.839</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl04_lblPart_Formatada">2,948</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblCodigo">AZUL4</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblAcao">AZUL</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblTipo">PN N2</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblQtdeTeorica_Formatada">326.903.173</span>,
<span id="ctl00_contentPlaceHolderConteudo_grdResumoCarteiraTeorica_ctl00_ctl06_lblPart_Formatada">0,432</span>
'''

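soup = BeautifulSoup(html_data, 'html.parser')

# Each selector matches every tag whose id ends with the given suffix.
list_a = [t.text for t in soup.select('[id$="_lblCodigo"]')]
list_b = [t.text for t in soup.select('[id$="_lblAcao"]')]
list_c = [t.text for t in soup.select('[id$="_lblTipo"]')]
# Quantities use '.' as the thousands separator: strip it before int().
list_d = [int(t.text.replace('.', '')) for t in soup.select('[id$="_lblQtdeTeorica_Formatada"]')]
# Percentages use ',' as the decimal separator: swap it for '.' before float().
list_e = [float(t.text.replace(',', '.')) for t in soup.select('[id$="_lblPart_Formatada"]')]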

print(list_a)
print(list_b)
print(list_c)
print(list_d)
print(list_e)

Prints:
['ABEV3', 'AZUL4']
['AMBEV S/A', 'AZUL']
['ON', 'PN N2']
[4355174839, 326903173]
[2.948, 0.432]
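
Note that the answer prints 'PN N2' for the second type, while the question's expected output shows 'PN'. The following is a minimal sketch of two variations, not part of the original answer: the substring selector [id*="..."], which matches the question's literal wording ("contains"), and keeping only the first word of the type. The helper names `texts_for` and `first_word` are hypothetical, and the long id prefix is shortened to `grd_` purely for readability.

```python
from bs4 import BeautifulSoup

# Sample rows modelled on the question; the "grd_" id prefix is a shortened
# stand-in for the real, much longer prefix (an assumption for readability).
html_data = '''
<span id="grd_ctl04_lblCodigo">ABEV3</span>
<span id="grd_ctl04_lblTipo">ON</span>
<span id="grd_ctl06_lblCodigo">AZUL4</span>
<span id="grd_ctl06_lblTipo">PN N2</span>
'''

soup = BeautifulSoup(html_data, 'html.parser')


def texts_for(soup, selector):
    """Return the stripped text of every tag matched by a CSS selector."""
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


def first_word(text):
    """Keep only the first whitespace-separated word; leave empty text alone."""
    parts = text.split()
    return parts[0] if parts else text


# [id*="..."] matches ids that CONTAIN the string anywhere,
# whereas [id$="..."] only matches ids that END with it.
codes = texts_for(soup, '[id*="Codigo"]')

# Reduce 'PN N2' to 'PN', as in the question's expected output.
types = [first_word(t) for t in texts_for(soup, '[id$="_lblTipo"]')]

print(codes)
print(types)
```

The suffix selector is the stricter of the two: [id*="Codigo"] would also match any other id that happens to contain "Codigo" somewhere, so it is only safe when the page has no such ids.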

This post is based on a similar question found on Stack Overflow, "python - Web Scraping with BeautifulSoup4 - How to filter tags containing a specific string?": https://stackoverflow.com/questions/63780662/