
python - Unable to correctly parse the names from some elements

Reposted. Author: 行者123. Updated: 2023-12-01 02:27:53

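I've written a script in Python to parse some names out of a few elements. When I run the script it does parse the names, but the output looks odd: the names come out joined together, so each paragraph reads like one long name. In the HTML the names are separated by br tags. How can I get each name individually?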

The elements that contain the names:

html_content='''
<div class="second-child"><div class="richText"> <p></p>
<p><strong>D<br></strong>Daiwa House Industry<br>Danske Bank<br>DaVita HealthCare Partners<br>Delphi Automotive<br>Denso<br>Dentsply International<br>Deutsche Boerse<br>Deutsche Post<br>Deutsche Telekom<br>Diageo<br>Dialight<br>Digital Realty Trust<br>Donaldson Company<br>DSM<br>DS Smith </p>
<p><strong>E<br></strong>East Japan Railway Company<br>eBay<br>EDP Renováveis<br>Edwards Lifesciences<br>Elekta<br>EnerNOC<br>Enphase Energy<br>Essilor<br>Etsy<br>Eurazeo<br>European Investment Bank (EIB)<br>Evonik Industries<br>Express Scripts&nbsp;<br><br><strong>F<br></strong>Fielmann<br>First Solar<br>FMO<br>Ford Motor<br>Fresenius Medical Care<br><br></p></div></div>
'''

The script I wrote to parse the names:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content,"lxml")
for items in soup.select(".second-child"):
    name = ' '.join([item.text for item in items.select("p")])
    print(name)

My output (partial result):

DDaiwa House IndustryDanske BankDaVita HealthCare PartnersDelphi AutomotiveDensoDentsply InternationalDeutsche

The output I would like to get:

DDaiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International

FYI, when I look closely at the result, I can see that the individual names are attached to each other with no gap in between.

Best answer

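item.text strips out all tags, including <br>, so the text on either side of each <br> ends up joined together. Replace every <br> tag with '\n' before reading the text. This follows the answer Ian Mackinnon gave to the question: Convert </br> to end line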
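To make the effect concrete, here is a minimal sketch on a shortened, made-up snippet (three of the names from the question). It uses the built-in "html.parser" instead of "lxml" only so that it runs without an extra dependency; that choice is an assumption of this sketch, not part of the answer.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative snippet; not the full HTML from the question.
snippet = "<p>Denso<br>Diageo<br>DSM</p>"
soup = BeautifulSoup(snippet, "html.parser")

# .text concatenates the text nodes and ignores the <br> tags between them.
joined = soup.p.text

# Swap each <br> for a newline string so the separation survives in .text.
for br in soup.find_all("br"):
    br.replace_with("\n")

separated = soup.p.text

print(repr(joined))
print(repr(separated))
```

The first value has the names run together, while the second keeps one name per line.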

Your script should be:

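from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content,"lxml")

# Replace every <br> tag with a newline so the names stay separated
for br in soup.find_all("br"):
    br.replace_with("\n")

# Join the text of all <p> elements inside each .second-child container
for items in soup.select(".second-child"):
    name = ' '.join([item.text for item in items.select("p")])
    print(name)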

And the output:

 D
Daiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International
Deutsche Boerse
Deutsche Post
Deutsche Telekom
Diageo
Dialight
Digital Realty Trust
Donaldson Company
DSM
DS Smith E
East Japan Railway Company
eBay
EDP Renováveis
Edwards Lifesciences
Elekta
EnerNOC
Enphase Energy
Essilor
Etsy
Eurazeo
European Investment Bank (EIB)
Evonik Industries
Express Scripts 

F
Fielmann
First Solar
FMO
Ford Motor
Fresenius Medical Care
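If a list of individual names is more useful than one printed block, another option is BeautifulSoup's stripped_strings generator, which yields each text node with surrounding whitespace removed and skips empty ones. The sketch below is an alternative to the accepted answer, not a part of it. It assumes a shortened copy of the question's HTML, the built-in "html.parser" (so no lxml install is needed), and a hypothetical variable name, names. The single-letter headings (D, E, F) are kept as ordinary entries.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question, for illustration only.
html_content = '''
<div class="second-child"><div class="richText"> <p></p>
<p><strong>D<br></strong>Daiwa House Industry<br>Danske Bank<br>DS Smith </p>
<p><strong>E<br></strong>East Japan Railway Company<br>eBay<br>Express Scripts&nbsp;<br><br><strong>F<br></strong>Fielmann<br>First Solar<br><br></p></div></div>
'''

soup = BeautifulSoup(html_content, "html.parser")

names = []  # hypothetical name for the collected result
for container in soup.select(".second-child"):
    for p in container.select("p"):
        # stripped_strings yields one entry per non-empty text node,
        # so text separated by <br> tags arrives as separate items.
        names.extend(p.stripped_strings)

for name in names:
    print(name)
```

Because each name is its own list element, trailing spaces and the non-breaking space after "Express Scripts" are trimmed, and the empty lines produced by consecutive <br> tags do not appear.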

Regarding "python - Unable to correctly parse the names from some elements", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47179002/
