
python - BeautifulSoup parsing HTML that contains line breaks

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:14:24

I am using BeautifulSoup to parse some HTML from a text file. The text is written into a dictionary like this:

import codecs

websites = ["1"]

html_dict = {}

for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        # Read the file and store it as a list of lines
        get_raw_html = out.read().splitlines()
        html_dict.update({website_id: get_raw_html})

I then parse the HTML stored in html_dict to find the text inside <p> tags:

from bs4 import BeautifulSoup

scraped = {}

for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    # Each element of raw_html is one line of the file
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all('p')

This is what the HTML in html_dict looks like:

<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>

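The problem is that BeautifulSoup seems to treat the line break as the end of the tag and ignores the second line. So when I print scrape_selected_tags, the output is: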

<p>Hey, this should be scraped</p>

when I expected the full text.

How can I avoid this? I tried splitting the lines in html_dict, but that does not seem to work. Thanks in advance.

Best answer

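By calling splitlines when you read the HTML document, you break the tags apart across a list of strings.
Instead, you should read all of the HTML into a single string.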

import codecs

websites = ["1"]
html_dict = {}

for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        # Keep the whole document as one string (no splitlines)
        get_raw_html = out.read()
        html_dict.update({website_id: get_raw_html})

Then remove the inner for loop so that you no longer iterate over that string.

from bs4 import BeautifulSoup

scraped = {}

for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    # Parse the complete document once
    soup = BeautifulSoup(raw_html, 'html.parser')
    scrape_selected_tags = soup.find_all('p')
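
Note that the loop above finds the tags but never stores them in scraped. A minimal sketch of one way to do that follows. It assumes you want the plain text of each <p> with the line break collapsed to a single space, which the question does not state, and it uses a hypothetical in-memory html_dict in place of the file so that it runs on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents read from raw_html.txt.
html_dict = {
    "1": "<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>",
}

scraped = {}
for website_id, raw_html in html_dict.items():
    soup = BeautifulSoup(raw_html, "html.parser")
    # Assumption: collapse the line break (and any other whitespace run)
    # inside each paragraph into a single space.
    scraped[website_id] = [
        " ".join(p.get_text().split()) for p in soup.find_all("p")
    ]

print(scraped)
```

If you would rather keep the Tag objects themselves, storing soup.find_all("p") directly should work as well.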

BeautifulSoup can handle line breaks inside a tag. Let me give an example:

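from bs4 import BeautifulSoup

html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''

# The whole tag, line break included, goes into one BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('p'))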

[<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>]

But if you split one tag across several BeautifulSoup objects:

from bs4 import BeautifulSoup

html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''

# Each line is parsed as if it were a separate document
for line in html.splitlines():
    soup = BeautifulSoup(line, 'html.parser')
    print(soup.find_all('p'))

[<p>Hey, this should be scraped</p>]
[]
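
To see why the two results differ, it may help to look at the fragments that splitlines produces, without involving BeautifulSoup at all. The sketch below uses only the standard library. The names pieces, has_open and has_close are illustrative and do not appear in the original question.

```python
html = """<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>"""

pieces = html.splitlines()
for piece in pieces:
    # Check which half of the <p>...</p> pair each fragment carries.
    has_open = "<p>" in piece
    has_close = "</p>" in piece
    print(repr(piece), has_open, has_close)
```

The first fragment carries only the opening <p> and the second carries only the closing </p>. Parsed separately, the first appears to be closed automatically at the end of its input, and the second contains no <p> element for find_all to return, which matches the two outputs shown above.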

Regarding python - BeautifulSoup parsing HTML that contains line breaks, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/48288374/
