
python - Scraping Wikipedia infoboxes when table cells have mixed formatting

Reposted. Author: 行者123. Updated: 2023-11-30 22:01:11

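I'm trying to scrape Wikipedia infoboxes and get the information for certain keywords. For example: https://en.wikipedia.org/wiki/A%26W_Root_Beer

Suppose I'm looking for the value of Manufacturer. I want the values in a list, and I want only their text. So in this case the desired output would be ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']. No matter what I try, I can't produce this list. Here is a piece of my code: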

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/ABC_Studios"
# requests.get() returns a Response object; BeautifulSoup needs its .text
soup = BeautifulSoup(requests.get(url).text, "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.findAll('tr')
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # take th.text and td.text

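I'd like an approach that works in a variety of cases: when there are line breaks along the way, when some values are links, when some values are paragraphs, and so on. In every case I want only the text we see on screen: not links, not paragraphs, just plain text. I also don't want the output to be Keurig Dr Pepper (United States, Worldwide)A&W Canada (Canada), because later I want to be able to parse the results and do something with each entity.

I'm going through many Wikipedia pages, and I can't find an approach that fits most of them. Can you help me write working code? I'm not good at scraping.

Best answer

OK, here's my attempt (the json library is only there to pretty-print the dictionary):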

import json

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/ABC_Studios"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})

list_of_table_rows = tbl.find_all("tr")
info = {}
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # skip rows that lack either a label (th) or a value (td)
    if th is not None and td is not None:
        innerText = ''
        # walk every node nested inside the cell, in document order
        for elem in td.descendants:
            if isinstance(elem, str):
                # text node: keep its text, trimmed at both ends
                innerText += elem.strip()
            elif elem.name == 'br':
                # line-break tag: record it as a newline
                innerText += '\n'
        info[th.text] = innerText

print(json.dumps(info, indent=1))

This code replaces <br/> tags with \n, which gives:

{
 "Trading name": "ABC Studios",
 "Type": "Subsidiary\nLimited liability company",
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": "ABC Entertainment Group\n(Disney\u2013ABC Television Group)",
 "Website": "abcstudios.go.com"
}
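
The output above also shows a side effect of stripping every text node: "Burbank, California,U.S." has lost a space. As a rough sketch of one possible variant (not part of the original answer), each <br/> can be swapped for a newline first, and whitespace normalized per line afterwards. The function name cell_lines and the HTML fragment are made up for illustration, and html.parser is used only so the sketch does not need lxml. Note that a newline already present in the cell's source text would also count as a break here.

```python
from bs4 import BeautifulSoup


def cell_lines(td):
    """Return the visible text of a cell as a list, one item per <br/>-separated chunk."""
    # Swap every <br> for a newline text node (this mutates the parsed tree).
    for br in td.find_all("br"):
        br.replace_with("\n")
    lines = []
    for line in td.get_text().split("\n"):
        # Collapse runs of whitespace, including non-breaking spaces, to one space.
        cleaned = " ".join(line.split())
        if cleaned:
            lines.append(cleaned)
    return lines


# Hypothetical fragment for illustration, not copied from a real page.
html = (
    '<table><tr><th>Headquarters</th><td>'
    '<a href="#">Burbank</a>, <a href="#">California</a>, U.S.<br/>'
    '(example <b>note</b>)</td></tr></table>'
)
td = BeautifulSoup(html, "html.parser").find("td")
print(cell_lines(td))
```

Because whitespace is collapsed only after the text nodes are joined, the spaces between a link and the text that follows it survive.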

If you want a list instead of a \n-separated string, you can adjust it:

        # goes inside the loop, in place of: info[th.text] = innerText
        innerTextList = innerText.split("\n")
        if len(innerTextList) < 2:
            info[th.text] = innerTextList[0]
        else:
            info[th.text] = innerTextList

This gives:

{
 "Trading name": "ABC Studios",
 "Type": [
  "Subsidiary",
  "Limited liability company"
 ],
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": [
  "ABC Entertainment Group",
  "(Disney\u2013ABC Television Group)"
 ],
 "Website": "abcstudios.go.com"
}
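
The result above mixes strings and lists, and it only splits on <br/>, whereas the question asks for a list and mentions paragraphs as well. The following is a minimal sketch of how the same idea might be extended, under several assumptions that do not come from the answer: list items and paragraphs (li, p) are treated as line boundaries too, every value is returned as a list, and the table is matched on the single class infobox rather than the exact string "infobox vcard". The function name infobox_to_lists and the sample HTML are hypothetical, and the sketch has not been checked against live Wikipedia pages.

```python
from bs4 import BeautifulSoup

# Tags whose end is treated as a line boundary (an assumption of this sketch).
BREAK_TAGS = ["br", "li", "p"]


def infobox_to_lists(html):
    """Map each infobox label to a list of its visible text lines."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="infobox" matches any table whose class list contains "infobox".
    table = soup.find("table", class_="infobox")
    info = {}
    if table is None:
        return info
    for tr in table.find_all("tr"):
        th, td = tr.find("th"), tr.find("td")
        if th is None or td is None:
            continue  # e.g. a title row with a label but no value cell
        # Mark the end of every break tag with a newline text node.
        for tag in td.find_all(BREAK_TAGS):
            tag.insert_after("\n")
        # Split on newlines, collapse whitespace in each piece, drop empty pieces.
        pieces = (" ".join(s.split()) for s in td.get_text().split("\n"))
        info[" ".join(th.get_text().split())] = [s for s in pieces if s]
    return info


# Hypothetical markup for illustration, not copied from a real page.
sample = """
<table class="infobox hproduct">
  <tr><th colspan="2">Example Soda</th></tr>
  <tr><th>Type</th><td><a href="#">Root beer</a></td></tr>
  <tr><th>Manufacturer</th><td><ul>
    <li><a href="#">Maker One</a> (Country A, Worldwide)</li>
    <li><a href="#">Maker Two</a> (Country B)</li>
  </ul></td></tr>
</table>
"""
result = infobox_to_lists(sample)
print(result["Manufacturer"])
```

Returning a list for every label, even when there is a single value, means later code can loop over each entry without first checking its type.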

Regarding python - Scraping Wikipedia infoboxes when table cells have mixed formatting, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54120864/
