
python - Getting the title inside a link tag in HTML using BeautifulSoup

Reposted. Author: 行者123. Updated: 2023-11-30 22:34:03

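I am extracting data from https://data.gov.au/dataset?organization=reservebankofaustralia&_groups_limit=0&groups=business and I get the output I want, but there is a problem: the output reads "Business Support and..." and "Reserve Bank of Australia...", not the full text. I want to print the whole text instead of "..." for every entry. I replaced lines 9 and 10 of jezrael's answer (see "Fetching content from html and write fetched content in a specific format in CSV") with this code:

org = soup.find_all('a', {'class':'nav-item active'})[0].get('title')
groups = soup.find_all('a', {'class':'nav-item active'})[1].get('title')

Running it on its own raises the error "list index out of range". What should I use to extract the complete sentences? I also tried org = soup.find_all('span',class_="filteredpil"), which returns a string-type answer when run on its own but does not work inside the full code.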

Best answer

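Every label that is too long to display in full keeps its complete text in the title attribute of the link, while shorter labels appear only in the element's text. So add two if checks, one for each label: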

import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

# collects one DataFrame per page
dfs = []
# webpage_urls is assumed to be defined earlier (the list of page URLs to scrape)
for i in webpage_urls:
    wiki2 = i
    page = urllib.request.urlopen(wiki2)
    soup = BeautifulSoup(page, "lxml")

    lobbying = {}
    # always only 2 active li, so select first by [0] and second by [1]
    l = soup.find_all('li', class_="nav-item active")

    # prefer the full text in the title attribute; fall back to the visible text
    org = l[0].a.get('title')
    if not org:
        org = l[0].span.get_text()

    groups = l[1].a.get('title')
    if not groups:
        groups = l[1].span.get_text()

    data2 = soup.find_all('h3', class_="dataset-heading")
    for element in data2:
        lobbying[element.a.get_text()] = {}
    prefix = "https://data.gov.au"
    for element in data2:
        lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]
        lobbying[element.a.get_text()]["Organisation"] = org
        lobbying[element.a.get_text()]["Group"] = groups

    #print(lobbying)
    df = pd.DataFrame.from_dict(lobbying, orient='index') \
           .rename_axis('Titles').reset_index()
    dfs.append(df)
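
# combine the per-page frames and drop duplicate titles
df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset='Titles').reset_index(drop=True)

# remove parenthesised numbers such as "(12)"; regex=True is required on pandas 2.0+
df1['Organisation'] = df1['Organisation'].str.replace(r'\(\d+\)', '', regex=True)
df1['Group'] = df1['Group'].str.replace(r'\(\d+\)', '', regex=True)

print(df1.head())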

                                              Titles  \
0                                     Banks – Assets
1  Consolidated Exposures – Immediate and Ultimat...
2  Foreign Exchange Transactions and Holdings of ...
3  Finance Companies and General Financiers – Sel...
4                   Liabilities and Assets – Monthly

                                                link  \
0           https://data.gov.au/dataset/banks-assets
1  https://data.gov.au/dataset/consolidated-expos...
2  https://data.gov.au/dataset/foreign-exchange-t...
3  https://data.gov.au/dataset/finance-companies-...
4  https://data.gov.au/dataset/liabilities-and-as...

                Organisation                            Group
0  Reserve Bank of Australia  Business Support and Regulation
1  Reserve Bank of Australia  Business Support and Regulation
2  Reserve Bank of Australia  Business Support and Regulation
3  Reserve Bank of Australia  Business Support and Regulation
4  Reserve Bank of Australia  Business Support and Regulation
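
The title-or-text fallback can be tried without hitting the live site. The following is only a minimal sketch: SAMPLE_HTML is made-up markup that imitates the structure the answer relies on (the "nav-item active" class sitting on the li, the full label in the link's title attribute), full_label is a hypothetical helper name, and html.parser is used instead of lxml so nothing beyond bs4 is needed. The real page may be structured differently.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup imitating the facet list; not copied from the real page.
SAMPLE_HTML = """
<ul>
  <li class="nav-item active">
    <a href="/a" title="Reserve Bank of Australia (12)"><span>Reserve Bank of Aus...</span></a>
  </li>
  <li class="nav-item active">
    <a href="/b" title=""><span>Business (12)</span></a>
  </li>
  <li class="nav-item">
    <a href="/c" title=""><span>Other</span></a>
  </li>
</ul>
"""


def full_label(li):
    """Return the link's title, or its visible text when the title is empty or missing."""
    label = li.a.get("title") or li.a.get_text()
    # Drop a parenthesised count such as "(12)" and any surrounding whitespace.
    return re.sub(r"\(\d+\)", "", label).strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# In this sample the class is on <li>, not <a>, so searching <a> tags finds
# nothing and indexing the result with [0] would raise IndexError.
links_with_class = soup.find_all("a", class_="nav-item active")

active_items = soup.find_all("li", class_="nav-item active")
labels = [full_label(li) for li in active_items]
print(labels)
```

Using `or` instead of comparing against `''` also covers the case where the title attribute is absent, because `get('title')` then returns None.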

Regarding "python - Getting the title inside a link tag in HTML using BeautifulSoup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44964419/
