
python - Collecting information via web scraping


I am trying to collect politicians' names by scraping Wikipedia. What I need is to scrape all the parties from this page: https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito, and then, for each party listed there, scrape the names of all the politicians belonging to that party.

I wrote the following code:

from bs4 import BeautifulSoup as bs
import requests

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito")
soup = bs(res.text, "html.parser")
array1 = {}
possible_links = soup.find_all('a')
for link in possible_links:
    url = link.get("href", "")
    if "/wiki/Provenienza" in url:  # incomplete: links containing "Politici di/dei" should be scraped as well
        res1 = requests.get("https://it.wikipedia.org" + url)
        print("https://it.wikipedia.org" + url)
        soup = bs(res1.text, "html.parser")  # parse the response text, not the Response object itself
        possible_links1 = soup.find_all('a')
        for link in possible_links1:
            url_1 = link.get("href", "")
            array1[link.text.strip()] = url_1

But it does not work as intended: it collects all the parties from the Wikipedia page I mentioned above, but when I try to scrape each party's page, it does not collect the names of the politicians within that party.

I hope you can help me.

Best Answer

You can collect the URLs and party names from the first page, then loop over those URLs and add each party's list of politician names to a dictionary keyed by the party name. You can make this more efficient by using a Session object, which reuses the underlying TCP connection.

from bs4 import BeautifulSoup as bs
import requests

results = {}

with requests.Session() as s:  # use a session object for efficiency of tcp re-use
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito')
    soup = bs(r.content, 'lxml')
    party_info = {i.text: 'https://it.wikipedia.org/' + i['href'] for i in soup.select('.CategoryTreeItem a')}  # dict of party names and party links

    for party, link in party_info.items():
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        results[party] = [i.text for i in soup.select('.mw-content-ltr .mw-content-ltr a')]  # get politicians' names
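
Once the loop finishes, results maps each party name to a list of politician names. As a quick sanity check you could print a small sample per party and persist the dictionary to disk; this is a minimal sketch, and the output filename politici_per_partito.json is only illustrative:

import json

# Show how many names were collected per party, plus a small sample of each
for party, names in results.items():
    print(f"{party}: {len(names)} names, e.g. {names[:3]}")

# Save the whole mapping for later use (the filename is just an example)
with open("politici_per_partito.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)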

Regarding python - collecting information via web scraping, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60905643/
