python-3.x - How do I build a basic web crawler for Wikipedia pages to collect links?

Reposted by: 行者123 Updated: 2023-12-03 09:52:36

I have been watching Bucky Roberts's Python videos for beginners, and I am trying to build a basic web crawler for Wikipedia pages using code similar to what the videos show.

import requests
from bs4 import BeautifulSoup

def main_page_spider(max_pages):
    page_list = {1: "Contents",
                 2: "Overview",
                 3: "Outlines",
                 4: "Lists",
                 5: "Portals",
                 6: "Glossaries",
                 7: "Categories",
                 8: "Indices",
                 9: "Reference",
                 10: "Culture",
                 11: "Geography",
                 12: "Health",
                 13: "History",
                 14: "Mathematics",
                 15: "Nature",
                 16: "People",
                 17: "Philosophy",
                 18: "Religion",
                 19: "Society",
                 20: "Technology"}
    for page in range(1,max_pages+1):
        if page == 1:
            url = "https://en.wikipedia.org/wiki/Portal:Contents"
        else:
            url = "https://en.wikipedia.org/wiki/Portal:Contents/" + str(page_list[page])
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        divs = soup.find('div', {'class': "mw-body-content", 'id': "bodyContent"})

        for link in divs.findAll('a'):
            href = "https://en.wikipedia.org" + str(link.get("href"))
            get_link_data(href)
            print(href)

def get_link_data(link_url):
    source_code = requests.get(link_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    divs = soup.find('div',{'class': "mw-body-content", 'id': "bodyContent"})
    for link in divs.findAll('a'):
        link_href_data = link.get("href")
        print(link_href_data)

main_page_spider(3)

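The problem is that when I comment out the call to get_link_data(), the program runs fine and I get all the links from the number of pages I specified.
When I uncomment it, however, the program collects only a few links and then fails with errors such as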

socket.gaierror, urllib3.exceptions.NewConnectionError, urllib3.exceptions.MaxRetryError, requests.exceptions.ConnectionError

How can I fix this?

Best answer

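Any time you scrape, you should introduce a delay so that you do not overwhelm the site's resources, or your own. Running your script with the get_link_data line commented out, as you describe, produces 2763 lines of output. With the call enabled, that is 2763 URLs you will crawl as fast as you can. This will usually trigger errors, whether from the site throttling you or from your own network or DNS server getting clogged.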

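Add a delay before each call to get_link_data. I suggest at least one second. The crawl will take a while, but remember that you are collecting data from a freely available resource. Do not abuse it.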
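The delay can be kept in one place with a small wrapper. This is a minimal sketch, not part of the original answer: the name crawl_politely and its fetch and delay_seconds parameters are hypothetical, and fetch stands in for a function such as get_link_data.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    crawl_politely and its parameters are illustrative names only.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With the default of one second, crawling a few thousand URLs takes the better part of an hour, which is the trade-off the answer describes.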

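You should also be more selective about the links you follow. Of the 2763 URLs in the output, only 2291 are unique, so you would crawl almost 500 pages twice. Keep track of the URLs you have already processed and do not request them again.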
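One way to track processed URLs is a set that is checked before each request. The sketch below assumes the URLs are plain strings; the function name unique_urls is hypothetical.

```python
def unique_urls(urls):
    """Return the URLs with duplicates removed, keeping first-seen order."""
    seen = set()      # URLs already encountered
    ordered = []      # unique URLs in the order they first appeared
    for url in urls:
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered
```

In a crawler the same seen set could instead live at module level or be passed around, so that get_link_data is only called for URLs that are not yet in it.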

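You can refine this further: about 100 of the URLs contain a fragment (the part after the #). When crawling like this, fragments should generally be ignored, because they usually only tell the browser where to focus on the page. If you remove the # and everything after it from each URL, you are left with 2189 unique pages.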
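The standard library can drop the fragment for you. This sketch uses urllib.parse.urldefrag; the wrapper name strip_fragment is hypothetical.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Return the URL without its '#fragment' part, if it has one."""
    # urldefrag returns a named tuple with .url and .fragment fields.
    return urldefrag(url).url
```

Applying this before the duplicate check means that two links to different sections of the same page are treated as one page.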

Some of the links you construct are also malformed. They look like this:

https://en.wikipedia.org//en.wikipedia.org/w/index.php?title=Portal:Contents/Outlines/Society_and_social_sciences&action=edit

You will probably want to fix these, and perhaps skip "edit" links entirely.
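The malformed URL above comes from prepending the site prefix to an href that already starts with //en.wikipedia.org. A possible fix, sketched here under the assumption that every href should be resolved against https://en.wikipedia.org, is to use urllib.parse.urljoin instead of string concatenation and to drop links whose query string contains action=edit. The names BASE_URL and normalize_href are hypothetical.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

BASE_URL = "https://en.wikipedia.org"  # assumed base for relative links


def normalize_href(href):
    """Return an absolute URL for href, or None if the link should be skipped."""
    if not href:
        # <a> tags without an href attribute give None from link.get("href").
        return None
    # urljoin handles "/wiki/..." paths and "//host/..." links alike.
    absolute = urljoin(BASE_URL, href)
    query = parse_qs(urlsplit(absolute).query)
    if "edit" in query.get("action", []):
        # Skip edit links; they are not content pages.
        return None
    return absolute
```

Checking for None before calling get_link_data also avoids requesting URLs such as https://en.wikipedia.orgNone, which the str(link.get("href")) concatenation would produce for anchors without an href.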

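Finally, even if you do all of these things, you may still run into some exceptions. The internet is a messy place :) So you need to include error handling. Something along these lines: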

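import time  # required for time.sleep below

for link in divs.findAll('a'):
    href = "https://en.wikipedia.org" + str(link.get("href"))
    time.sleep(1)  # wait one second before each request
    try:
        get_link_data(href)
    except Exception as e:
        # Report the failure and continue with the next link.
        print("Failed to get url {}\nError: {}".format(href, e.__class__.__name__))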

A similar question about python-3.x - How do I build a basic web crawler for Wikipedia pages to collect links? can be found on Stack Overflow: https://stackoverflow.com/questions/51445418/
