
python - How to recursively find all links on a webpage with beautifulsoup?


I have been trying to recursively find all the links for a given URL using some code I found in this answer:

import urllib2
from bs4 import BeautifulSoup

url = "http://francaisauthentique.libsyn.com/"

def recursiveUrl(url, depth):
    if depth == 5:
        return url
    else:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a') # find just the first one
        if len(newlink) == 0:
            return url
        else:
            return url, recursiveUrl(newlink, depth+1)


def getLinks(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a')
    for link in links:
        links.append(recursiveUrl(link, 0))
    return links

links = getLinks(url)
print(links)

In addition to the warning:

/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 28 of the file downloader.py. To get rid of this warning, change code that looks like this:

BeautifulSoup(YOUR_MARKUP})

to this:

BeautifulSoup(YOUR_MARKUP, "lxml")

I am getting the following error:

Traceback (most recent call last):
  File "downloader.py", line 28, in <module>
    links = getLinks(url)
  File "downloader.py", line 25, in getLinks
    links.append(recursiveUrl(link,0))
  File "downloader.py", line 11, in recursiveUrl
    page=urllib2.urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 396, in open
    protocol = req.get_type()
TypeError: 'NoneType' object is not callable

What is the problem?

Best Answer

Your recursiveUrl tries to open an invalid url such as /webpage/category/general, which is a value extracted from one of the href attributes.

You should append the extracted href value to the site's url and then try to open that page. You will also need to work on the recursion algorithm, since I do not know what you are trying to achieve.
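
A side note: rather than plain string concatenation, urljoin from the standard library resolves both relative and absolute hrefs against a base url. A minimal sketch (Python 3 shown; on Python 2 the same function lives in the urlparse module):

from urllib.parse import urljoin

base = "http://francaisauthentique.libsyn.com/"
# A relative href is resolved against the base url:
print(urljoin(base, "/webpage/category/general"))
# http://francaisauthentique.libsyn.com/webpage/category/general
# An absolute href is returned unchanged:
print(urljoin(base, "http://example.com/feed"))
# http://example.com/feed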

Code:

import requests
from bs4 import BeautifulSoup

def recursiveUrl(url, link, depth):
    if depth == 5:
        return url
    else:
        print(link['href'])
        # Append the extracted href to the site's url before opening it.
        page = requests.get(url + link['href'])
        soup = BeautifulSoup(page.text, 'html.parser')
        newlink = soup.find('a')  # just the first <a> on the page
        if newlink is None:
            return link
        else:
            return link, recursiveUrl(url, newlink, depth + 1)

def getLinks(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.find_all('a')
    # Collect results in a separate list; appending to links while
    # iterating over it would make the loop grow without end.
    results = []
    for link in links:
        results.append(recursiveUrl(url, link, 0))
    return results

links = getLinks("http://francaisauthentique.libsyn.com/")
print(links)

Output:

http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/10
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/09
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/08
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/07
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
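
The repeated /webpage/category/general lines above show why a real crawl needs a visited set, so that each page is fetched only once. A minimal sketch of that idea, assuming Python 3 (the crawl function name, the depth limit of 2, and the http-scheme check are illustrative choices, not part of the original answer):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(url, depth=0, max_depth=2, visited=None):
    # Remember already-visited urls so each page is fetched only once.
    if visited is None:
        visited = set()
    if depth > max_depth or url in visited:
        return visited
    visited.add(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for a in soup.find_all('a', href=True):
        # Resolve the href against the current page's url before recursing,
        # and skip non-http links such as mailto: or javascript: hrefs.
        next_url = urljoin(url, a['href'])
        if next_url.startswith('http'):
            crawl(next_url, depth + 1, max_depth, visited)
    return visited

for u in sorted(crawl("http://francaisauthentique.libsyn.com/")):
    print(u)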

Regarding python - How to recursively find all links on a webpage with beautifulsoup?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46629681/
