
python - Extracting all links from a web page with Python

Reposted. Author: 太空宇宙. Last updated: 2023-11-04 10:17:13

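After following Udacity's Intro to Computer Science class, I am trying to write a Python script that extracts the links from a page.

Running it produces the following error:

NameError: name 'page' is not defined

The code is as follows: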

def get_page(page):
    try:
        import urllib
        return urllib.urlopen(url).read()
    except:
        return ''

start_link = page.find('<a href=')
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
url = page[start_quote + 1:end_quote]

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return (None, 0)
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return (url, end_quote)

(url, end_pos) = get_next_target(page)

page = page[end_pos:]

def print_all_links(page):
    while True:
        (url, end_pos) = get_next_target(page)
        if url:
            print(url)
            page = page[:end_pos]
        else:
            break

print_all_links(get_page("http://xkcd.com/"))

Best answer

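page is not defined at the point where it is first used, and that is the cause of the error. Inside get_page, get_next_target and print_all_links, page is only a parameter name. The statements written outside those functions, starting with start_link = page.find('<a href='), run as soon as the script is loaded, before any variable called page exists.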
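To make the diagnosis concrete, here is a minimal sketch of how the question's script might look once the stray module-level statements are removed. It keeps the question's function names where possible. The name get_all_links and the sample string are illustrative additions, not part of the original post, and the sketch assumes Python 3, where urlopen lives in urllib.request rather than urllib. It also fixes two further problems visible in the posted code: get_page takes a parameter called page but reads url, and the loop slices page[:end_pos] (the text before the link) instead of page[end_pos:] (the text after it).

```python
from urllib.request import urlopen


def get_page(url):
    """Return the page at url as text, or '' if the download fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; ValueError covers malformed URLs
        return ""


def get_next_target(page):
    """Return (url, end_pos) for the first '<a href=' in page, or (None, 0)."""
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    """Collect every link target found by get_next_target, in page order."""
    links = []
    while True:
        url, end_pos = get_next_target(page)
        if url is None:
            break
        links.append(url)
        page = page[end_pos:]  # continue with the text AFTER the link just found
    return links


# Illustrative input so the sketch runs without network access
sample = '<html><a href="http://a.example/">A</a> text <a href="/about">B</a></html>'
for link in get_all_links(sample):
    print(link)
```

To run it against a live page, as the question intended, pass the downloaded text instead of the sample: get_all_links(get_page("http://xkcd.com/")). Like the original, this string search only recognises links written exactly as <a href="..."> with double quotes, which is one reason the answer goes on to recommend an HTML parser.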

For web scraping like this, you can simply use BeautifulSoup:

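from bs4 import BeautifulSoup
import requests

url = "http://stackoverflow.com/"

# Download the page and parse it; naming the parser explicitly
# avoids BeautifulSoup's "no parser was explicitly specified" warning
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")

# Print the href attribute of every <a> tag (None if the tag has no href)
for link in soup.find_all('a'):
    print(link.get('href'))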
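The loop above prints each href exactly as it appears in the page, so relative links such as /questions come out unresolved. If installing third-party packages is not an option, roughly the same extraction can be sketched with the standard library alone. This is an alternative to the approach in the answer, not a replacement for it. The names LinkCollector and extract_links are hypothetical and chosen only for this example; html.parser.HTMLParser and urllib.parse.urljoin are the standard-library pieces it relies on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href=...> links in html."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Illustrative input; in practice html would be the downloaded page text
html = '<a href="/questions">Q</a> <a name="top">no href</a>'
print(extract_links(html, "http://stackoverflow.com/"))
```

Unlike the string search in the question, a parser of this kind does not depend on the attribute order or quoting style of each tag, and anchors without an href are skipped rather than reported as None.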

This question about extracting all links from a web page with Python is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/34610162/
