
python - Crawling internal links with Beautiful Soup

Reprinted. Author: 太空宇宙. Updated: 2023-11-03 13:01:33

I have written Python code that fetches the web page corresponding to a given URL and parses all the links on that page into a repository of links. Next, it fetches the content of any URL from the repository it just created, parses the links from this new content into the repository, and continues this process for every link in the repository until it is stopped or a given number of links have been fetched.

Here is the code:

import BeautifulSoup
import urllib2
import itertools
import random


class Crawler(object):
    """docstring for Crawler"""

    def __init__(self):

        self.soup = None                                # Beautiful Soup object
        self.current_page = "http://www.python.org/"   # Current page's address
        self.links = set()                              # Queue with every links fetched
        self.visited_links = set()

        self.counter = 0                                # Simple counter for debug purpose

    def open(self):

        # Open url
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every links
        self.soup = BeautifulSoup.BeautifulSoup(html_code)

        page_links = []
        try:
            page_links = itertools.ifilter(  # Only deal with absolute links
                lambda href: 'http://' in href,
                (a.get('href') for a in self.soup.findAll('a')))
        except Exception:  # Magnificent exception handling
            pass

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random url from non-visited set
        self.current_page = random.sample(self.links.difference(self.visited_links), 1)[0]
        self.counter += 1

    def run(self):

        # Crawl 3 webpages (or stop if all url has been fetched)
        while len(self.visited_links) < 3 or (self.visited_links == self.links):
            self.open()

        for link in self.links:
            print link


if __name__ == '__main__':

    C = Crawler()
    C.run()

This code does not fetch internal links (it only fetches hyperlinks in absolute form).

How can I fetch the internal links that start with "/" or "#" or "."?

Best Answer

Well, your code already tells you what is going on. In your lambda you only grab absolute links that start with http:// (and you are not grabbing https, FWIW). You should grab all of the links and check whether or not they start with http. If they do not, they are relative links, and since you know what current_page is, you can use it to build an absolute link.
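As an aside, the standard library already covers this resolution logic: urlparse.urljoin joins a base URL and an href, handling absolute links, "/"-rooted paths, "#" fragments, and "."-style relative paths uniformly. A minimal sketch (the base URL below is just an illustration, not taken from the crawler):

from urlparse import urljoin

base = "http://www.python.org/about/"
print urljoin(base, "http://docs.python.org/")   # absolute links pass through unchanged
print urljoin(base, "/download/")                # -> http://www.python.org/download/
print urljoin(base, "#content")                  # -> http://www.python.org/about/#content
print urljoin(base, "./help/")                   # -> http://www.python.org/about/help/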

Here is a modification of your code. Please excuse my Python, as it is a little rusty, but I ran it and it worked for me in Python 2.7. You will want to clean it up and add some edge-case/error handling, but it gives you the gist:

#!/usr/bin/python

from bs4 import BeautifulSoup
import urllib2
import itertools
import random
import urlparse


class Crawler(object):
    """docstring for Crawler"""

    def __init__(self):
        self.soup = None                                # Beautiful Soup object
        self.current_page = "http://www.python.org/"   # Current page's address
        self.links = set()                              # Queue with every links fetched
        self.visited_links = set()

        self.counter = 0                                # Simple counter for debug purpose

    def open(self):

        # Open url
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every links
        self.soup = BeautifulSoup(html_code)

        page_links = []
        try:
            for link in [h.get('href') for h in self.soup.find_all('a')]:
                print "Found link: '" + link + "'"
                if link.startswith('http'):
                    page_links.append(link)
                    print "Adding link" + link + "\n"
                elif link.startswith('/'):
                    parts = urlparse.urlparse(self.current_page)
                    page_links.append(parts.scheme + '://' + parts.netloc + link)
                    print "Adding link " + parts.scheme + '://' + parts.netloc + link + "\n"
                else:
                    page_links.append(self.current_page + link)
                    print "Adding link " + self.current_page + link + "\n"

        except Exception, ex:  # Magnificent exception handling
            print ex

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random url from non-visited set
        self.current_page = random.sample(self.links.difference(self.visited_links), 1)[0]
        self.counter += 1

    def run(self):

        # Crawl 3 webpages (or stop if all url has been fetched)
        while len(self.visited_links) < 3 or (self.visited_links == self.links):
            self.open()

        for link in self.links:
            print link


if __name__ == '__main__':
    C = Crawler()
    C.run()
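A side note on the environment: this code targets Python 2 (urllib2, urlparse, print statements). Under Python 3 the same pieces live in urllib.request and urllib.parse, and urljoin covers all three link forms from the question. A rough, untested sketch of just the link-collecting step under that assumption:

from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

current_page = "http://www.python.org/"              # same starting URL as above
soup = BeautifulSoup(urlopen(current_page).read(), "html.parser")
page_links = [urljoin(current_page, a.get('href'))   # resolve relative hrefs against the page URL
              for a in soup.find_all('a')
              if a.get('href')]                      # skip anchors without an href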

Regarding python - Crawling internal links with Beautiful Soup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19168220/
