
Python web scraper: counting identical links that have different texts

Reposted. Author: 太空宇宙. Updated: 2023-11-03 18:51:10

So I made a web scraper with Python and a few of its libraries. It goes to a given site and collects all the links, together with their texts, from that site. I have already filtered the results so that only external links from that site are printed.

The code looks like this:

import urllib
import re
import mechanize
from bs4 import BeautifulSoup
import urlparse
import cookielib
from urlparse import urlsplit
from publicsuffix import PublicSuffixList

link = "http://www.ananda-pur.de/23.html"

newesturlDict = {}
baseAdrInsArray = []

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(link, timeout=10)

for linkins in br.links():
    newesturl = urlparse.urljoin(linkins.base_url, linkins.url)

    linkTxt = linkins.text
    baseAdrIns = linkins.base_url

    if baseAdrIns not in baseAdrInsArray:
        baseAdrInsArray.append(baseAdrIns)

    netLocation = urlsplit(baseAdrIns)
    psl = PublicSuffixList()
    publicAddress = psl.get_public_suffix(netLocation.netloc)

    if publicAddress not in newesturl:
        if newesturl not in newesturlDict:
            newesturlDict[newesturl, linkTxt] = 1
        if newesturl in newesturlDict:
            newesturlDict[newesturl, linkTxt] += 1

newesturlCount = sorted(newesturlDict.items(), key=lambda (k, v): (v, k), reverse=True)
for newesturlC in newesturlCount:
    print baseAdrInsArray[0], " - ", newesturlC[0], "- count: ", newesturlC[1]

It prints results like this:

http://www.ananda-pur.de/23.html  -  ('http://www.yogibhajan.com/',  'http://www.yogibhajan.com') - count:  1
http://www.ananda-pur.de/23.html - ('http://www.kundalini-yoga-zentrum-berlin.de/', 'http://www.kundalini-yoga-zentrum-berlin.de') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.sat-nam-rasayan.de') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.kriteachings.org') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.gurudevsnr.com') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.3ho.de') - count: 1

My problem is identical links that have different texts. According to the printed example, the given site has 4 links to http://www.kriteachings.org/, but as you can see each of these 4 links has a different text: the 1st is http://www.sat-nam-rasayan.de, the 2nd is http://www.kriteachings.org, the 3rd is http://www.gurudevsnr.com, and the 4th is http://www.3ho.de.

I would like to get printed results where I can see how many times a link appears on the given page, and if there are different link texts, they are simply appended to the other texts of the same link. To illustrate, I'd like to get a printout like this:

http://www.ananda-pur.de/23.html  -  http://www.yogibhajan.com/ - http://www.yogibhajan.com - count:  1
http://www.ananda-pur.de/23.html - http://www.kundalini-yoga-zentrum-berlin.de - http://www.kundalini-yoga-zentrum-berlin.de - count: 1
http://www.ananda-pur.de/23.html - http://www.kriteachings.org/ - http://www.sat-nam-rasayan.de, http://www.kriteachings.org, http://www.gurudevsnr.com, http://www.3ho.de - count: 4

Explanation:

(the first link is the given page, the second is the found link, the third is actually the text of that found link, and the 4th item is how many times that link appears on the given site)

My main problem is that I don't know how to compare or sort, or how to tell the program that this is the same link and the different texts should be appended.

Is something like this possible without too much code? I'm a Python newbie, so I'm a bit lost...

Any help or advice is welcome.

Best answer

Collect the links into a dictionary, gathering the link texts and keeping the counts:

import cookielib

import mechanize


base_url = "http://www.ananda-pur.de/23.html"

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent',
                  'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(base_url, timeout=10)

links = {}
for link in br.links():
    if link.url not in links:
        links[link.url] = {'count': 1, 'texts': [link.text]}
    else:
        links[link.url]['count'] += 1
        links[link.url]['texts'].append(link.text)

# printing
for link, data in links.iteritems():
    print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])

This prints:

http://www.ananda-pur.de/23.html - index.html - Zadekstr 11,12351 Berlin, - 2
http://www.ananda-pur.de/23.html - 28.html - Das Team - 1
http://www.ananda-pur.de/23.html - http://www.yogibhajan.com/ - http://www.yogibhajan.com - 1
http://www.ananda-pur.de/23.html - 24.html - Kontakt - 1
http://www.ananda-pur.de/23.html - 25.html - Impressum - 1
http://www.ananda-pur.de/23.html - http://www.kriteachings.org/ - http://www.kriteachings.org,http://www.gurudevsnr.com,http://www.sat-nam-rasayan.de,http://www.3ho.de - 4
http://www.ananda-pur.de/23.html - http://www.kundalini-yoga-zentrum-berlin.de/ - http://www.kundalini-yoga-zentrum-berlin.de - 1
http://www.ananda-pur.de/23.html - 3.html - Ergo Oranien 155 - 1
http://www.ananda-pur.de/23.html - 2.html - Physio Bänsch 36 - 1
http://www.ananda-pur.de/23.html - 13.html - Stellenangebote - 1
http://www.ananda-pur.de/23.html - 23.html - Links - 1
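The answer's code is Python 2 (`print` statement, `iteritems`, `cookielib`). The grouping step itself does not depend on mechanize; here is a minimal Python 3 sketch of the same idea, using a hypothetical list of (url, text) pairs in place of `br.links()` and `collections.defaultdict` to avoid the explicit "key exists?" branch:

```python
from collections import defaultdict

def group_links(link_pairs):
    """Group (url, text) pairs: count how often each URL occurs and
    collect every distinct anchor text seen for it."""
    links = defaultdict(lambda: {'count': 0, 'texts': []})
    for url, text in link_pairs:
        links[url]['count'] += 1
        if text not in links[url]['texts']:
            links[url]['texts'].append(text)
    return dict(links)

# Hypothetical sample data mirroring the question's page
pairs = [
    ('http://www.kriteachings.org/', 'http://www.sat-nam-rasayan.de'),
    ('http://www.kriteachings.org/', 'http://www.kriteachings.org'),
    ('http://www.kriteachings.org/', 'http://www.gurudevsnr.com'),
    ('http://www.kriteachings.org/', 'http://www.3ho.de'),
    ('http://www.yogibhajan.com/', 'http://www.yogibhajan.com'),
]

for url, data in group_links(pairs).items():
    print('%s - %s - count: %d' % (url, ', '.join(data['texts']), data['count']))
```

The key design point is the same as in the answer: the URL alone is the dictionary key (not a (url, text) tuple, as in the question's code), so repeated occurrences land on one entry and their texts accumulate in a list.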

Regarding this Python web scraper question on counting identical links with different texts, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18351916/
