
python - Scrapy or BeautifulSoup to scrape links and text from various websites

Reposted · Author: 太空宇宙 · Updated: 2023-11-04 05:18:04

I'm trying to scrape links from a user-entered URL, but my code only works for one URL (http://www.businessinsider.com). How can I adapt it to scrape from any URL that is entered? I'm using BeautifulSoup, but would Scrapy be a better fit for this?

import urllib.request
from bs4 import BeautifulSoup

def WebScrape():
    linktoenter = input('Where do you want to scrape from today?: ')
    url = linktoenter
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")

    if linktoenter in url:
        print('Retrieving your links...')
        links = {}
        n = 0
        link_title = soup.findAll('a', {'class': 'title'})
        n += 1
        links[n] = link_title
        for eachtitle in link_title:
            print(eachtitle['href'] + "," + eachtitle.string)
    else:
        print('Please enter another Website...')

Best Answer

You can build a more general scraper that searches all tags and all links within those tags. Once you have a list of every link, you can use a regular expression (or similar) to pick out the ones that match the structure you want.

import requests
from bs4 import BeautifulSoup
import re

response = requests.get('http://www.businessinsider.com')

soup = BeautifulSoup(response.content, 'lxml')

# find all tags
tags = soup.find_all()

links = []

# iterate over all tags and extract links
for tag in tags:
    # find all href links
    tmp = tag.find_all(href=True)
    # append each link to the master links list
    # (in Python 3, map() is lazy, so a plain loop is used instead)
    for x in tmp:
        if x['href']:
            links.append(x['href'])

# example: keep only careerbuilder links
# (filter() is also lazy in Python 3, so wrap it in list() to materialize it)
careerbuilder_links = list(filter(lambda x: re.search(r'[w]{3}\.careerbuilder\.com', x), links))
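As a side note, the same result can be had in one call: since every link is a tag carrying an `href` attribute, calling `find_all(href=True)` on the soup itself collects all links without iterating over every tag first. A minimal self-contained sketch (the HTML snippet below is made up for illustration, standing in for a fetched page):

```python
from bs4 import BeautifulSoup
import re

# A small, made-up HTML snippet standing in for a downloaded page.
html = """
<html><body>
  <a href="http://www.careerbuilder.com/jobs/python">Python jobs</a>
  <a href="http://www.businessinsider.com/tech">Tech news</a>
  <div><a href="http://www.careerbuilder.com/jobs/scrapy">Scrapy jobs</a></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all(href=True) on the whole soup collects every href in one pass
links = [tag["href"] for tag in soup.find_all(href=True)]

# filter with a regex, same idea as above
cb_links = [link for link in links if re.search(r"www\.careerbuilder\.com", link)]

print(cb_links)
```

This avoids the nested `find_all` loop entirely and returns the links in document order.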

Regarding "python - Scrapy or BeautifulSoup to scrape links and text from various websites", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41202526/
