python - Return only specific URLs in Scrapy

Reposted. Author: 行者123. Updated: 2023-11-30 22:39:46
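I am using Scrapy to scrape URLs from a website. At the moment it returns every URL, but I want it to return only the URLs that contain the word "download". How can I do that?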

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
import scrapy

DOMAIN = 'somedomain.com'
URL = 'http://' +str(DOMAIN)

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url= URL + url
            print url
            yield Request(url, callback=self.parse)

Edit:

I implemented the suggestion below. It still throws some errors, but at least it now returns only the links that contain "download".

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
import scrapy
from scrapy.linkextractors import LinkExtractor


DOMAIN = 'somedomain.com'
URL = 'http://' +str(DOMAIN)

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    # First parse returns all the links of the website and feeds them to parse2

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url= URL + url
            yield Request(url, callback=self.parse2)

    # Second parse selects only the links that contain download

    def parse2(self, response):
        le = LinkExtractor(allow=("download"))
        for link in le.extract_links(response):
            yield Request(url=link.url, callback=self.parse2)
            print link.url

Best answer

A cleaner, more Pythonic solution is to use LinkExtractor:

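from scrapy.linkextractors import LinkExtractor

...

# allow= keeps only the links whose URL matches the pattern;
# deny= would do the opposite and drop them.
le = LinkExtractor(allow="download")
for link in le.extract_links(response):
    yield Request(url=link.url, callback=self.parse)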

Regarding "python - Return only specific URLs in Scrapy", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43051411/
