
Python Scrapy keeps scraping the same elements over and over

Reposted. Author: 行者123. Updated: 2023-12-01 03:08:54

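I'm trying to learn Scrapy and am practising on the Yelp website at this LINK. When Scrapy runs, however, it scrapes the same phone number and address over and over instead of moving on to different entries. The selector I use matches every "li" tag of a specific class, one per restaurant on the page. Each li tag holds the information for one restaurant, which I extract with the appropriate selectors, but Scrapy only gives me the results for 2 or 3 restaurants, repeated. For some reason Scrapy keeps using the same entries again and again, when it should move past them once they have been handled in the for loop. Here is the code: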

    try:
        import scrapy
        from urlparse import urljoin
    except ImportError:
        print "\nERROR IMPORTING THE NESSASARY LIBRARIES\n"

    #scrapy.optional_features.remove('boto')

    url = raw_input('ENTER THE SITE URL : ')

    class YelpSpider(scrapy.Spider):
        name = 'yelp spider'
        start_urls = [url]

        def parse(self, response):
            SET_SELECTOR = '.regular-search-result'

            # Going over each li tag of this class; each one contains a restaurant

            for yelp in response.css(SET_SELECTOR):

                # Selector for the link to the restaurant's own page, used to scrape the website info
                selector = '.indexed-biz-name a ::attr(href)'

                # Building the complete URL by joining the extracted part
                momo = urljoin(response.url, yelp.css(selector).extract_first())

                # All the selectors
                name = '.indexed-biz-name a span ::text'
                services = '.category-str-list a ::text'
                address1 = '.neighborhood-str-list ::text'
                address2 = 'address ::text'
                phone = '.biz-phone ::text'

                # Extracting them and adding them to a dict
                try:
                    add1 = response.css(address1).extract_first().replace('\n','').replace('\n','')
                    add2 = response.css(address2).extract_first().replace('\n','').replace('\n','')
                    ADDRESS = add1 + ' ' + add2

                    pookiebanana = {

                        "PHONE": response.css(phone).extract_first().replace('\n','').replace('\t',''),
                        "NAME": response.css(name).extract_first().replace('\n','').replace('\t',''),
                        "SERVICES": response.css(services).extract_first().replace('\n','').replace('\t',''),
                        "ADDRESS": ADDRESS,
                    }
                except:
                    pass

                # Opening another page, passing the old dict along
                Post = scrapy.Request(momo, callback=self.parse_yelp, meta={'item': pookiebanana})

                # Yielding the dict with the website scraped
                yield Post

            # Following the next button and calling the same function again on the next page
            NEXT_PAGE_SELECTOR = '.u-decoration-none.next.pagination-links_anchor ::attr(href)'
            next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
            if next_page:
                yield scrapy.Request(
                    response.urljoin(next_page),
                    callback=self.parse
                )

        def parse_yelp(self, response):
            # Website selector for the new page opened from the link we extracted
            WEBSITE_SELECTOR = '.biz-website.js-add-url-tagging a ::text'

            item = response.meta['item']

            # Inside the try block, extracting the website info and returning the modified dict
            try:
                item['WEBSITE'] = ' '.join(response.css(WEBSITE_SELECTOR).extract_first().split(' '))
            except:
                pass
            return item

I have commented the code extensively to show what I am doing where. What am I doing wrong?

Here is a screenshot of the output CSV, showing the repeated rows: PICTURE

And here is the Scrapy crawl output, where you can see it scraping the same content over and over: PIC. What is happening, and what am I doing wrong?

Best answer

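I can't test it, but inside the `for yelp` loop you should use yelp.css(), whereas you use response.css(). A query on response searches the whole page, so extract_first() returns the first match on the page in every iteration, while a query on yelp is limited to the current li element.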
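
The scoping difference can be reproduced without Scrapy. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, and the markup, phone numbers, and function names are hypothetical and chosen purely for illustration. Searching from the document root inside the loop plays the role of response.css(), and searching from the current element plays the role of yelp.css().

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup standing in for a results page.
PAGE = """
<ul>
  <li class="regular-search-result"><span class="biz-phone">111-1111</span></li>
  <li class="regular-search-result"><span class="biz-phone">222-2222</span></li>
  <li class="regular-search-result"><span class="biz-phone">333-3333</span></li>
</ul>
""".strip()

ITEM = ".//li[@class='regular-search-result']"
PHONE = ".//span[@class='biz-phone']"

root = ET.fromstring(PAGE)


def phones_from_whole_page(root):
    """Pattern from the question: query the whole document on each iteration."""
    # findtext() returns the first match under `root`, so every
    # iteration sees the first listing's phone number.
    return [root.findtext(PHONE) for _ in root.findall(ITEM)]


def phones_from_each_item(root):
    """Pattern from the answer: query relative to the current item."""
    # Searching from `li` only looks inside that one listing.
    return [li.findtext(PHONE) for li in root.findall(ITEM)]


print(phones_from_whole_page(root))
print(phones_from_each_item(root))
```

In the spider from the question, the corresponding change would be to call yelp.css(phone), yelp.css(name), yelp.css(services), yelp.css(address1) and yelp.css(address2) inside the loop instead of response.css(...).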

Regarding "Python Scrapy keeps scraping the same elements over and over", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43091965/
