
javascript - Follow each link of a page and scrape the content, Scrapy + Selenium


This is the website I am working on. Each page has a table with 18 posts. I want to visit each post, scrape its content, and repeat this for the first 5 pages.

My plan is to have my spider collect all the links across the 5 pages and then iterate over them to fetch the content. Because the "next page" button and some text in each post are generated by JavaScript, I use Selenium together with Scrapy. When I run the spider I can see the Firefox webdriver step through the first 5 pages, but then the spider stops without scraping any content, and Scrapy reports no error messages.

I now suspect the failure is caused by one of the following:

1) No links are being stored in all_links.

2) parse_content is somehow never being run.

My diagnosis may well be wrong, and I need help tracking down the problem. Many thanks!

Here is my spider:

import scrapy
from bjdaxing.items_bjdaxing import BjdaxingItem
from selenium import webdriver
from scrapy.http import TextResponse
import time

all_links = [] # a global variable to store post links


class Bjdaxing(scrapy.Spider):
    name = "daxing"

    allowed_domains = ["bjdx.gov.cn"] # DO NOT use www in allowed domains
    start_urls = ["http://app.bjdx.gov.cn/cms/daxing/lookliuyan_bjdx.jsp"] # This has to start with http

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url) # request the start url in the browser

        i = 1

        while i <= 5: # The number of pages to be scraped in this session

            response = TextResponse(url = response.url, body = self.driver.page_source, encoding='utf-8') # Assign page source to response. I can treat response as if it's a normal scrapy project.

            global all_links
            all_links.extend(response.xpath("//a/@href").extract()[0:18])

            next = self.driver.find_element_by_xpath(u'//a[text()="\u4e0b\u9875\xa0"]') # locate "next" button
            next.click() # Click next page
            time.sleep(2) # Wait a few seconds for next page to load.

            i += 1


    def parse_content(self, response):
        item = BjdaxingItem()
        global all_links
        for link in all_links:
            self.driver.get("http://app.bjdx.gov.cn/cms/daxing/") + link

            response = TextResponse(url = response.url, body = self.driver.page_source, encoding = 'utf-8')

            if len(response.xpath("//table/tbody/tr[1]/td[2]/text()").extract() > 0):
                item['title'] = response.xpath("//table/tbody/tr[1]/td[2]/text()").extract()
            else:
                item['title'] = ""

            if len(response.xpath("//table/tbody/tr[3]/td[2]/text()").extract() > 0):
                item['netizen'] = response.xpath("//table/tbody/tr[3]/td[2]/text()").extract()
            else:
                item['netizen'] = ""

            if len(response.xpath("//table/tbody/tr[3]/td[4]/text()").extract() > 0):
                item['sex'] = response.xpath("//table/tbody/tr[3]/td[4]/text()").extract()
            else:
                item['sex'] = ""

            if len(response.xpath("//table/tbody/tr[5]/td[2]/text()").extract() > 0):
                item['time1'] = response.xpath("//table/tbody/tr[5]/td[2]/text()").extract()
            else:
                item['time1'] = ""

            if len(response.xpath("//table/tbody/tr[11]/td[2]/text()").extract() > 0):
                item['time2'] = response.xpath("//table/tbody/tr[11]/td[2]/text()").extract()
            else:
                item['time2'] = ""

            if len(response.xpath("//table/tbody/tr[7]/td[2]/text()").extract()) > 0:
                question = "".join(response.xpath("//table/tbody/tr[7]/td[2]/text()").extract())
                item['question'] = "".join(map(unicode.strip, question))
            else:
                item['question'] = ""

            if len(response.xpath("//table/tbody/tr[9]/td[2]/text()").extract()) > 0:
                reply = "".join(response.xpath("//table/tbody/tr[9]/td[2]/text()").extract())
                item['reply'] = "".join(map(unicode.strip, reply))
            else:
                item['reply'] = ""

            if len(response.xpath("//table/tbody/tr[13]/td[2]/text()").extract()) > 0:
                agency = "".join(response.xpath("//table/tbody/tr[13]/td[2]/text()").extract())
                item['agency'] = "".join(map(unicode.strip, agency))
            else:
                item['agency'] = ""

            yield item

Best Answer

There are several problems and possible improvements here:

  • There is no "link" at all between the parse() and parse_content() methods (see the minimal example right after this list)
  • Using global variables is usually bad practice
  • You don't need selenium here at all. To follow the pagination you just need to make POST requests to the same url, providing the currPage parameter
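
To illustrate the first point: Scrapy only calls a method when it is named as the callback of a Request that the spider yields, and parse_content() is never referenced that way in the original spider, so it never runs. Below is only a sketch of what that wiring could look like, reusing the question's link XPath and assuming the list page is reachable without JavaScript (which the approach below relies on):

def parse(self, response):
    # hand each post link back to Scrapy and name parse_content as the callback
    for href in response.xpath("//a/@href").extract()[:18]:
        yield scrapy.Request(response.urljoin(href), callback=self.parse_content)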

The idea for the full solution is to use .start_requests() and create a list/queue of requests to handle the pagination. Follow the pagination and collect the links from the table. Once the request queue is empty, switch to following the previously gathered links. Implementation:

import json
from urlparse import urljoin

import scrapy


NUM_PAGES = 5

class Bjdaxing(scrapy.Spider):
    name = "daxing"

    allowed_domains = ["bjdx.gov.cn"]  # DO NOT use www in allowed domains

    def __init__(self):
        self.pages = []
        self.links = []

    def start_requests(self):
        self.pages = [scrapy.Request("http://app.bjdx.gov.cn/cms/daxing/lookliuyan_bjdx.jsp",
                                     body=json.dumps({"currPage": str(page)}),
                                     method="POST",
                                     callback=self.parse_page,
                                     dont_filter=True)
                      for page in range(1, NUM_PAGES + 1)]

        yield self.pages.pop()

    def parse_page(self, response):
        base_url = response.url
        self.links += [urljoin(base_url, link) for link in response.css("table tr td a::attr(href)").extract()]

        try:
            yield self.pages.pop()
        except IndexError:  # no more pages to follow, going over the gathered links
            for link in self.links:
                yield scrapy.Request(link, callback=self.parse_content)

    def parse_content(self, response):
        # your parse_content method here
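
For completeness, here is one possible way to fill in the parse_content() stub, adapted directly from the XPaths in the question's spider. This is only a sketch: it assumes the detail pages are rendered server-side (as the approach above relies on) and that BjdaxingItem defines the same fields used in the question; the import would go at the top of the file alongside the others.

    # at the top of the file: from bjdaxing.items_bjdaxing import BjdaxingItem
    def parse_content(self, response):
        def cell(xpath):
            # first matching text node, stripped, or an empty string if the cell is missing
            texts = response.xpath(xpath).extract()
            return texts[0].strip() if texts else ""

        item = BjdaxingItem()
        item['title'] = cell("//table/tbody/tr[1]/td[2]/text()")
        item['netizen'] = cell("//table/tbody/tr[3]/td[2]/text()")
        item['sex'] = cell("//table/tbody/tr[3]/td[4]/text()")
        item['time1'] = cell("//table/tbody/tr[5]/td[2]/text()")
        item['question'] = cell("//table/tbody/tr[7]/td[2]/text()")
        item['reply'] = cell("//table/tbody/tr[9]/td[2]/text()")
        item['time2'] = cell("//table/tbody/tr[11]/td[2]/text()")
        item['agency'] = cell("//table/tbody/tr[13]/td[2]/text()")
        yield item

With the spider saved inside the existing bjdaxing project, it could then be run with, for example, scrapy crawl daxing -o results.json.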

This question about "javascript - Follow each link of a page and scrape the content, Scrapy + Selenium" comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/34968262/
