python - PhantomJS 不提取链接 Selenium-6ren

python - PhantomJS 不提取链接 Selenium

转载作者：行者123 更新时间：2023-12-01 04:28:15

我正在使用 Selenium 、 Scrapy 和 PhantomJS 抓取网站。代码的问题是，尽管代码完美地滚动页面，但它仅提取一定限制的链接。除此之外，它完全忽略滚动的结果。当我使用 Firefox Webdriver 时，它工作正常。由于我在服务器中运行代码，因此我使用了 PhantomJS，因此遇到了问题。下面是代码:

# -*- coding: utf-8 -*-

from scrapy.spider import BaseSpider
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import csv
import re
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait


class DukeSpider(BaseSpider):
 name = "dspider"
 allowed_domains = ["dukemedicine.org"]
 start_urls = ["http://www.dukemedicine.org/find-doctors-physicians"]  #hlor


 def __init__(self):
    self.driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true'])
    self.driver.maximize_window()
    print 'here'


 def parse(self, response):

    print 'nowhere'
    print response
    print response.url
    b = open('doc_data_duke.csv', 'a')
    a = csv.writer(b, lineterminator='\n')
    print 'a'

    self.driver.get(response.url)
    time.sleep(10)
    wait = WebDriverWait(self.driver, 10)
    print 'helo'

    click = self.driver.find_element_by_xpath("//span[@id='specialty']")
    click.click()
    click_again = self.driver.find_element_by_xpath("//ul[@class='doctor-type']/li[@class='ng-binding ng-scope'][2]")

    click_again.click()
    time.sleep(25)

    act = ActionChains(self.driver)
    act.move_to_element(self.driver.find_element_by_id('doctor-matrix-section')).click()
    print 'now here'

    for i in range(0, 75):  
        #self.driver.find_element_by_xpath("//div[@id='doctor-matrix-section']").send_keys(Keys.PAGE_DOWN)
        #self.driver.execute_script("window.scrollBy(0, document.body.scrollHeight);")
        #self.driver.find_element_by_tag_name("body").click()
        #self.driver.find_element_by_tag_name("body").send_keys(Keys.PAGE_DOWN)#findElement(By.tagName("body")).sendKeys(Keys.UP);
        #self.driver.find_element_by_tag_name("body").send_keys(Keys.END)
        #bg = self.driver.find_element_by_css_selector('body')

        #bg.send_keys(Keys.SPACE)
        act.send_keys(Keys.PAGE_DOWN).perform()
        time.sleep(2)

        print i
        i += 1

    links = self.driver.find_elements_by_xpath("//div[@class = 'result-information']/div[@class='name']/a")

    for l in links:
        print l
        doc_list = l.get_attribute('href')
        if re.match(r'https:\/\/www\.dukemedicine\.org\/find-doctors-physicians\/#!\/(.*)', doc_list):
            print doc_list
            dr = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true'])
            dr.maximize_window()

            dr.get(doc_list)

            try:
                name_title = dr.find_element_by_xpath('//div[@class="header1 ng-binding"]').text
                name_titles = name_title.split(",", 1)
                name = name_titles[0].encode('utf-8')

                title = name_titles[1]
                print name.encode('utf-8')
                title = title[1:].encode('utf-8')
                print title.encode('utf-8')
            except:
                name = ''
                title = ''
            try:
                speciality = dr.find_element_by_xpath('//p[@class="specialties ng-scope"]').text

            except:
                speciality = ''

            try:
                language = dr.find_element_by_xpath(
                    '//div[@class="lang ng-scope"]/div[@class="plainText inline ng-binding"]').text
            except:
                language = ''
            if dr.find_elements_by_xpath('//div[@class="location-info"]'):
                locations = dr.find_elements_by_xpath('//div[@class="location-info"]')
                if len(locations) >= 3:
                    locationA = locations[0].text.encode('utf-8')
                    locationA = locationA.replace('Directions', '')
                    locationA = locationA.replace('\n', '')
                    locationB = locations[1].text.encode('utf-8')
                    locationB = locationB.replace('Directions', '')
                    locationB = locationB.replace('\n', '')
                    locationC = locations[2].text.encode('utf-8')
                    locationC = locationC.replace('\n', '')
                    locationC = locationC.replace('Directions', '')
                elif len(locations) == 2:
                    locationA = locations[0].text.encode('utf-8')
                    locationA = locationA.replace('Directions', '')
                    locationA = locationA.replace('\n', '')
                    locationB = locations[1].text.encode('utf-8')
                    locationB = locationA.replace('Directions', '')
                    locationB = locationB.replace('\n', '')
                    locationC = ''
                elif len(locations) == 1:
                    locationA = locations[0].text.encode('utf-8')
                    locationA = locationA.replace('Directions', '')
                    locationA = locationA.replace('\n', '')
                    locationB = ''
                    locationC = ''
            else:
                locationA = ''
                locationB = ''
                locationC = ''

            dr.close()
            data = [title, name, speciality, language, locationA, locationB, locationC]
            print 'aaaa'
            print data
            a.writerow(data)

无论我在范围内设置哪个更高的值，它都会忽略超出某个点的结果。

最佳答案

让我们使用这样一个事实:有一个元素具有总结果计数:

这个想法是迭代scroll into view最后找到的医生，直到我们加载完所有医生。

实现(带有澄清注释，仅保留相关的“selenium”特定部分):

# -*- coding: utf-8 -*-
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--load-images=false'])
# driver = webdriver.Chrome()
driver.maximize_window()

driver.get("http://www.dukemedicine.org/find-doctors-physicians")

# close optional survey popup if exists
try:
    driver.find_element_by_css_selector("area[alt=close]").click()
except NoSuchElementException:
    pass

# open up filter dropdown
click = driver.find_element_by_id("specialty")
click.click()

# choose specialist
specialist = driver.find_element_by_xpath("//ul[@class = 'doctor-type']/li[contains(., 'specialist')]")
specialist.click()

# artificial delay: TODO: fix?
time.sleep(15)

# read total results count
total_count = int(driver.find_element_by_id("doctor-number").text)

# get the initial results count
results = driver.find_elements_by_css_selector("div.doctor-result")
current_count = len(results)

# iterate while all of the results would not be loaded
while current_count < total_count:
    driver.execute_script("arguments[0].scrollIntoView();", results[-1])

    results = driver.find_elements_by_css_selector("div.doctor-result")
    current_count = len(results)
    print "Current results count: %d" % current_count

# report total results
print "----"
print "Total results loaded: %d" % current_count

driver.quit()

在 PhantomJS 和 Chrome 中都非常适合我。这是我在控制台上得到的内容:

Current results count: 36
Current results count: 54
Current results count: 72
Current results count: 90
...
Current results count: 1656
Current results count: 1674
Current results count: 1692
Current results count: 1708
----
Total results loaded: 1708

另外请注意，我添加了 --load-images=false 命令行参数，实际上可以显着加快速度。

关于python - PhantomJS 不提取链接 Selenium，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/32787568/

文章推荐： jsf - 如何使用

? 绑定(bind)动态内容

文章推荐： java - 颜色函数: darken and lighten

文章推荐： jquery - 有问题的 jQuery 动画代码？

文章推荐： sql-server - 未使用的 SAP，来自 SQL SERVER 2008 的表列表

phantomjs - 在指定路径找不到 phantomjs
我正在尝试开始使用 Mermaid CLI，但是当我尝试针对我的源文件运行它时，它说找不到 phantomjs。 (我运行的是 Win 7 64。) C:\Users\Chris\Documents>
phantomjs - PhantomJS:调用电话时指定用户代理
我正在使用PhantomJS来调用网页，如下所示: page.open('http://example.com', function (s) { console.log(page.content)
phantomjs - phantomjs 会定期自动清除缓存吗？
如果有这样的功能，我需要更改Phantomjs自动清除缓存的默认时间。有什么想法吗？最佳答案应该是您正在寻找的功能: https://github.com/ariya/phantomjs/issu
phantomjs - 运行脚本后自动关闭 PhantomJs
我想从我的程序运行 PhantomJs 脚本，但由于脚本可能不是我写的，我需要确保 PhantomJs 在执行完成或因任何原因(例如无效语法、超时、 ETC)。到目前为止，我读到的所有内容都说你必须始
phantomjs - Karma Jasmine PhantomJS- PhantomJS 没有在 60000 毫秒内捕获
在 Package JSON 中，我尝试了许多不同版本的 karma-phantomjs-launcher、phantomjs，包括 phantomjs-prebuilt。当前包 JSON “开发依
phantomjs - 如何调试 PhantomJS 脚本？
我的脚本有一些语法错误，但 PhantomJS 没有显示任何错误，而是没有显示任何内容。如果脚本有错误，为什么 Phantom JS 不显示解析错误？在以下 PhantomJS 脚本(通过 Wind
phantomjs - CasperJS/PhantomJS 如何保持旧页面打开？
我有一些需要填写的动态输入表单。问题是要填写表格，我需要访问另一个页面以获取取决于上一页输入的数据。因此，在我获得数据然后返回表单后，表单已经更改，因此我需要在获取数据时保持该表单打开。那么问题是如何
phantomjs - 根据内容裁剪 PhantomJS 屏幕截图
PhantomJS 在为我捕获网页到图像文件方面做得很好。我正在使用基于 rasterize.js 的脚本。但是，对于某些固定大小的 Web 元素，我需要生成的图像与 Web 元素的大小相匹配。例
phantomjs - 仅在发生客户端重定向后如何结束 PhantomJS 脚本
我正在将 PhantomJS headless 浏览器集成到我的一个项目中(目前使用 1.6 版)。在大多数情况下，它在完成我需要完成的工作方面做得很好。但是，WebPage.open() 调用工作方
phantomjs - 将变量传递到page.evaluate-PhantomJS
是否可以在page.evaluate中传递变量？ function myFunction(webpage, arg1, arg2){ var page = require('webpage').cre
phantomjs - 可靠地检测基于 PhantomJS 的垃圾邮件机器人
有没有办法始终如一地检测 PhantomJS/CasperJS？我一直在处理用它构建的一系列恶意垃圾邮件机器人，并且能够根据某些行为基本上阻止它们，但是我很好奇是否有一种坚如磐石的方法来了解 Casp
phantomjs - 在 PhantomJS 中使用自定义响应拦截请求？
有没有办法拦截资源请求并直接从处理程序给出响应？像这样的事情: page.onRequest(function(request){ request.reply({data: 123}); });
phantomjs - 如何控制 PhantomJS 跳过下载某种资源？
phantomjs 有配置 loadImage，但我想要更多，如何控制phantomjs跳过下载某种资源，比如css等... ===== 好消息:已添加此功能。 https://code.goo
phantomjs - 在 PhantomJS 中禁用内容安全策略
我正在尝试在 PhantomJS (2.1.1) 的 page.evaulate() 调用中使用 WebSocket。当尝试连接到 WebSocket 服务器时，出现以下错误: 安全错误:DOM 异常
phantomjs - 为 phantomjs 的每个实例指定不同的缓存目录
我正在使用 PhantomJS 1.8，但遇到了一个限制——您无法指定它用于磁盘缓存的目录。我将其添加到他们的问题跟踪系统中，但由于以前没有它，所以我不希望它很快添加。因此，我正在寻找解决此限制的方
phantomjs - 如何使用 phantomJs 滚动页面
我想渲染一个仅在用户滚动页面时加载图像的页面。仅设置 page.scrollPosition 没有任何效果。我需要一些可以随时间改变滚动位置的东西。最佳答案不确定这是否是最好的方法，但它确实有效。
phantomjs - 使用 PhantomJS 设置远程调试
我正在尝试使用 PhantomJS 设置远程调试，但运气不佳。我按照 https://github.com/ariya/phantomjs/wiki/Troubleshooting 上的说明进行操作。
phantomjs - 调试 PhantomJS 网页打开失败
在 PhantomJS 中，webpage.open 会使用状态参数设置为“成功”或“失败”的回调。根据文档，如果没有发生网络错误，则“成功”，否则“失败”。有没有办法查看导致失败的底层网络错误？当
phantomjs - 使用 phantomjs 读取响应体
有什么方法可以使用 phantomjs 请求资源并能够到达响应的主体吗？最佳答案更新:关于“获取并使用所有其他资源(如图像、CSS、字体等)做某事”的其他可能含义，我最近在博客上写了 how to
phantomjs - 加快 phantomjs 屏幕捕获时间？
在运行 PhantomJS 提供的 rasterize.js 示例时，我发现我必须等待 20 秒或更长时间才能生成网页图像。有没有可能在不消耗大量资源的情况下加快速度的方法？我基本上希望快速生成从加

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

python - PhantomJS 不提取链接 Selenium