python - urllib in Python does not return the HTML I see in Inspect Element

Reposted. Author: 可可西里. Updated: 2023-11-01 13:32:40

I am trying to scrape the results at this link:

url = "http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F"

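When I inspect the page with Firebug I can see the HTML, and I know what I need to do to extract the tweets. The problem is that the response I get from urlopen does not contain the same HTML. I only get the tags, without the tweet content inside them. What am I missing?

Example code below: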

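from urllib2 import urlopen  # assuming Python 2, like the answer's code
from bs4 import BeautifulSoup


def get_tweets(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    tweets = soup.find("div", "results")
    # Collect the first link inside each tweet container
    category_links = [tweet.a["href"] for tweet in tweets.findAll("div", "result-tweet")]
    return category_links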

url = "http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F"
cat_links = get_tweets(url)

Thanks, YB

Best answer

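The problem is that the content of the results div is filled in by additional HTTP calls and JavaScript code executed on the browser side. urllib only "sees" the initial HTML page, which does not contain the data you need.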
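To make the difference concrete, here is a minimal sketch that counts elements with a given class in two HTML strings. Both strings are hand-written stand-ins, not captures from the real site: one imitates what the server might send before any script runs, the other imitates the DOM after the browser has filled in the results. The helper names `ClassCounter` and `count_class` are made up for this illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in html carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical markup: the page as served, with an empty results container.
INITIAL_HTML = (
    '<html><body><div id="results" class="results"></div>'
    '<script src="app.js"></script></body></html>'
)

# Hypothetical markup: the same container after scripts have populated it.
RENDERED_HTML = (
    '<html><body><div id="results" class="results">'
    '<div class="result-tweet">'
    '<a href="http://twitter.com/Evonomie/status/512179917610835968">tweet</a>'
    '</div></div></body></html>'
)

print(count_class(INITIAL_HTML, "result-tweet"))
print(count_class(RENDERED_HTML, "result-tweet"))
```

Running the same check on the bytes returned by urlopen is a quick way to tell whether the data you want is in the served page at all, or only appears after JavaScript has run.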

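One option is to follow @Himal's suggestion and simulate the underlying request to trackbacks.js, which is the request the page sends to get the tweet data. The result is in JSON format, which you can load() with the json module that ships with the standard library: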

import json
import urllib2

url = 'http://otter.topsy.com/trackbacks.js?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F&infonly=0&call_timestamp=1411090809443&apikey=09C43A9B270A470B8EB8F2946A9369F3'
data = json.load(urllib2.urlopen(url))
for tweet in data['response']['list']:
    print tweet['permalink_url']

Prints:

http://twitter.com/Evonomie/status/512179917610835968
http://twitter.com/abs_office/status/512054653723619329
http://twitter.com/TKE_Global/status/511523709677756416
http://twitter.com/trevinocreativo/status/510216232122200064
http://twitter.com/TomCrouser/status/509730668814028800
http://twitter.com/Evonomie/status/509703168062922753
http://twitter.com/peterchaly/status/509592878491136000
http://twitter.com/chandagarwala/status/509540405411840000
http://twitter.com/Ayjay4650/status/509517948747526144
http://twitter.com/Marketingccc/status/509131671900536832

This is the "down to the metal" option.
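For readers on Python 3, the same idea can be sketched with the request building and the JSON handling split into two small functions. This is only an illustration: `build_trackbacks_url` and `extract_permalinks` are hypothetical helper names, `YOUR_API_KEY` is a placeholder, the extra `infonly` and `call_timestamp` parameters from the URL above are left out, and the sample payload is hand-written in the shape the loop above relies on (`response` -> `list` -> `permalink_url`). No network request is made, since the endpoint may no longer respond.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the answer above; it may no longer be reachable.
ENDPOINT = "http://otter.topsy.com/trackbacks.js"


def build_trackbacks_url(article_url, api_key):
    """Build the request URL; urlencode percent-encodes the article URL."""
    query = urlencode({"url": article_url, "apikey": api_key})
    return ENDPOINT + "?" + query


def extract_permalinks(payload):
    """Return the permalink_url of every entry under response -> list."""
    return [item["permalink_url"] for item in payload["response"]["list"]]


# Hand-written sample payload, not a real API response.
SAMPLE = json.loads(
    '{"response": {"list": ['
    '{"permalink_url": "http://twitter.com/Evonomie/status/512179917610835968"},'
    '{"permalink_url": "http://twitter.com/abs_office/status/512054653723619329"}'
    ']}}'
)

ARTICLE = (
    "http://mashable.com/2014/08/27/"
    "australia-retail-evolution-lab-aopen-shopping/"
)

print(build_trackbacks_url(ARTICLE, "YOUR_API_KEY"))
for link in extract_permalinks(SAMPLE):
    print(link)
```

In real use the payload would come from `json.load(urllib.request.urlopen(url))` instead of the sample string.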


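Otherwise, you can take a "high-level" approach and not worry about what is happening under the hood. Let a real browser load the page, and interact with it through selenium WebDriver: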

from selenium import webdriver

driver = webdriver.Chrome() # can be Firefox(), PhantomJS() and more
driver.get("http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F")

# Selenium 4 replaces find_elements_by_* with find_elements(By.CLASS_NAME, ...)
for tweet in driver.find_elements_by_class_name('result-tweet'):
    print tweet.find_element_by_xpath('.//div[@class="media-body"]//ul[@class="inline"]/li//a').get_attribute('href')

driver.close()

Prints:

http://twitter.com/Evonomie/status/512179917610835968
http://twitter.com/abs_office/status/512054653723619329
http://twitter.com/TKE_Global/status/511523709677756416
http://twitter.com/trevinocreativo/status/510216232122200064
http://twitter.com/TomCrouser/status/509730668814028800
http://twitter.com/Evonomie/status/509703168062922753
http://twitter.com/peterchaly/status/509592878491136000
http://twitter.com/chandagarwala/status/509540405411840000
http://twitter.com/Ayjay4650/status/509517948747526144
http://twitter.com/Marketingccc/status/509131671900536832

This is how you can scale the second option to get all of the tweets by following the pagination:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

BASE_URL = 'http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F&offset={offset}'

driver = webdriver.Chrome()

# get tweets count
driver.get('http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F')
tweets_count = int(driver.find_element_by_xpath('//li[@data-name="all"]/a/span').text)

for x in xrange(0, tweets_count, 10):
    driver.get(BASE_URL.format(offset=x))

    # page header appears in case no more tweets found
    try:
        driver.find_element_by_xpath('//div[@class="page-header"]/h3')
    except NoSuchElementException:
        pass
    else:
        break

    # wait for results
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, "results"))
    )

    # get tweets
    for tweet in driver.find_elements_by_class_name('result-tweet'):
        print tweet.find_element_by_xpath('.//div[@class="media-body"]//ul[@class="inline"]/li//a').get_attribute('href')

driver.close()
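The control flow of the loop above (step through offsets in pages of ten, stop early once a page has nothing left) can be tried without a browser. The sketch below is written for Python 3 and swaps the Selenium calls for an injected `fetch_page` function. `collect_all`, `fake_fetch`, and the example.com links are all made up for illustration, and an empty page stands in for the "page-header" check.

```python
def collect_all(fetch_page, total, page_size=10):
    """Walk offsets 0, page_size, 2 * page_size, ... below total.

    fetch_page(offset) is expected to return a list of links for that page.
    An empty list is treated as "no more results" and ends the loop early.
    """
    links = []
    for offset in range(0, total, page_size):
        page = fetch_page(offset)
        if not page:
            break
        links.extend(page)
    return links


# Stand-in data source: 23 fake links served ten at a time.
FAKE_LINKS = ["http://example.com/status/%d" % i for i in range(23)]


def fake_fetch(offset):
    """Return the slice of FAKE_LINKS that a page at this offset would show."""
    return FAKE_LINKS[offset:offset + 10]


all_links = collect_all(fake_fetch, total=23)
print(len(all_links))
```

Keeping the paging logic separate from the page-loading code makes it possible to check the stop condition with fake data before pointing it at a real browser session.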

Regarding python - urllib in Python does not return the HTML I see in Inspect Element, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/25924890/
