
python - BeautifulSoup scraping td and tr

Reposted. Author: 太空宇宙. Updated: 2023-11-03 18:11:49

I am trying to extract price data (high and low) from the third table (Corn). The code returns "None":

import urllib2                          
from bs4 import BeautifulSoup
import time
import re
start_urls = 4539
nb_quotes = 10
for urls in range (start_urls, start_urls - nb_quotes, -1):

    start_time = time.time()

    # construct the URLs strings
    url = 'http://markets.iowafarmbureau.com/markets/fixed.php?page=egrains'

    # Read the HTML page content
    page = urllib2.urlopen(url)

    # Create a beautifulsoup object
    soup = BeautifulSoup(page)

    # Search the table to be parsed in the whole HTML code
    tables = soup.findAll('table')
    tab = tables[2] # This is the table to be parsed

    low_tmp = str(tab.findAll('tr')[0].findAll('td')[1].getText()) #Low price
    low = re.sub('[+]', '', low_tmp)
    high_tmp = str(tab.findAll('tr')[0].findAll('td')[2].string) # High price
    high = re.sub('[+]', '', high_tmp)

    stop_time = time.time()

    print low, '\t', high, '(%0.1f s)' % (stop_time - start_time)

Best answer

The data in the table is populated on the browser side by the following JavaScript call:

document.write(getQuoteboardHTML(
splitQuote(quotes, 'ZC*1,ZC*2,ZC*3,ZC*4,ZC*5,ZC*6,ZC*7,ZC*8,ZC*9'.split(/,/)),
'shortmonthonly,high,low,last,change'.split(/,/), { nospacers: true }));

BeautifulSoup is an HTML parser; it does not execute JavaScript.
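
To see this concretely, here is a minimal Python 3 sketch. It uses the standard library's html.parser as a stand-in for BeautifulSoup (an assumption, made only to keep the example dependency-free), and STATIC_HTML is a hypothetical, trimmed-down page rather than the site's real markup. The parser sees the script text but no td cells, because nothing ever runs document.write:

```python
from html.parser import HTMLParser

# Hypothetical, trimmed-down page: the table rows are produced by a script,
# so they are absent from the static HTML that a parser receives.
STATIC_HTML = """
<table class="homepage_quoteboard">
<script>document.write(getQuoteboardHTML(quotes));</script>
</table>
"""


class TagCounter(HTMLParser):
    """Count the start tags present in the static markup."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(STATIC_HTML)
counter.close()

# The script body is treated as plain text, so no <td> is ever seen.
print(counter.counts.get("table", 0), counter.counts.get("td", 0))
```

This prints 1 0: one table, zero cells. That is consistent with the question's code finding nothing useful to extract from the table.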

Basically, you need something that executes that JavaScript for you.

One solution is to use a real browser with the help of selenium:

from selenium import webdriver


url = "http://markets.iowafarmbureau.com/markets/fixed.php?page=egrains"

driver = webdriver.Firefox()
driver.get(url)

table = driver.find_element_by_xpath('//td[contains(div[@class="fixedpage_heading"], "CORN")]/table[@class="homepage_quoteboard"]')
for row in table.find_elements_by_tag_name('tr')[1:]:
    month = row.find_element_by_class_name('quotefield_shortmonthonly').text
    low = row.find_element_by_class_name('quotefield_low').text
    high = row.find_element_by_class_name('quotefield_high').text

    print month, low, high

driver.close()

Prints:

SEP 329-0 338-0
DEC 335-6 345-4
MAR 348-2 358-0
MAY 356-6 366-0
JUL 364-0 373-4
SEP 372-0 379-4
DEC 382-0 390-2
MAR 392-4 399-0
MAY 400-0 405-0

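Another option is to dig deeper and see what the splitQuote() and getQuoteboardHTML() JS functions actually do. Using the browser developer tools, you can see that there is an underlying request to another URL, which returns a piece of JavaScript code containing all the objects with the table data shown on the page: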

var quotes = { 'ZC*1': { name: 'Corn', flag: 's', price_2_close: '338.75', open_interest: '2701', tradetime: '20140911133000', symbol: 'ZCU14', open: '338', high: '338', low: '329', last: '331.75', change: '-7', pctchange: '-2.07', volume: '1623', exchange: 'CBOT', type: '2', unitcode: '-1', date: '14104 ... ', month: 'May 2015', shortmonth: 'May 2015' } };

If you manage to extract the necessary parts from it, that would be your second option.
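
As a rough illustration of that second option, the Python 3 sketch below pulls the fields out of such a response with regular expressions. SAMPLE_JS is an abbreviated copy of the snippet above (only a few of the fields shown there), and parse_quotes is a hypothetical helper, not part of any library. It assumes every value is a single-quoted string with no nested braces, as in the snippet. In practice the text would come from the request seen in the developer tools.

```python
import re

# Abbreviated sample of the response, limited to fields shown in the snippet.
SAMPLE_JS = (
    "var quotes = { 'ZC*1': { name: 'Corn', symbol: 'ZCU14', "
    "high: '338', low: '329', last: '331.75', change: '-7' } };"
)

# 'KEY': { ...body without nested braces... }
ENTRY_RE = re.compile(r"'([^']+)'\s*:\s*\{([^{}]*)\}")
# field: 'value'
FIELD_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_quotes(js_text):
    """Return {quote_key: {field: value}} parsed from the JavaScript text."""
    quotes = {}
    for key, body in ENTRY_RE.findall(js_text):
        quotes[key] = dict(FIELD_RE.findall(body))
    return quotes


quotes = parse_quotes(SAMPLE_JS)
for key, fields in sorted(quotes.items()):
    print(key, fields["symbol"], fields["low"], fields["high"])
```

For the sample this prints ZC*1 ZCU14 329 338. Regular expressions are used here because the response is a JavaScript object literal, not JSON (unquoted field names, single-quoted strings), so json.loads would reject it as-is.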

Regarding python - BeautifulSoup scraping td and tr, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/25794935/
