
python - Web scraping does not fetch the whole table


I wrote code that uses BeautifulSoup and Selenium to fetch a table.

However, only part of the table is fetched. Rows and columns that are not visible when the website is opened are not captured by the soup object.

I am sure the problem is in the snippet WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "contenttabledivjqxGrid")))

... I tried several other alternatives, but none of them gave me the expected result (i.e. loading all rows and columns of this table before I change the date with Selenium).


Here is the code:

import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

# Set up the Firefox driver with a Profile and Options
profile = webdriver.FirefoxProfile()
profile.set_preference('intl.accept_languages', 'pt-BR, pt')
profile.set_preference('browser.download.folderList', '2')
profile.set_preference('browser.download.manager.showWhenStarting', 'false')
profile.set_preference('browser.download.dir', 'dwnd_path')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/octet-stream,application/vnd.ms-excel')

options = Options()
options.headless = False

# Create the driver and open the site
driver = webdriver.Firefox(firefox_profile=profile, options=options)

site = 'http://mananciais.sabesp.com.br/HistoricoSistemas'
driver.get(site)

# Wait until the grid is visible, then parse the page source
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "contenttabledivjqxGrid")))
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Header
header = soup.find_all('div', {'class': 'jqx-grid-column-header'})
for i in header:
    print(i.get_text())

# Keep only the relevant columns
head = []
for i in header:
    if i.get_text().startswith(('Represa', 'Equivalente')):
        print('Excluído: ' + i.get_text())
    else:
        print(i.get_text())
        head.append(i.get_text())

print('-' * 70)
print(head)
print('-' * 70)
print('Número de Colunas: ' + str(len(head)))

# Cell values
data = soup.find_all('div', {'class': 'jqx-grid-cell'})
values = []
for i in data:
    print(i.get_text())
    values.append(i.get_text())


import numpy as np
import pandas as pd

# Convert the data to a numpy array
num = np.array(values)

# Currently its shape is one-dimensional; reshape it into rows x columns
n_rows = int(len(num) / len(head))
n_cols = int(len(head))
reshaped = num.reshape(n_rows, n_cols)

# Build the table
pd.DataFrame(reshaped, columns=head)

I am just a hydrologist who wants to get the data for these reservoirs. Can anyone help me?

At the moment my resulting table looks like this:

[screenshot: partial result table]

Best Answer

It looks like the table is loaded dynamically, and only a portion of it is present in the HTML at any given moment, which is why you only get part of the data. A possible solution is to use Selenium to move the grid's scrollbar and read the data bit by bit.
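A minimal sketch of that idea (not tested against this page): it assumes the page exposes jQuery and the jqWidgets API from scripts and that the grid's id is 'jqxGrid' -- both are assumptions, so check the real ids and scripts in the page source. The loop scrolls the grid a few hundred pixels at a time, re-parses the rendered cells after each step, and keeps only the rows it has not seen yet:

import time
from bs4 import BeautifulSoup

def visible_rows(driver, n_cols):
    # Parse the cells currently rendered by the grid into rows of n_cols values.
    # n_cols must be the number of cell columns actually rendered per row.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    cells = [c.get_text() for c in soup.find_all('div', {'class': 'jqx-grid-cell'})]
    return [tuple(cells[i:i + n_cols]) for i in range(0, len(cells) - n_cols + 1, n_cols)]

def scrape_full_grid(driver, n_cols, step=300, max_offset=20000, pause=0.5):
    # Scroll the virtualized jqxGrid bit by bit and accumulate every row that appears.
    # Assumes jQuery/jqWidgets are reachable and the grid id is 'jqxGrid' (adjust as needed).
    seen, rows = set(), []
    offset = 0
    while offset <= max_offset:
        driver.execute_script(
            "$('#jqxGrid').jqxGrid('scrolloffset', arguments[0], 0);", offset)
        time.sleep(pause)  # give the grid time to render the newly visible rows
        for row in visible_rows(driver, n_cols):
            if row not in seen:
                seen.add(row)
                rows.append(row)
        offset += step
    return rows

# Usage, after the WebDriverWait in the question:
# rows = scrape_full_grid(driver, n_cols=len(head))
# df = pd.DataFrame(rows, columns=head)

If calling into jqWidgets from execute_script does not work on this page, an alternative is to drag the grid's scrollbar element with ActionChains and re-read the cells inside the same loop.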

Regarding python - Web scraping does not fetch the whole table, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60840490/
