
python - Unable to get data from every class named 'heading' on the page using Selenium


Hi, I'm new to data scraping. Here I'm trying to scrape data from all the elements that have 'heading' as their class attribute. But in my code it only prints the first element, even though I iterate with a for loop.

Expected output - scrape data from the classes named 'heading' on every page.

Actual output - only scrapes data from the first element with the class name 'heading', and doesn't even click the Next button.

The site I'm using for testing is the fundoodata.com listing in the code below.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd
from openpyxl.workbook import Workbook


DRIVER_PATH = 'C:/Users/Aishwary/Downloads/chromedriver_win32/chromedriver'

driver = webdriver.Chrome(executable_path=DRIVER_PATH)

driver.get('https://www.fundoodata.com/citiesindustry/19/2/list-of-information-technology-(it)-companies-in-noida')

# get all elements that have 'heading' as a class name
company_names = driver.find_elements_by_class_name('heading')

# to store all company names from the heading elements
names_list = []

while True:

    try:
        for name in company_names:  # iterate over each element with the class name 'heading'
            text = name.text  # get the text from that element
            names_list.append(text)
            print(text)
        # Click on the Next button to get data from the following pages as well
        driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="main-container"]/div[2]/div[4]/div[2]/div[44]/div[1]/ul/li[7]/a'))))
        driver.find_element_by_xpath('//*[@id="main-container"]/div[2]/div[4]/div[2]/div[44]/div[1]/ul/li[7]/a').click()

    except (TimeoutException, WebDriverException) as e:
        print("Last page reached")
        break


driver.quit()

# Store the scraped data in an Excel sheet
df = pd.DataFrame(names_list)
writer = pd.ExcelWriter('companies_names.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='List')
writer.save()
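
For what it's worth, the root cause here is that company_names is collected once, before the loop, so it only ever holds the first page's elements (and those go stale once the page changes). Below is a minimal sketch of a reworked Selenium loop, assuming the pager link is labelled "Next" (the same text the answer below matches on) and stops being clickable on the last page:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

DRIVER_PATH = 'chromedriver'  # adjust to your local chromedriver path

driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.fundoodata.com/citiesindustry/19/2/list-of-information-technology-(it)-companies-in-noida')

names_list = []
while True:
    # Re-fetch the heading elements on every page; a list collected before
    # the loop would only ever contain the first page's (now stale) elements.
    for name in driver.find_elements_by_class_name('heading'):
        names_list.append(name.text)
    try:
        # Locate the pager link by its text instead of a brittle absolute XPath.
        next_link = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.LINK_TEXT, 'Next')))
        next_link.click()
    except (TimeoutException, WebDriverException):
        print("Last page reached")
        break

driver.quit()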

Best Answer

This script will get all the company names from the pages:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://www.fundoodata.com/citiesindustry/19/2/list-of-information-technology-(it)-companies-in-noida'

all_data = []
while True:
    print(url)

    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for h in soup.select('div.heading'):
        all_data.append({'Name': h.text})
        print(h.text)

    next_page = soup.select_one('a:contains("Next")')
    if not next_page:
        break

    url = 'https://www.fundoodata.com' + next_page['href']

df = pd.DataFrame(all_data)
print(df)

df.to_csv('data.csv')
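
One detail worth noting: :contains() is not standard CSS; it is an extension provided by the Soup Sieve library that BeautifulSoup's select() uses (newer versions spell it :-soup-contains()). An equivalent lookup without it, assuming the link's text is exactly "Next", would be:

next_page = soup.find('a', string='Next')  # plain BeautifulSoup search by tag name and text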
Prints:
                             Name
0                   BirlaSoft Ltd
1             HCL Infosystems Ltd
2            HCL Technologies Ltd
3          NIIT Technologies Ltd
4          3Pillar Global Pvt Ltd
..                            ...
481  Innovaccer Analytics Pvt Ltd
482         Kratikal Tech Pvt Ltd
483          Sofocle Technologies
484    SquadRun Solutions Pvt Ltd
485   Zaptas Technologies Pvt Ltd

[486 rows x 1 columns]
And it saves data.csv (the original answer included a LibreOffice screenshot of the file).
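
Since the original goal was an Excel sheet, the same DataFrame could also be written with pandas directly; a one-liner sketch that assumes the openpyxl or xlsxwriter package is installed:

df.to_excel('companies_names.xlsx', sheet_name='List', index=False)  # write the names to an Excel workbook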

Regarding "python - Unable to get data from every class named 'heading' on the page using Selenium", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63325463/
