
python - Web scraping to CSV issue [AttributeError: 'str' object has no attribute 'text']


I am trying to build an automated web scraper. I have spent hours watching YouTube videos and reading material here. I am new to programming (started a month ago) and new to this community...

So, using VS Code as my IDE, I followed a code layout that actually works as a web scraper (Python and Selenium):


from selenium import webdriver
import time
from selenium.webdriver.support.select import Select

with open('job_scraping_multipe_pages.csv', 'w') as file:
    file.write("Job_title, Location, Salary, Contract_type, Job_description \n")

driver = webdriver.Chrome()
driver.get('https://www.jobsite.co.uk/')

driver.maximize_window()
time.sleep(1)

cookie = driver.find_element_by_xpath('//button[@class="accept-button-new"]')
try:
    cookie.click()
except:
    pass

job_title = driver.find_element_by_id('keywords')
job_title.click()
job_title.send_keys('Software Engineer')
time.sleep(1)

location = driver.find_element_by_id('location')
location.click()
location.send_keys('Manchester')
time.sleep(1)

dropdown = driver.find_element_by_id('Radius')
radius = Select(dropdown)
radius.select_by_visible_text('30 miles')
time.sleep(1)

search = driver.find_element_by_xpath('//input[@value="Search"]')
search.click()
time.sleep(2)

for k in range(3):
    titles = driver.find_elements_by_xpath('//div[@class="job-title"]/a/h2')
    location = driver.find_elements_by_xpath('//li[@class="location"]/span')
    salary = driver.find_elements_by_xpath('//li[@title="salary"]')
    contract_type = driver.find_elements_by_xpath('//li[@class="job-type"]/span')
    job_details = driver.find_elements_by_xpath('//div[@title="job details"]/p')

    with open('job_scraping_multipe_pages.csv', 'a') as file:
        for i in range(len(titles)):
            file.write(titles[i].text + "," + location[i].text + "," + salary[i].text + "," + contract_type[i].text + "," +
                       job_details[i].text + "\n")

    next = driver.find_element_by_xpath('//a[@aria-label="Next"]')
    next.click()
file.close()
driver.close()

It worked. Then I tried to reproduce the result for another website. Instead of clicking a "Next" button, I found a way to increment the number at the end of the URL by 1. But my problem comes from the last part of the code, which gives me AttributeError: 'str' object has no attribute 'text'. Below is my code (Python and Selenium) for the site I am targeting ( https://angelmatch.io/pitch_decks/5285 ):


from selenium import webdriver
import time
from selenium.webdriver.support.select import Select

driver = webdriver.Chrome()


with open('pitchDeckResults2.csv', 'w') as file:
    file.write("Startup_Name, Startup_Description, Link_Deck_URL, Startup_Website, Pitch_Deck_PDF, Industries, Amount_Raised, Funding_Round, Year /n")


for k in range(5285, 5287, 1):

    linkDeck = "https://angelmatch.io/pitch_decks/" + str(k)

    driver.get(linkDeck)
    driver.maximize_window
    time.sleep(2)

    startupName = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[2]/div/div/div[1]')
    startupDescription = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[2]/div/div/div[3]/p[2]')
    startupWebsite = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/p[3]/a')
    pitchDeckPDF = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/button/a')
    industries = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/a[2]')
    amountRaised = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/p[1]/b')
    fundingRound = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/a[1]')
    year = driver.find_elements_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/p[2]/b')

    with open('pitchDeckResults2.csv', 'a') as file:
        for i in range(len(startupName)):
            file.write(startupName[i].text + "," + startupDescription[i].text + "," + linkDeck[i].text + "," + startupWebsite[i].text + "," + pitchDeckPDF[i].text + "," + industries[i].text + "," + amountRaised[i].text + "," + fundingRound[i].text + "," + year[i].text + "\n")

    time.sleep(1)

file.close()

driver.close()

I would appreciate any help! I am trying to get the data into a CSV with this approach!

Best Answer

Honestly, you are doing well. The only thing, and the reason for the error, is that you are trying to read a .text attribute from a value of type str; Python's str type has no text attribute. On top of that, indexing it with [i] can also raise a "list index out of range" exception, because linkDeck is a URL string, not a list of elements. What did you mean to put in place of linkDeck[i].text - maybe the page title, or something else?
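Put differently, linkDeck is a plain URL string you built yourself, not a WebElement, so it can be written as it is; a quick illustration (reusing the variable name from your code):

# linkDeck is a plain string; indexing it gives single characters, which have no .text
linkDeck = "https://angelmatch.io/pitch_decks/" + str(5285)
print(linkDeck[0])          # 'h' -- just the first character of the URL
# linkDeck[0].text          # would raise AttributeError: 'str' object has no attribute 'text'
# In the write() call, concatenate the string itself: ... + "," + linkDeck + "," + ...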

By the way, you should not close the file yourself when using a with open() statement. It is a context manager, and it closes the file for you as soon as you leave the block.
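For instance, the file is already closed the moment the with block is left:

with open('pitchDeckResults2.csv', 'a') as file:
    file.write("one row\n")
# the context manager has closed the file here,
# so a later file.close() is redundant (calling it again is harmless)
print(file.closed)  # True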

Apart from that, maximize_window needs parentheses, maximize_window(), to actually be called; I also removed one of the two file opens and just write the link as a plain string:

import time

from selenium import webdriver

driver = webdriver.Chrome()
delimeter = ';'
with open('pitchDeckResults2.csv', 'w+') as _file:
    _l = ['Startup_Name', 'Startup_Description', 'Link_Deck_URL', 'Startup_Website', 'Pitch_Deck_PDF', 'Industries',
          'Amount_Raised', 'Funding_Round', 'Year \n']
    _file.write(delimeter.join(_l))
    for k in range(5285, 5287, 1):
        linkDeck = "https://angelmatch.io/pitch_decks/" + str(k)

        driver.get(linkDeck)
        time.sleep(1)

        startupName = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[2]/div/div/div[1]')
        startupDescription = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[2]/div/div/div[3]/p[2]')
        startupWebsite = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/p[3]/a')
        pitchDeckPDF = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/button/a')
        industries = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/a[2]')
        amountRaised = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/p[1]/b')
        fundingRound = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/a[1]')
        year = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[3]/div[1]/div/p[2]/b')

        all_elements = [startupName.text, startupDescription.text, linkDeck, startupWebsite.text, pitchDeckPDF.text,
                        industries.text, amountRaised.text, fundingRound.text, f"{year.text}\n"]
        _str = delimeter.join(all_elements)
        _file.write(_str)

driver.close()
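One more thought: a raw file.write() will break the columns whenever a scraped field itself contains the ';' delimiter or a newline. The csv module quotes such fields for you; a minimal sketch of the same writing step (file name and rows here are made up, just to show the call):

import csv

# Sketch: csv.writer quotes any field that contains the delimiter or a newline,
# which a plain file.write() of joined strings does not.
header = ['Startup_Name', 'Startup_Description', 'Link_Deck_URL']
example_row = ['Example Inc', 'Builds widgets; ships worldwide', 'https://angelmatch.io/pitch_decks/5285']

with open('pitchDeckResults2_csvmodule.csv', 'w', newline='') as _file:
    writer = csv.writer(_file, delimiter=';')
    writer.writerow(header)
    writer.writerow(example_row)  # in the scraper, build this list from the .text values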

Let me know if I missed anything.

Regarding python - Web scraping to CSV issue [AttributeError: 'str' object has no attribute 'text'], a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66632851/
