
Python: BeautifulSoup scrape, blank class descriptions mess up the data

Reposted. Author: 行者123. Updated: 2023-12-01 01:24:22

I am trying to scrape some class data from the website https://bulletins.psu.edu/university-course-descriptions/undergraduate/ for a project.

# -*- coding: utf-8 -*-
"""
Created on Mon Nov 5 20:37:33 2018

@author: DazedFury
"""
# Here, we're just importing both Beautiful Soup and the Requests library
from bs4 import BeautifulSoup
import requests

# returns a CloudflareScraper instance
#scraper = cfscrape.create_scraper()

#URL and textfile
text_file = open("Output.txt", "w", encoding='UTF-8')
page_link = 'https://bulletins.psu.edu/university-course-descriptions/undergraduate/acctg/'
page_response = requests.get(page_link)
page_content = BeautifulSoup(page_response.content, "html.parser")

#Array for storing URL's
URLArray = []

#Find links
for link in page_content.find_all('a'):
    if('/university-course-descriptions/undergraduate' in link.get('href')):
        URLArray.append(link.get('href'))
k = 1

#Parse Loop
while(k != 242):
    print("Writing " + str(k))

    completeURL = 'https://bulletins.psu.edu' + URLArray[k]

    # this is the url that we've already determined is safe and legal to scrape from.
    page_link = completeURL

    # here, we fetch the content from the url, using the requests library
    page_response = requests.get(page_link)

    #we use the html parser to parse the url content and store it in a variable.
    page_content = BeautifulSoup(page_response.content, "html.parser")
    page_content.prettify

    #Find and print all text with tag p
    paragraphs = page_content.find_all('div', {'class' : 'course_codetitle'})
    paragraphs2 = page_content.find_all('div', {'class' : 'courseblockdesc'})
    j = 0
    for i in range(len(paragraphs)):
        if i % 2 == 0:
            text_file.write(paragraphs[i].get_text())
            text_file.write("\n")
            if j < len(paragraphs2):
                text_file.write(" ".join(paragraphs2[j].get_text().split()))
                text_file.write("\n")
                text_file.write("\n")
                if(paragraphs2[j].get_text() != ""):
                    j += 1

    k += 1

    #FORMAT
    #text_file.write("<p style=\"page-break-after: always;\">&nbsp;</p>")
    #text_file.write("\n\n")

#Close Text File
text_file.close()

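The specific information I need is the class title and description. The problem is that some classes have blank descriptions, which throws off the order and produces wrong data.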

[Images: output.txt, bulletin]

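I thought about simply checking whether the class description is blank, but on the site the "courseblockdesc" tag does not exist when a class has no description. So when I find_all courseblockdesc, no element is added to the list for that class, and the order ends up misaligned. There are too many of these errors to fix by hand, so I am hoping someone can help me find a solution.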
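To make the failure mode concrete, here is a minimal sketch that uses plain lists in place of the two find_all results. The course names and descriptions are made up for illustration; they are not taken from the bulletin.

```python
# Hypothetical stand-ins for the two find_all() results.
titles = ["COURSE 101", "COURSE 102", "COURSE 103"]

# COURSE 102 has no courseblockdesc div, so find_all returns nothing for it
# and the descriptions list is one element shorter than the titles list.
descriptions = ["Description of 101.", "Description of 103."]

# Pairing by position attaches the description of 103 to 102
# and silently drops COURSE 103 altogether.
paired = list(zip(titles, descriptions))
for title, desc in paired:
    print(title, "->", desc)
```

Once one description is missing, every later title is paired with the wrong text, which is why checking for an empty string afterwards cannot repair the order.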

Best answer

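The simplest solution is to run a single find_all for the parent element of the items you want, then look up the title and description inside each parent.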

for block in page_content.find_all('div', class_="courseblock"):
    title = block.find('div', {'class' : 'course_codetitle'})
    description = block.find('div', {'class' : 'courseblockdesc'})
    # do what you need with the navigable strings here.
    print(title.get_text())
    # description is None when the course has no courseblockdesc div
    if description:
        print(description.get_text())
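As a self-contained sketch of the same idea, the example below runs the per-block lookup on an inline HTML string instead of a live page. The markup, the course names, and the helper name extract_courses are assumptions made for illustration; only the class names courseblock, course_codetitle and courseblockdesc come from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described in the question.
SAMPLE_HTML = """
<div class="courseblock">
  <div class="course_codetitle">COURSE 101: Intro</div>
  <div class="courseblockdesc">  First
     description. </div>
</div>
<div class="courseblock">
  <div class="course_codetitle">COURSE 102: No description</div>
</div>
<div class="courseblock">
  <div class="course_codetitle">COURSE 103: Advanced</div>
  <div class="courseblockdesc">Third description.</div>
</div>
"""


def extract_courses(html):
    """Return (title, description) pairs, one per courseblock div."""
    soup = BeautifulSoup(html, "html.parser")
    courses = []
    for block in soup.find_all("div", class_="courseblock"):
        title = block.find("div", class_="course_codetitle")
        if title is None:
            # Skip blocks that carry no title at all.
            continue
        description = block.find("div", class_="courseblockdesc")
        # Collapse runs of whitespace, as the question's code does.
        title_text = " ".join(title.get_text().split())
        desc_text = " ".join(description.get_text().split()) if description else ""
        courses.append((title_text, desc_text))
    return courses


courses = extract_courses(SAMPLE_HTML)
for title_text, desc_text in courses:
    print(title_text)
    print(desc_text)
    print()
```

Because each title and description is looked up inside the same parent block, a missing description yields an empty string for that course only and cannot shift the entries that follow it.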

This post, "Python: BeautifulSoup scrape, blank class descriptions mess up the data", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/53490816/
