python - BeautifulSoup4 scraping: Pandas "arrays must all be same length" when exporting data to csv


I am using BeautifulSoup4 to scrape information from a website and Pandas to export the data to a csv file. The dictionary holds 5 columns of data, represented by 5 lists. However, because the website does not have complete data for all 5 categories, some lists have fewer items than others. So when I try to export the data, pandas gives me

ValueError: arrays must all be same length.

What is the best way to handle this situation? Specifically, the lists that end up with fewer items are "authors" and "pages". Thanks in advance!
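(For reference, the error can be reproduced without any scraping at all: pandas raises it whenever the dict values passed to DataFrame are lists of unequal length. The column names below are just placeholders.)

import pandas

# two columns with 2 items and one with only 1 item -> ValueError, arrays must all be same length
pandas.DataFrame({'Title': ['A', 'B'], 'Author': ['X', 'Y'], 'Pages': ['1-10']})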
Code:

import requests as r
from bs4 import BeautifulSoup as soup
import pandas

#make a list of all web pages' urls
webpages=[]
for i in range(15):
    root_url = 'https://cross-currents.berkeley.edu/archives?author=&title=&type=All&issue=All&region=All&page=' + str(i)
    webpages.append(root_url)
print(webpages)
#start looping through all pages
titles = []
journals = []
authors = []
pages = []
dates = []
issues = []

for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')

    #find targeted info and put it into lists to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
    titles += [el.replace('\n', '') for el in title_list]

    journal_list = [journal.text for journal in page_soup.find_all('em')]
    journals += [el.replace('\n', '') for el in journal_list]

    author_list = [author.text for author in page_soup.find_all('div', {'class':'field field--name-field-citation-authors field--type-string field--label-hidden field__item'})]
    authors += [el.replace('\n', '') for el in author_list]

    pages_list = [p.text for p in page_soup.find_all('div', {'class':'field field--name-field-citation-pages field--type-string field--label-hidden field__item'})]
    pages += [el.replace('\n', '') for el in pages_list]

    date_list = [date.text for date in page_soup.find_all('div', {'class':'field field--name-field-date field--type-datetime field--label-hidden field__item'})]
    dates += [el.replace('\n', '') for el in date_list]

    issue_list = [issue.text for issue in page_soup.find_all('div', {'class':'field field--name-field-issue-number field--type-integer field--label-hidden field__item'})]
    issues += [el.replace('\n', '') for el in issue_list]

# export to csv file via pandas
dataset = {'Title': titles, 'Author': authors, 'Journal': journals, 'Date': dates, 'Issue': issues, 'Pages': pages}
df = pandas.DataFrame(dataset)
df.index.name = 'ArticleID'
df.to_csv('example45.csv', encoding="utf-8")

Best answer

If you are sure that, for example, the titles always have the correct length, you could do something like this:

title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
titles_to_add = [el.replace('\n', '') for el in title_list]
titles += titles_to_add

...

author_list = [author.text for author in page_soup.find_all('div', {'class':'field field--name-field-citation-authors field--type-string field--label-hidden field__item'})]
authors_to_add = [el.replace('\n', '') for el in author_list]
# pad the shorter list with blank placeholders until it matches the titles
while len(authors_to_add) < len(titles_to_add):
    authors_to_add.append(" ")
authors += authors_to_add

pages_list = [p.text for p in page_soup.find_all('div', {'class':'field field--name-field-citation-pages field--type-string field--label-hidden field__item'})]
pages_to_add = [el.replace('\n', '') for el in pages_list]
while len(pages_to_add) < len(titles_to_add):
    pages_to_add.append(" ")
pages += pages_to_add
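(For reference, the same padding can also be done once, just before building the DataFrame, instead of page by page. This is only a sketch and assumes the six lists from the question have already been filled:)

# pad every column to the length of the longest one with blank placeholders
all_columns = [titles, authors, journals, dates, issues, pages]
longest = max(len(col) for col in all_columns)
for col in all_columns:
    col += [" "] * (longest - len(col))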

But... this only adds elements to the columns so that they have the right length and the dataframe can be created. In the resulting dataframe, the authors and pages will not be in the correct rows. You will have to change your algorithm a bit to reach your final goal... It would be better to iterate over all the rows on a page and get the title etc. from each row... something like this:

rows = page_soup.find_all('div', {'class':'views-row'})
for row in rows:
    title_list = [title.text for title in row.find_all('div', {'class':'field field-name-node-title'})]
    ...

Then you need to check for each row whether the title, author, etc. actually exist (len(title_list) > 0), and if not, append "None" or something similar to the corresponding list. Then everything in your df should line up correctly.
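A sketch of what that per-row loop could look like. The 'views-row' container comes from the answer above and the field class names come from the question's code, but the helper function first_text is only illustrative and the selectors are worth re-checking against the actual page source:

def first_text(row, css_class, default='None'):
    # return the text of the first matching div inside this row,
    # or a placeholder when the field is missing for this article
    node = row.find('div', {'class': css_class})
    return node.text.replace('\n', '') if node else default

for item in webpages:
    data = r.get(item, headers={'User-Agent': 'Mozilla/5.0'})
    page_soup = soup(data.text, 'html.parser')
    for row in page_soup.find_all('div', {'class': 'views-row'}):
        titles.append(first_text(row, 'field field-name-node-title'))
        authors.append(first_text(row, 'field field--name-field-citation-authors field--type-string field--label-hidden field__item'))
        pages.append(first_text(row, 'field field--name-field-citation-pages field--type-string field--label-hidden field__item'))
        dates.append(first_text(row, 'field field--name-field-date field--type-datetime field--label-hidden field__item'))
        issues.append(first_text(row, 'field field--name-field-issue-number field--type-integer field--label-hidden field__item'))
        em = row.find('em')
        journals.append(em.text.replace('\n', '') if em else 'None')

Because every list receives exactly one entry per row, all the lists stay the same length and each value lands in the correct row of the DataFrame.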

This post is about python - BeautifulSoup4 scraping: Pandas "arrays must all be same length" when exporting data to csv; a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52076814/
