python - Web scraping articles - individual co-author data

I am scraping articles published in The Milbank Quarterly. I am particularly interested in data about the authors and their affiliated institutions. I wrote code using the beautifulsoup and pandas libraries to save my output as a CSV. The CSV contains one row per article, which means that for an article with multiple authors, the "author" column holds all of its authors and the "institution" column holds all of the co-authors' institutions. Instead, I would like the output CSV to have one row per author, in other words multiple rows per article. This is because I ultimately want to count how many times each institution appears in the journal.

I used beautifulsoup's .find_all method to get all of the data. Initially I tried .find_all_next for the authors and institutions, thinking it would accommodate articles with multiple authors, but it returned nothing at all for those columns.
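
For reference, these article pages expose one <meta name="citation_author"> tag per author in the page head (the citation_* meta convention used for Google Scholar indexing), so find_all does return one element per author. A minimal, self-contained illustration with made-up markup, not real Wiley HTML:

from bs4 import BeautifulSoup

# Toy page head with two authors; illustrative only, not actual Wiley markup.
html = '''
<head>
<meta name="citation_author" content="Jane Doe">
<meta name="citation_author_institution" content="University A">
<meta name="citation_author" content="John Roe">
<meta name="citation_author_institution" content="University B">
</head>
'''
soup = BeautifulSoup(html, 'lxml')
print([m['content'] for m in soup.find_all('meta', {'name': 'citation_author'})])
# ['Jane Doe', 'John Roe']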

What is the best way to rewrite this code so that each author gets their own row?

import pandas as pd
import numpy as np
import requests
import re
import urllib
from bs4 import BeautifulSoup
from bs4 import SoupStrainer

articletype = list()
articlelist = list()
titlelist = list()
vollist = list()
issuenumlist = list()
authorlist = list()
instlist = list()
urllist = list()

issueurllist = ['https://onlinelibrary.wiley.com/toc/14680009/2018/96/1',
                'https://onlinelibrary.wiley.com/toc/14680009/2018/96/2',
                'https://onlinelibrary.wiley.com/toc/14680009/2018/96/3',
                'https://onlinelibrary.wiley.com/toc/14680009/2018/96/4']

for issue in issueurllist:
    requrl = requests.get(issue)
    soup = BeautifulSoup(requrl.text, 'lxml')

    # Open the url of each article.
    baseurl = 'https://onlinelibrary.wiley.com'

    for article in issue:
        doi = [a.get('href') for a in soup.find_all('a', title="Full text")]

    for d in doi:
        doilink = baseurl + d
        opendoi = requests.get(doilink)
        articlesoup = BeautifulSoup(opendoi.text, 'lxml')

        # Get metadata for each article.
        for tag in articlesoup:
            arttype = articlesoup.find_all("span", {"class": "primary-heading"})
            title = articlesoup.find_all("meta", {"name": "citation_title"})
            vol = articlesoup.find_all("meta", {"name": "citation_volume"})
            issuenum = articlesoup.find_all("meta", {"name": "citation_issue"})
            author = articlesoup.find_all("meta", {"name": "citation_author"})
            institution = articlesoup.find_all("meta", {"name": "citation_author_institution"})
            url = articlesoup.find_all("meta", {"name": "citation_fulltext_html_url"})

        articletype.append(arttype)
        titlelist.append(title)
        vollist.append(vol)
        issuenumlist.append(issuenum)
        authorlist.append(author)
        instlist.append(institution)
        urllist.append(url)

milbankdict = {'article type': articletype, 'title': titlelist, 'vol': vollist,
               'issue': issuenumlist, 'author': authorlist,
               'author institution': instlist, 'url': urllist}
milbank2018 = pd.DataFrame(milbankdict)
milbank2018.to_csv('milbank2018.csv')
print("saved")

Best Answer

The find_all method always returns a list. As you can see, I validate that each tag object is not None. This is an important test case, because some articles do not include a given meta attribute, in which case find returns None. There is no need for a separate list per meta attribute; you can manage everything with a dictionary. Here I build one dictionary per article and associate all of its meta attributes with it.

strip() is a built-in Python string method that removes all leading and trailing whitespace from a string.
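
For example:

>>> "  The Milbank Quarterly \n".strip()
'The Milbank Quarterly'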

import requests
from bs4 import BeautifulSoup
import pandas as pd

issueurllist = ['https://onlinelibrary.wiley.com/toc/14680009/2018/96/1',
                'https://onlinelibrary.wiley.com/toc/14680009/2018/96/2',
                'https://onlinelibrary.wiley.com/toc/14680009/2018/96/3',
                'https://onlinelibrary.wiley.com/toc/14680009/2018/96/4']

base_url = 'https://onlinelibrary.wiley.com'

json_data = []

for issue in issueurllist:
    response1 = requests.get(issue)
    soup1 = BeautifulSoup(response1.text, 'lxml')

    # Collect the DOI link of every full-text article in this issue.
    doi = [a.get('href') for a in soup1.find_all('a', title="Full text")]

    for i in doi:
        article_dict = {"article": "NaN", "title": "NaN", "vol": "NaN", "issue": "NaN",
                        "author": "NaN", "institution": "NaN", "url": "NaN"}
        article_url = base_url + i
        response2 = requests.get(article_url)
        soup2 = BeautifulSoup(response2.text, 'lxml')

        # Get the metadata for each article. find() returns the first matching
        # tag, or None when the page lacks that meta attribute.
        article = soup2.find("span", {"class": "primary-heading"})
        title = soup2.find("meta", {"name": "citation_title"})
        vol = soup2.find("meta", {"name": "citation_volume"})
        issue_num = soup2.find("meta", {"name": "citation_issue"})  # renamed so it does not shadow the loop variable
        author = soup2.find("meta", {"name": "citation_author"})
        institution = soup2.find("meta", {"name": "citation_author_institution"})
        url = soup2.find("meta", {"name": "citation_fulltext_html_url"})

        if article is not None:
            article_dict['article'] = article.text.strip()

        if title is not None:
            article_dict['title'] = title['content'].strip()

        if vol is not None:
            article_dict['vol'] = vol['content'].strip()

        if issue_num is not None:
            article_dict['issue'] = issue_num['content'].strip()

        if author is not None:
            article_dict['author'] = author['content'].strip()

        if institution is not None:
            article_dict['institution'] = institution['content'].strip()

        if url is not None:
            article_dict['url'] = url['content'].strip()

        json_data.append(article_dict)

df = pd.DataFrame(json_data)
df.to_csv('milbank2018.csv')
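
Note that this answer still produces one row per article, and because find returns only the first match it records just the first listed author and institution. To get one row per co-author, as the question asks, the single-author handling inside the `for i in doi:` loop can be replaced with find_all plus zip. A minimal sketch (not the answerer's code), assuming the citation_author and citation_author_institution meta tags appear in matching one-to-one order, which is worth verifying on real pages since an author can list zero or several institutions:

# Inside the `for i in doi:` loop, after the article-level fields of
# article_dict have been filled in, replace the single-author handling with:
authors = [m['content'].strip()
           for m in soup2.find_all("meta", {"name": "citation_author"})]
institutions = [m['content'].strip()
                for m in soup2.find_all("meta", {"name": "citation_author_institution"})]

# One output row per (author, institution) pair. zip assumes the two
# tag lists line up one-to-one, which is an assumption to check.
for name, inst in zip(authors, institutions):
    row = dict(article_dict)      # copy the shared article-level fields
    row['author'] = name
    row['institution'] = inst
    json_data.append(row)

Once each author has their own row, the count the question is ultimately after reduces to df['institution'].value_counts().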

Regarding python - Web scraping articles - individual co-author data, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56799999/
