
python - Web scraping: I'm only getting 1/10 of the text I want (using BeautifulSoup)

Reposted · Author: 太空宇宙 · Updated: 2023-11-04 03:39:44

I'm trying to scrape data from a webpage. All the text I want sits between <p class="heading2"> and More... .

It works for the first batch of text, but only for that one.

For example, I get:

Info about grant 1

But the website has:

Info about grant 1 
Info about grant 2
Info about grant 3
etc.

Here is the code I'm using. I'm new to BeautifulSoup and hope someone can help!

from bs4 import BeautifulSoup
import sheetsync
import urllib2, csv

url = urllib2.urlopen('http://www.asanet.org/funding/funding_and_grants.cfm').read()

def processData():
    url = urllib2.urlopen('http://www.asanet.org/funding/funding_and_grants.cfm').read()
    soup = BeautifulSoup(url)
    metaData = soup.find_all("div", {"id": "memberscontent"})
    authors = []
    for html in metaData:
        text = BeautifulSoup(str(html).strip()).encode("utf-8").replace("Deadline", "DEADLINE").replace('\s+', ' ').replace('\n+', ' ').replace('\s+', ' ')
        authors.append(text.split('<p class="heading2">')[1].split('More...')[0].strip())  # get Pos
    txt = 'grants.txt'
    with open(txt, 'ab') as out:
        out.writelines(authors)

processData()
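Why does this return only the first grant? The call to split('<p class="heading2">')[1] indexes a single element of the split list, so everything after the second heading is discarded. A minimal sketch with hypothetical sample text shows the effect:

```python
# Hypothetical sample mimicking the page markup: two grants, two headings.
text = 'intro<p class="heading2">Info about grant 1 More... <p class="heading2">Info about grant 2 More...'

# split() produces one chunk per heading, but [1] keeps only the first chunk.
chunks = text.split('<p class="heading2">')
first_only = chunks[1].split('More...')[0].strip()
print(first_only)  # Info about grant 1

# Iterating over chunks[1:] instead would recover every grant.
all_grants = [c.split('More...')[0].strip() for c in chunks[1:]]
print(all_grants)  # ['Info about grant 1', 'Info about grant 2']
```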

Best Answer

I would rely on heading2 and get the next two p tag siblings: the first is the deadline, the second is the grant's text:

import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.asanet.org/funding/funding_and_grants.cfm'))

for heading in soup.select('div#memberscontent p.heading2'):
    deadline = heading.find_next_sibling('p')
    article = deadline.find_next_sibling('p')

    print heading.get_text(strip=True)
    print deadline.get_text(strip=True)
    print article.get_text(strip=True)
    print "----"

Prints:

The Sydney S. Spivack Program in Applied Social Research and Social PolicyASA Congressional Fellowship
Deadline: February 15
The ASA encourages applications for its Congressional Fellowship. The Fellowship brings a PhD-level sociologist to Washington, DC, to work as a staff member on a congressional committee, in a congressional member office, or in a congressional agency (e.g., the Government Accountability Office). This intensive six-month experience reveals the intricacies of the policy making process to the sociological fellow, and shows the usefulness of sociological data and concepts to policy issues.  [More...]
----
Community Action Research Initiative (CARI Grants) The Sydney S. Spivack Program in Applied Social Research and Social Policy
Deadline:  February 15
To encourage sociologists to undertake community action projects that bring social science knowledge, methods, and expertise to bear in addressing community-identified issues and concerns, ASA administers competitive CARI awards. Grant applications are encouraged from sociologists seeking to work with community organizations, local public interest groups, or community action projects. Appointments will run for the duration of the project, whether the activity is to be undertaken during the year, in the summer, or for other time-spans.   [More...]
----
Fund for the Advancement of the Discipline
Deadlines:  June 15 | December 15
The American Sociological Association invites submissions by PhD sociologists for the Fund for the Advancement of the Discipline (FAD) awards. Supported by the American Sociological Association through a matching grant from the National Science Foundation, the goal of this project is to nurture the development of scientific knowledge by funding small, groundbreaking research initiatives and other important scientific research activities such as conferences. FAD awards provide scholars with small grants ($7,000 maximum) for innovative research that has the potential for challenging the discipline, stimulating new lines of research, and creating new networks of scientific collaboration. The award is intended to provide opportunities for substantive and methodological breakthroughs, broaden the dissemination of scientific knowledge, and provide leverage for acquisition of additional research funds.  [More...]
----
...
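The answer's code is Python 2 (urllib2, print statements). As a minimal Python 3 sketch of the same sibling-walking technique, run against an inline HTML fragment modeled on the page structure described above (the fragment and grant names are hypothetical; the original URL may no longer resolve):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the page layout: each grant is a
# p.heading2 followed by two plain <p> siblings (deadline, then text).
html = """
<div id="memberscontent">
  <p class="heading2">Grant A</p>
  <p>Deadline: February 15</p>
  <p>Info about grant 1 [More...]</p>
  <p class="heading2">Grant B</p>
  <p>Deadline: June 15</p>
  <p>Info about grant 2 [More...]</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

grants = []
for heading in soup.select("div#memberscontent p.heading2"):
    deadline = heading.find_next_sibling("p")  # first <p> sibling: the deadline
    article = deadline.find_next_sibling("p")  # second <p> sibling: the grant text
    grants.append((heading.get_text(strip=True),
                   deadline.get_text(strip=True),
                   article.get_text(strip=True)))

for name, deadline, article in grants:
    print(name, "|", deadline, "|", article)
```

Because the loop visits every p.heading2 rather than indexing a single split result, it collects all the grants, not just the first.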

Regarding "python - Web scraping: I'm only getting 1/10 of the text I want (using BeautifulSoup)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27068798/
