
python - Scraping with BeautifulSoup and multiple paragraphs

Reposted. Author: 太空狗. Updated: 2023-10-29 20:20:37

I'm trying to use BeautifulSoup to scrape a speech from a website. However, I'm running into trouble because the speech is split across many different paragraphs. I'm very new to programming and can't figure out how to handle this. The page's HTML looks like this:

<span class="displaytext">Thank you very much. Mr. Speaker, Vice President Cheney, 
Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is
at war; our economy is in recession; and the civilized world faces unprecedented dangers.
Yet, the state of our Union has never been stronger.
<p>We last met in an hour of shock and suffering. In 4 short months, our Nation has comforted the victims,
begun to rebuild New York and the Pentagon, rallied a great coalition, captured, arrested, and
rid the world of thousands of terrorists, destroyed Afghanistan's terrorist training camps,
saved a people from starvation, and freed a country from brutal oppression.
<p>The American flag flies again over our Embassy in Kabul. Terrorists who once occupied
Afghanistan now occupy cells at Guantanamo Bay. And terrorist leaders who urged followers to
sacrifice their lives are running for their own.

It continues like that for a while, with multiple paragraph tags. I'm trying to extract all of the text inside the span.

I've tried several different approaches to get the text, but none of them gave me the text I want.

The first thing I tried was:

import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
print thespan.string

This gave me:

Mr. Speaker, Vice President Cheney, Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is at war; our economy is in recession; and the civilized world faces unprecedented dangers. Yet, the state of our Union has never been stronger.

That's the portion of the text up to the first paragraph tag. Then I tried:

import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
for section in thespan:
    paragraph = section.findNext('p')
    if paragraph and paragraph.string:
        print '>', paragraph.string
    else:
        print '>', section.parent.next.next.strip()

This gave me the text between the first paragraph tag and the second paragraph tag. So I'm looking for a way to get the entire text, not just pieces of it.
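The root of the trouble is that the page's `<p>` tags are never closed, so tree-based parsing splits the span's text into odd pieces. One way to see this, sketched with the standard library's event-driven `html.parser` and a made-up stand-in for the page's markup (unclosed tags are harmless here, because the parser only tracks whether it is inside the span):

```python
from html.parser import HTMLParser

class SpeechExtractor(HTMLParser):
    """Collect all text between <span class="displaytext"> and </span>."""
    def __init__(self):
        super().__init__()
        self.in_span = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "displaytext") in attrs:
            self.in_span = True
        elif tag == "p" and self.in_span:
            # An unclosed <p> still fires a start-tag event.
            self.chunks.append("\n\n")

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.chunks.append(data.strip())

# Made-up sample with the same unclosed-<p> shape as the real page.
sample = ('<span class="displaytext">First bit.'
          '<p>Second paragraph.'
          '<p>Third paragraph.</span>')

parser = SpeechExtractor()
parser.feed(sample)
text = "".join(parser.chunks)
print(text)
```

Because no tree is built, nothing is lost when the parser encounters the malformed paragraphs.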

Best Answer

import urllib2,sys
from BeautifulSoup import BeautifulSoup

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read())

span = soup.find("span", {"class":"displaytext"}) # span.string gives you the first bit
paras = [x.contents[0] for x in span.findAllNext("p")] # this gives you the rest
# use .contents[0] instead of .string to deal with last para that's not well formed

print "%s\n\n%s" % (span.string, "\n\n".join(paras))

As pointed out in the comments, this approach fails if the <p> tags contain further nested tags. That case can be handled with:

paras = ["".join(x.findAll(text=True)) for x in span.findAllNext("p")]

However, that doesn't work well for the last <p>, which has no closing tag. A hacky workaround is to treat it differently. For example:

import urllib2,sys
from BeautifulSoup import BeautifulSoup

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read())
span = soup.find("span", {"class":"displaytext"})
paras = [x for x in span.findAllNext("p")]

start = span.string
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]
print "%s\n\n%s\n\n%s" % (start, middle, last)
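For readers on Python 3: `urllib2` and the old `BeautifulSoup` 3 module are gone, but the same idea carries over to the modern `beautifulsoup4` package (imported as `bs4`), where `findAllNext` becomes `find_all_next` and the `findAll(text=True)` join becomes `get_text()`. A sketch against a small stand-in document rather than the live URL:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Stand-in for the page: the <p> tags follow the span in document order.
html = """<span class="displaytext">First bit.</span>
<p>Second <i>paragraph</i>.</p>
<p>Third paragraph.</p>"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", {"class": "displaytext"})

# get_text() flattens nested tags such as <i>, so no special-casing is needed.
paras = [p.get_text() for p in span.find_all_next("p")]
print("%s\n\n%s" % (span.string, "\n\n".join(paras)))
```

Note that on the real, malformed page the resulting tree shape depends on which parser backend bs4 uses, so the span/paragraph layout assumed here should be verified against the actual parse.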

Regarding python - Scraping with BeautifulSoup and multiple paragraphs, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/8331579/
