python - Scraping with Beautiful Soup

Reposted by: 太空宇宙 | Updated: 2023-11-03 18:26:39

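I'm scraping an article with BeautifulSoup. I want all of the p tags in the article body except those inside one particular section. Can anyone hint at what I'm doing wrong? I'm not getting an error; the output just doesn't change. Currently it picks up the word "Print" from the unwanted section and writes it out along with the other p tags.

The section I want to ignore: soup.find("div", {'class': 'add-this'})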

    import urllib2
    from bs4 import BeautifulSoup

    url = 'http://www.un.org/apps/news/story.asp?NewsID=47549&Cr=burundi&Cr1=#.U0vmB8fTYig'

    # Parse HTML of article, aka making soup
    soup = BeautifulSoup(urllib2.urlopen(url).read())

    # Retrieve all of the paragraphs
    tags = soup.find("div", {'id': 'fullstory'}).find_all('p')
    for tag in tags:
        ptags = soup.find("div", {'class': 'add-this'})
        for tag in ptags:
            txt.write(tag.nextSibling.text.encode('utf-8') + '\n' + '\n')
        else:
            txt.write(tag.text.encode('utf-8') + '\n' + '\n')

Best answer

One option is to pass recursive=False, so that p tags nested inside other elements of the fullstory div are not searched:

    # Only direct children of the fullstory div are considered
    tags = soup.find("div", {'id': 'fullstory'}).find_all('p', recursive=False)
    for tag in tags:
        print(tag.text)

This retrieves only the top-level paragraphs of the div and prints the complete article:

10 April 2014  The United Nations today called on the Government...
...
...follow up with the Government on these concerns.
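
To see the effect without fetching the live page, here is a minimal sketch run against a made-up HTML snippet. SAMPLE_HTML and both helper function names are hypothetical and only mimic the structure described in the question (a fullstory div containing an add-this div); the real page's markup may differ. The second helper shows an alternative that is not part of the answer above: removing the unwanted div with decompose() before collecting paragraphs, which may be useful if the article's own paragraphs are not all direct children of the div.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article page; the real markup may differ.
SAMPLE_HTML = """
<div id="fullstory">
  <p>First paragraph.</p>
  <div class="add-this"><p>Print</p></div>
  <p>Second paragraph.</p>
</div>
"""


def top_level_paragraphs(html):
    """Return the text of <p> tags that are direct children of div#fullstory."""
    soup = BeautifulSoup(html, "html.parser")
    story = soup.find("div", {"id": "fullstory"})
    # recursive=False restricts the search to direct children only
    return [p.get_text() for p in story.find_all("p", recursive=False)]


def paragraphs_without_widget(html):
    """Alternative: drop div.add-this from the tree, then collect every <p>."""
    soup = BeautifulSoup(html, "html.parser")
    story = soup.find("div", {"id": "fullstory"})
    for widget in story.find_all("div", {"class": "add-this"}):
        widget.decompose()  # removes the tag and its contents from the tree
    return [p.get_text() for p in story.find_all("p")]


print(top_level_paragraphs(SAMPLE_HTML))
print(paragraphs_without_widget(SAMPLE_HTML))
```

On this sample, both helpers return the two article paragraphs and skip the "Print" text that lives inside the add-this div.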

This question about scraping with Beautiful Soup comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/23063459/
