Python Blog RSS Feed Scraping BeautifulSoup Output to .txt Files


Apologies in advance for the long block of code to follow. I'm new to BeautifulSoup, but found that there are some useful tutorials that use it to scrape a blog's RSS feed. Full disclosure: this is code adapted from that video tutorial, which was immensely helpful in getting this working: http://www.youtube.com/watch?v=Ap_DlSrT-iE .

Here's my problem: the video does a nice job of showing how to print the relevant content to the console. I need to write each article's text out to a separate .txt file and save it to some directory (right now I'm just trying to save to my Desktop). I know the problem lies in the scope of the two for-loops near the end of the code (I've tried to comment this so people can spot it quickly; it's the last comment, beginning # Here's where I'm lost...), but I can't seem to figure it out on my own.

Currently what the program does is take the text of the last article the program read in and write it out to the number of .txt files indicated by the variable listIterator. So in this case I believe 20 .txt files get written out, but they all contain the text of the last article that was looped over. What I want the program to do is loop over each article and print each article's text to a separate .txt file. Sorry for the verbosity, but any insight would be greatly appreciated.

from urllib import urlopen
from bs4 import BeautifulSoup
import re

# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

# On RSS Feed site, find tags for title of articles and
# tags for article links to be downloaded.

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))

for i in listIterator:
    # Print each title to console to ensure program is working.
    print findPatTitle[i]

    # Read in the linked-to article.
    articlePage = urlopen(findPatLink[i]).read()

    # Find the beginning and end of articles using tags listed below.
    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    # Define article variable that will contain all the content between the
    # beginning of the article to the end as indicated by variables above.
    article = articlePage[divBegin:divEnd]

    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(article)

    # Compile list of all <p> tags for each article and store in paragList
    paragList = soup.findAll('p')

    # Create empty string to eventually convert items in paragList to string to
    # be written to .txt files.
    para_string = ''

    # Here's where I'm lost and have some sort of scope issue with my for-loops.
    for i in paragList:
        para_string = para_string + str(i)
    newlist = range(len(findPatTitle))
    for i in newlist:
        ofile = open(str(listIterator[i])+'.txt', 'w')
        ofile.write(para_string)
        ofile.close()
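
A stripped-down sketch of the failure mode, using hypothetical placeholder data instead of the real feed: the inner write loop opens every file in 'w' mode on each pass of the outer loop, so each pass overwrites all of them.

# Minimal reproduction of the overwrite bug (hypothetical data).
articles = ['first article', 'second article', 'third article']
for text in articles:
    # This inner loop runs once per article, rewriting EVERY file
    # each time; 'w' mode truncates, so earlier contents are lost.
    for n in range(len(articles)):
        with open(str(n) + '.txt', 'w') as f:
            f.write(text)
# Afterwards, 0.txt, 1.txt and 2.txt all contain 'third article'.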

Best Answer

The reason it seems as if only the last article gets written out is that all of the articles are being written, over and over again, into 20 separate files. Let's look at the following:

for i in paragList:
    para_string = para_string + str(i)
newlist = range(len(findPatTitle))
for i in newlist:
    ofile = open(str(listIterator[i])+'.txt', 'w')
    ofile.write(para_string)
    ofile.close()

You are writing para_string over and over into the same 20 files on every iteration. What you need to do instead is append each para_string to a separate list, say paraStringList, and then write all of its contents out to separate files, like this:

for i, var in enumerate(paraStringList):  # enumerate yields (index, item) tuples
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
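
With paraStringList = ['first', 'second'], for example, this writes 0.txt containing 'first' and 1.txt containing 'second'. The with statement also closes each file for you, so there is no need for explicit close() calls.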

This needs to sit outside your main loop, i.e. outside for i in listIterator: (...). Here is a working version of the program:

from urllib import urlopen
from bs4 import BeautifulSoup
import re


webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

findPatTitle = re.findall(patFinderTitle, webpage)[0:4]
findPatLink = re.findall(patFinderLink, webpage)[0:4]

listIterator = []
listIterator[:] = range(len(findPatTitle))
paraStringList = []

for i in listIterator:

    print findPatTitle[i]

    articlePage = urlopen(findPatLink[i]).read()

    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    article = articlePage[divBegin:divEnd]

    soup = BeautifulSoup(article)

    paragList = soup.findAll('p')

    para_string = ''

    for i in paragList:
        para_string += str(i)

    paraStringList.append(para_string)

for i, var in enumerate(paraStringList):
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
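
A side note for later readers: the code above is Python 2 (print statements, urllib.urlopen). A rough Python 3 sketch of the same flow, assuming the feed markup and the site's story-teaser/article-footer markers are unchanged, might look like this:

from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup
import re

# Fetch the feed; urlopen returns bytes in Python 3, so decode explicitly.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read().decode('utf-8')

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

findPatTitle = patFinderTitle.findall(webpage)[0:4]
findPatLink = patFinderLink.findall(webpage)[0:4]

paraStringList = []

for title, link in zip(findPatTitle, findPatLink):
    print(title)
    articlePage = urlopen(link).read().decode('utf-8')

    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")
    article = articlePage[divBegin:divEnd]

    # Passing an explicit parser avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(article, 'html.parser')
    paraStringList.append(''.join(str(p) for p in soup.find_all('p')))

for i, text in enumerate(paraStringList):
    with open('{0}.txt'.format(i), 'w', encoding='utf-8') as writer:
        writer.write(text)

For anything beyond a learning exercise, parsing the feed with a real XML parser (or a dedicated feed library) rather than regular expressions would also be more robust.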

Regarding "Python blog RSS feed scraping BeautifulSoup output to .txt files", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19621473/
