Python Blog RSS Feed Scraping BeautifulSoup Output to .txt Files


Apologies in advance for the long block of code to follow. I'm new to BeautifulSoup, but found that there are some useful tutorials that use it to scrape a blog's RSS feed. Full disclosure: this is code adapted from that video tutorial, which was immensely helpful in getting this working: http://www.youtube.com/watch?v=Ap_DlSrT-iE .

Here's my problem: the video does a nice job of showing how to print the relevant content to the console. I need to write each article's text out to a separate .txt file and save it to some directory (right now I'm just trying to save to my Desktop). I know the problem lies in the scope of the two for-loops near the end of the code (I've tried to comment this so people can spot it quickly; it's the last comment, beginning # Here's where I'm lost...), but I can't seem to figure it out on my own.

Currently what the program does is take the text of the last article the program read in and write it out to the number of .txt files indicated by the variable listIterator. So in this case I believe 20 .txt files get written out, but they all contain the text of the last article that was looped over. What I want the program to do is loop over each article and print each article's text to a separate .txt file. Sorry for the verbosity, but any insight would be greatly appreciated.

from urllib import urlopen
from bs4 import BeautifulSoup
import re

# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

# On RSS Feed site, find tags for title of articles and
# tags for article links to be downloaded.

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))

for i in listIterator:
    # Print each title to console to ensure program is working.
    print findPatTitle[i]

    # Read in the linked-to article.
    articlePage = urlopen(findPatLink[i]).read()

    # Find the beginning and end of articles using tags listed below.
    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    # Define article variable that will contain all the content between the
    # beginning of the article to the end as indicated by variables above.
    article = articlePage[divBegin:divEnd]

    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(article)

    # Compile list of all <p> tags for each article and store in paragList
    paragList = soup.findAll('p')

    # Create empty string to eventually convert items in paragList to string to
    # be written to .txt files.
    para_string = ''

    # Here's where I'm lost and have some sort of scope issue with my for-loops.
    for i in paragList:
        para_string = para_string + str(i)
    newlist = range(len(findPatTitle))
    for i in newlist:
        ofile = open(str(listIterator[i])+'.txt', 'w')
        ofile.write(para_string)
        ofile.close()
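
A stripped-down sketch of the failure mode, using hypothetical placeholder data instead of the real feed: the inner write loop opens every file in 'w' mode on each pass of the outer loop, so each pass overwrites all of them.

# Minimal reproduction of the overwrite bug (hypothetical data).
articles = ['first article', 'second article', 'third article']
for text in articles:
    # This inner loop runs once per article, rewriting EVERY file
    # each time; 'w' mode truncates, so earlier contents are lost.
    for n in range(len(articles)):
        with open(str(n) + '.txt', 'w') as f:
            f.write(text)
# Afterwards, 0.txt, 1.txt and 2.txt all contain 'third article'.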

Best Answer

The reason it seems as if only the last article gets written out is that all of the articles are being written, over and over again, into 20 separate files. Let's look at the following:

for i in paragList:
    para_string = para_string + str(i)
newlist = range(len(findPatTitle))
for i in newlist:
    ofile = open(str(listIterator[i])+'.txt', 'w')
    ofile.write(para_string)
    ofile.close()

You are writing para_string over and over into the same 20 files on every iteration. What you need to do instead is append each para_string to a separate list, say paraStringList, and then write all of its contents out to separate files, like this:

for i, var in enumerate(paraStringList):  # enumerate yields (index, item) tuples
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
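
With paraStringList = ['first', 'second'], for example, this writes 0.txt containing 'first' and 1.txt containing 'second'. The with statement also closes each file for you, so there is no need for explicit close() calls.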

This needs to sit outside your main loop, i.e. outside for i in listIterator: (...). Here is a working version of the program:

from urllib import urlopen
from bs4 import BeautifulSoup
import re


webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

findPatTitle = re.findall(patFinderTitle, webpage)[0:4]
findPatLink = re.findall(patFinderLink, webpage)[0:4]

listIterator = []
listIterator[:] = range(len(findPatTitle))
paraStringList = []

for i in listIterator:

    print findPatTitle[i]

    articlePage = urlopen(findPatLink[i]).read()

    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    article = articlePage[divBegin:divEnd]

    soup = BeautifulSoup(article)

    paragList = soup.findAll('p')

    para_string = ''

    for i in paragList:
        para_string += str(i)

    paraStringList.append(para_string)

for i, var in enumerate(paraStringList):
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
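
A side note for later readers: the code above is Python 2 (print statements, urllib.urlopen). A rough Python 3 sketch of the same flow, assuming the feed markup and the site's story-teaser/article-footer markers are unchanged, might look like this:

from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup
import re

# Fetch the feed; urlopen returns bytes in Python 3, so decode explicitly.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read().decode('utf-8')

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

findPatTitle = patFinderTitle.findall(webpage)[0:4]
findPatLink = patFinderLink.findall(webpage)[0:4]

paraStringList = []

for title, link in zip(findPatTitle, findPatLink):
    print(title)
    articlePage = urlopen(link).read().decode('utf-8')

    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")
    article = articlePage[divBegin:divEnd]

    # Passing an explicit parser avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(article, 'html.parser')
    paraStringList.append(''.join(str(p) for p in soup.find_all('p')))

for i, text in enumerate(paraStringList):
    with open('{0}.txt'.format(i), 'w', encoding='utf-8') as writer:
        writer.write(text)

For anything beyond a learning exercise, parsing the feed with a real XML parser (or a dedicated feed library) rather than regular expressions would also be more robust.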

Regarding "Python blog RSS feed scraping BeautifulSoup output to .txt files", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19621473/
