
python - Trying to write a text file and compare it with scraped text, but it's not quite working

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:01:22

I'm trying to write a program that pulls the HTML from a web page and compares it with scraped data I saved earlier. If something has changed, it saves the new HTML to a text file and emails it to me. The problem is that it only writes to the text file sporadically, or not at all, and then it randomly emails me even when nothing has changed. I've been fiddling with this for two weeks and can't work out what's going on. Help!

import requests
import smtplib
import bs4
import os

abbvs = ['MCL', 'PFL', 'OPPL', 'FCPL', 'AnyPL', 'NOLS', 'VanWaPL', 'SLCPL', 'ProPL', 'ArapPL']
openurls = open('/home/ian/PythonPrograms/job-scrape/urls', 'r')
urls = openurls.read().strip('\n').split(',')
olddocs = ['oldMCL', 'oldPFL', 'oldOPPL', 'oldFCPL', 'oldAnyPL', 'oldNOLS', 'oldVanWaPL', 'oldSLCPL', 'oldProPL', 'oldArapPL']
newdocs = ['newMCL', 'newPFL', 'newOPPL', 'newFCPL', 'newAnyPL', 'newNOLS', 'newVanWaPL', 'newSLCPL', 'newProPL', 'newArapPL']
bstags = ['#content', '.col-md-12', '#main', '#containedInVSplit', '.col-sm-7', '.statement-left-div', '#main', '#main', '#componentBox', '.list-group.job-listings']

for url in urls:
    res = requests.get(url)
    res.raise_for_status()
    for bstag in bstags:
        currentsoup = bs4.BeautifulSoup(res.text, "lxml")
        newsoup = currentsoup.select(bstag)
        for newdoc in newdocs:
            if os.path.isfile('/home/ian/Pythonprograms/job-scrape/libsitehtml/'+newdoc) == False:
                createnew = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+newdoc, 'w')

            file = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+newdoc, 'w')
            file.write(str(newsoup))
            file.close()

            new = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+newdoc)
            new = new.read()
            for olddoc in olddocs:
                if os.path.isfile('/home/ian/Pythonprograms/job-scrape/libsitehtml/'+olddoc) == False:
                    createold = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+olddoc, 'w')

                old = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+olddoc)
                old = old.read()

                if str(old) != str(new):
                    file = open('/home/ian/PythonPrograms/job-scrape/libsitehtml/'+olddoc, 'w')
                    file.write(str(new))
                    file.close()

                    server = smtplib.SMTP('smtp.gmail.com', 587)
                    server.ehlo()
                    server.starttls()
                    server.login('dummyemail', 'password')
                    server.sendmail('noreply.job.updates.com', 'myemail', 'Subject: A library\'s jobs page has changed\n' '\n' + 'Here\'s the URL:' + str(url))
                    server.quit()
                elif str(old) == str(new):
                    pass

Best Answer

There are a few problems with your code. The main one is that each loop runs to completion, so you only effectively check the last site. You need to run the comparison on each matched set of abbv, url, and bstag. Python has a well-known built-in for exactly this, zip(), and it's easy to use.
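As a quick illustration of what zip() does here (the list values below are shortened from the question, and the URLs are made-up placeholders, not the real ones from the urls file):

```python
# zip() walks three parallel lists in lockstep, yielding one matching
# (abbv, url, bstag) triple per iteration -- not every combination,
# which is what the nested loops in the original code produce.
abbvs = ['MCL', 'PFL', 'OPPL']
urls = ['http://example.com/mcl', 'http://example.com/pfl', 'http://example.com/oppl']
bstags = ['#content', '.col-md-12', '#main']

for abbv, url, bstag in zip(abbvs, urls, bstags):
    print(abbv, url, bstag)
```

Each site is therefore scraped with its own URL and its own CSS selector, which is what the original triple-nested loop was trying (and failing) to do.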

Also, you don't need to store the newly scraped data in its own file, since it can be compared directly with the old data (and the old file updated only when something has changed). With those changes, your code could look like this:

import requests
import smtplib
import bs4
import os

abbvs = ['MCL', 'PFL', 'OPPL', 'FCPL', 'AnyPL', 'NOLS', 'VanWaPL', 'SLCPL', 'ProPL', 'ArapPL']
openurls = open('/home/ian/PythonPrograms/job-scrape/urls', 'r')
urls = openurls.read().strip('\n').split(',')
bstags = ['#content', '.col-md-12', '#main', '#containedInVSplit', '.col-sm-7', '.statement-left-div', '#main', '#main', '#componentBox', '.list-group.job-listings']

for abbv, url, bstag in zip(abbvs, urls, bstags):
    res = requests.get(url)
    res.raise_for_status()
    olddoc = 'old'+abbv
    currentsoup = bs4.BeautifulSoup(res.text, "lxml")
    newsoup = str(currentsoup.select(bstag))

    filepath = '/home/ian/Pythonprograms/job-scrape/libsitehtml/'+olddoc
    if os.path.isfile(filepath):
        with open(filepath) as old:
            oldsoup = old.read()
    else:
        oldsoup = ''

    if newsoup != oldsoup:
        with open(filepath, 'w') as new:
            new.write(newsoup)
        server = smtplib.SMTP('smtp.gmail.com', 587)
        server.ehlo()
        server.starttls()
        server.login('dummyemail', 'password')
        server.sendmail('noreply.job.updates.com', 'myemail', 'Subject: A library\'s jobs page has changed\n' '\n' + 'Here\'s the URL:' + str(url))
        server.quit()

I haven't tested the above, though, so it may contain some errors, but it should be a starting point. You should also consider building a dict with the abbvs as keys and the urls as values, since they belong together.
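A minimal sketch of that dict idea (the URLs below are hypothetical placeholders, since the real urls file isn't shown, and the selectors are taken from the question):

```python
# Hypothetical restructuring: keep each site's settings together in one
# dict keyed by abbreviation, so the parallel lists can't drift out of
# sync when a site is added or removed.
sites = {
    'MCL': {'url': 'http://example.com/mcl-jobs', 'tag': '#content'},
    'PFL': {'url': 'http://example.com/pfl-jobs', 'tag': '.col-md-12'},
    'OPPL': {'url': 'http://example.com/oppl-jobs', 'tag': '#main'},
}

for abbv, site in sites.items():
    # Each iteration has everything it needs for one site.
    print(abbv, site['url'], site['tag'])
```

This removes the need for zip() entirely: adding a site becomes a single dict entry instead of three edits in three separate lists.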

For "python - Trying to write a text file and compare it with scraped text, but it's not quite working", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/45068080/
