
python - Parsing strings in Python using Regex and BeautifulSoup

Reposted. Author: 太空宇宙. Updated: 2023-11-04 15:05:54

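I have a series of strings that all look like "Saturday, December 27th, 2014". I want to drop the "Saturday" and save a file named "141227", that is, year + month + day. Everything works so far, except that I cannot get the regular expressions for daypos or yearpos to work. Both give the same error: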

Traceback (most recent call last):
  File "scrapewaybackblog.py", line 17, in <module>
    daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
TypeError: expected a character buffer object

What is a character buffer object? Does it mean something is wrong with my expression? Here is my script:

for i in xrange(3, 1, -1):
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
    soup = BeautifulSoup(page.read())
    snippet = soup.find_all('div', attrs={'class': 'blog-box'})
    for div in snippet:
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

        monthpos = byline.find(",")
        daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
        yearpos = byline.find(re.compile("[A-Z][a-z]*\D\d*\w*\s"))
        endpos = monthpos + len(byline)

        month = byline[monthpos+1:daypos]
        day = byline[daypos+0:yearpos]
        year = byline[yearpos+2:endpos]

        output_files_pathname = 'Data/' # path where output will go
        new_filename = year + month + day + ".txt"
        outfile = open(output_files_pathname + new_filename,'w')
        outfile.write(date)
        outfile.write("\n")
        outfile.write(text)
        outfile.close()
    print "finished another url from page {}".format(i)

I also haven't figured out how to make December = 12, but that is for another time. Please just help me find the right positions.

Best answer

Instead of parsing the date string with regular expressions, parse it with dateutil:

from dateutil.parser import parse

for div in soup.select('div.blog-box'):
    byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
    text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

    dt = parse(byline)
    # %y%m%d yields the zero-padded YYMMDD name asked for, e.g. 141227.txt
    new_filename = dt.strftime("%y%m%d") + ".txt"
    ...
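A note on the filename: interpolating dt.year, dt.month and dt.day directly does not zero-pad single-digit months or days and keeps the four-digit year, so it cannot produce a name like "141227". The following is a minimal sketch of the difference, assuming a hand-built datetime as a stand-in for the value that parse(byline) would return (the date itself is hypothetical):

```python
from datetime import datetime

# Hypothetical stand-in for the datetime returned by parse(byline).
dt = datetime(2014, 1, 5)

# Plain attribute interpolation: four-digit year, no zero padding.
unpadded = "{dt.year}{dt.month}{dt.day}.txt".format(dt=dt)

# strftime codes: two-digit year, zero-padded month and day.
padded = dt.strftime("%y%m%d") + ".txt"

print(unpadded)
print(padded)
```

Zero-padded names also sort chronologically when listed alphabetically, which is usually what you want for dated files.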

Alternatively, you can parse the string with datetime.strptime(), but you need to take care of the ordinal suffixes:

import re
from datetime import datetime

byline = re.sub(r"(?<=\d)(st|nd|rd|th)", "", byline)
dt = datetime.strptime(byline, '%A, %B %d %Y')

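Here re.sub() finds an st, nd, rd or th suffix that follows a digit and replaces it with an empty string. After that, the date string matches the '%A, %B %d %Y' format (if the byline has a comma before the year, the format needs one after %d as well). See the "strftime() and strptime() Behavior" section of the Python datetime documentation for the format codes.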
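The strptime() route can be sketched end to end with the standard library only. The sample byline below is an assumption chosen to match the '%A, %B %d %Y' format once the suffix is removed; the real page may punctuate the date differently. The sketch also covers the "December = 12" part of the question, since the parsed datetime exposes the month as a number:

```python
import re
from datetime import datetime

# Assumed sample byline (no comma before the year, to fit the format below).
byline = "Saturday, December 27th 2014"

# Drop an ordinal suffix only when it directly follows a digit.
cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)", "", byline)

# %A = weekday name, %B = month name, %d = day, %Y = four-digit year.
dt = datetime.strptime(cleaned, "%A, %B %d %Y")

month_number = dt.month          # December becomes 12
stem = dt.strftime("%y%m%d")     # year + month + day, zero-padded

print(cleaned, month_number, stem)
```

Note that %A and %B match names in the current locale, so this assumes English month and weekday names are in effect.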


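Some additional notes, all reflected in the fixed version: the urlopen() result is passed directly to BeautifulSoup, the blog-box divs are selected with the CSS selector div.blog-box, the output path is built with os.path.join(), the file is written inside a with block so that it is closed automatically, and the undefined name date is replaced with byline.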
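The file-handling pattern can be tried in isolation. This is a minimal sketch, assuming a temporary directory as a stand-in for Data/ and placeholder values for the filename, byline and text:

```python
import os
import tempfile

# Stand-in for the 'Data' directory; all values below are placeholders.
output_dir = tempfile.mkdtemp()
new_filename = "141227.txt"
byline = "Saturday, December 27th, 2014"
text = "Body of the blog post."

# os.path.join() inserts the right separator for the platform.
path = os.path.join(output_dir, new_filename)

# The with block closes the file even if a write raises.
with open(path, "w") as outfile:
    outfile.write(byline)
    outfile.write("\n")
    outfile.write(text)

with open(path) as infile:
    saved = infile.read()

print(saved)
```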

Fixed version:

import os
import urllib2

from bs4 import BeautifulSoup
from dateutil.parser import parse


for i in xrange(3, 1, -1):
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
    soup = BeautifulSoup(page)

    for div in soup.select('div.blog-box'):
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

        dt = parse(byline)

        # %y%m%d yields the zero-padded YYMMDD name asked for, e.g. 141227.txt
        new_filename = dt.strftime("%y%m%d") + ".txt"
        with open(os.path.join('Data', new_filename), 'w') as outfile:
            outfile.write(byline)
            outfile.write("\n")
            outfile.write(text)

    print "finished another url from page {}".format(i)
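As for the original error: str.find() accepts only a substring, not a compiled pattern, which is why passing re.compile(...) raises a TypeError (Python 2 words it as "expected a character buffer object", Python 3 words it differently). If positions are still wanted, re.search() returns a match object that exposes them. A minimal sketch, assuming a sample byline and a simple \d+ pattern for the day rather than the patterns from the question:

```python
import re

# Assumed sample byline.
byline = "Saturday, December 27th, 2014"

# str.find() works with a plain substring and returns its index.
comma_pos = byline.find(",")

# Passing a compiled pattern to str.find() raises TypeError.
error = None
try:
    byline.find(re.compile(r"\d+"))
except TypeError as exc:
    error = exc

# re.search() is the regex counterpart; the match object knows its span.
match = re.search(r"\d+", byline)
day = byline[match.start():match.end()]

print(comma_pos, type(error).__name__, day)
```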

Regarding "python - Parsing strings in Python using Regex and BeautifulSoup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27672665/
