
python - Convert a PDF's CreationTime to a readable format in Python

Reposted. Author: 太空狗. Updated: 2023-10-30 00:41:13

I'm processing PDFs with Python and using PDFMiner to access each file's metadata. I extract the information like this:

from pdfminer.pdfparser import PDFParser, PDFDocument    
fp = open('diveintopython.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize()

print(doc.info[0]['CreationDate'])
# This prints the value "D:20130501200439+01'00'"

How can I convert D:20130501200439+01'00' into a readable format in Python?

Best answer

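I found the format documented here. I also needed to handle time zones, because I have 160,000 documents from all over the world to process. Here is my complete solution: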

import datetime
import re
from dateutil.tz import tzutc, tzoffset


pdf_date_pattern = re.compile(''.join([
    r"(D:)?",
    r"(?P<year>\d\d\d\d)",
    r"(?P<month>\d\d)",
    r"(?P<day>\d\d)",
    r"(?P<hour>\d\d)",
    r"(?P<minute>\d\d)",
    r"(?P<second>\d\d)",
    r"(?P<tz_offset>[+\-zZ])?",  # escape "-" so it is a literal, not a range from "+" to "z"
    r"(?P<tz_hour>\d\d)?",
    r"'?(?P<tz_minute>\d\d)?'?"]))


def transform_date(date_str):
    """
    Convert a pdf date such as "D:20120321183444+07'00'" into a usable datetime
    http://www.verypdf.com/pdfinfoeditor/pdf-date-format.htm
    (D:YYYYMMDDHHmmSSOHH'mm')
    :param date_str: pdf date string
    :return: datetime object, or None if the string does not match
    """
    match = pdf_date_pattern.match(date_str)
    if match:
        date_info = match.groupdict()

        for k, v in date_info.items():  # transform values
            if v is None:
                pass
            elif k == 'tz_offset':
                date_info[k] = v.lower()  # so we can treat Z as z
            else:
                date_info[k] = int(v)

        if date_info['tz_offset'] in ('z', None):  # UTC
            date_info['tzinfo'] = tzutc()
        else:
            multiplier = 1 if date_info['tz_offset'] == '+' else -1
            # a missing hour or minute part of the offset counts as 0
            offset_seconds = (3600 * (date_info['tz_hour'] or 0)
                              + 60 * (date_info['tz_minute'] or 0))
            date_info['tzinfo'] = tzoffset(None, multiplier * offset_seconds)

        for k in ('tz_offset', 'tz_hour', 'tz_minute'):  # no longer needed
            del date_info[k]

        return datetime.datetime(**date_info)
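
For anyone who would rather not depend on dateutil, here is a rough sketch of the same idea using only the standard library, assuming Python 3. The names `parse_pdf_date` and `_PDF_DATE` are made up for this example and are not from the answer above. It also decodes bytes first, on the assumption that some PDF libraries hand back the raw value as bytes rather than str, and it shows one way to turn the result into something readable.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helper names, for illustration only.
_PDF_DATE = re.compile(
    r"(?:D:)?"
    r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})"
    r"(?P<sign>[+\-zZ])?"
    r"(?P<tz_hour>\d{2})?'?(?P<tz_minute>\d{2})?'?"
)


def parse_pdf_date(value):
    """Return a timezone-aware datetime, or None if value does not match."""
    if isinstance(value, bytes):  # assumption: the library may return bytes
        value = value.decode("ascii", errors="replace")
    match = _PDF_DATE.match(value)
    if match is None:
        return None
    parts = match.groupdict()
    sign = parts["sign"]
    if sign in ("+", "-"):
        # a missing hour or minute part of the offset counts as 0
        offset = timedelta(hours=int(parts["tz_hour"] or 0),
                           minutes=int(parts["tz_minute"] or 0))
        tz = timezone(offset if sign == "+" else -offset)
    else:
        # "Z", "z", or no offset at all: treated as UTC, as in the answer above
        tz = timezone.utc
    fields = ("year", "month", "day", "hour", "minute", "second")
    return datetime(*(int(parts[k]) for k in fields), tzinfo=tz)


created = parse_pdf_date("D:20130501200439+01'00'")
print(created.isoformat())  # 2013-05-01T20:04:39+01:00
print(created.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S %Z"))
```

Like the answer's version, this expects the full date down to seconds and returns None for anything shorter. Converting everything to UTC with `astimezone` before storing or comparing is probably the simplest way to deal with files that come from many time zones.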

Regarding python - Convert a PDF's CreationTime to a readable format in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/16503075/
