
Python: parsing structured text into Excel

Reposted. Author: 行者123. Updated: 2023-12-01 05:36:28

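I need to convert a large number of files in a structured text format to Excel (CSV is fine) so that I can merge them with some other data I have. Here is a sample of the text: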

   FILER:

COMPANY DATA:
COMPANY CONFORMED NAME: NORTHQUEST CAPITAL FUND INC
CENTRAL INDEX KEY: 0001142728
IRS NUMBER: 223772454
STATE OF INCORPORATION: NJ
FISCAL YEAR END: 1231

FILING VALUES:
FORM TYPE: NSAR-A
SEC ACT: 1940 Act
SEC FILE NUMBER: 811-10419
FILM NUMBER: 03805344

BUSINESS ADDRESS:
STREET 1: 16 RIMWOOD LANE
CITY: COLTS NECK
STATE: NJ
ZIP: 07722
BUSINESS PHONE: 7328423504

FORMER COMPANY:
FORMER CONFORMED NAME: NORTHPOINT CAPITAL FUND INC
DATE OF NAME CHANGE: 20010615
</SEC-HEADER>
<DOCUMENT>
<TYPE>NSAR-A
<SEQUENCE>1
<FILENAME>answer.fil
<DESCRIPTION>ANSWER.FIL
<TEXT>
<PAGE> PAGE 1
000 A000000 06/30/2003
000 C000000 0001142728
000 D000000 N
000 E000000 NF
000 F000000 Y
000 G000000 N
000 H000000 N
000 I000000 6.1
000 J000000 A
001 A000000 NORTHQUEST CAPITAL FUND, INC.
001 B000000 811-10493
001 C000000 7328921057
002 A000000 16 RIMWOOD LANE
002 B000000 COLTS NECK
002 C000000 NJ
002 D010000 07722
003 000000 N
004 000000 N
005 000000 N
006 000000 N
007 A000000 N
007 B000000 0
007 C010100 1
007 C010200 2
007 C010300 3
007 C010400 4
007 C010500 5
007 C010600 6
007 C010700 7
007 C010800 8
007 C010900 9
007 C011000 10
008 A000001 EMERALD RESEARCH CORP.
008 B000001 A
008 C000001 801-60455
008 D010001 BRICK
008 D020001 NJ
008 D030001 08724
013 A000001 SANVILLE & COMPANY
013 B010001 ABINGTON
013 B020001 PA
013 B030001 19001
015 A000001 FLEET BANK
015 B000001 C
015 C010001 POINT PLEASANT BEACH
015 C020001 NJ
015 C030001 08742
015 E030001 X
018 000000 Y
019 A000000 N
019 B000000 0
<PAGE> PAGE 2
020 A000001 SCHWAB
020 B000001 94-1737782
020 C000001 0
020 A000002 BESTVEST BROOKERAGE
020 B000002 23-1452837
020 C000002 0

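This continues through page 8 with the same structure. The company information should go into its own named columns. For the remaining lines, the first two values together should form the column name and the third value should be the row value.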

I tried to solve this with pyparsing but did not succeed. Any comments on the approach would be helpful.

Best answer

From the way you describe it, the lines are effectively key: value pairs for each file. I would handle the parsing part like this:

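import sys
import re
import csv

# "KEY: value" header lines; the key is everything before the first colon
colon_separated = re.compile(r'([^:]+?)\s*:\s*(.+)')
# "NNN XXXXXXX value" answer lines; sub-keys such as "000000" have only 6 characters
fixed_fields = re.compile(r'(\d{3} +\w{6,7}) +(.*)')

# Try the fixed-width pattern first so a colon inside a value is not misread
matchers = [fixed_fields, colon_separated]

with open('out.csv', 'w', newline='') as out:
    outfile = csv.writer(out)
    outfile.writerow(['Filename', 'Key', 'Value'])
    for filename in sys.argv[1:]:
        with open(filename) as infile:
            for line in infile:
                line = line.strip()
                for matcher in matchers:
                    match = matcher.match(line)
                    if match:
                        outfile.writerow([filename] + list(match.groups()))
                        break  # write each line at most once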

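You can save this as parser.py and call it with python parser.py *.infile, or whatever your file naming convention is. It creates a CSV file with three columns: filename, key, and value. You can open that file in Excel and then use a pivot table to get the values into the right layout.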
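To see what the two kinds of pattern capture, here is a minimal sketch that runs them over a few lines taken from the sample in the question. The helper name `split_line` is hypothetical, and the patterns are an assumed, slightly tightened variant of the answer's idea (the key stops at the first colon, and a 6-character sub-key is accepted), so read it as an illustration and not as the answer's exact code.

```python
import re

# Assumed patterns adapted from the answer's approach, not the author's exact regexes.
COLON = re.compile(r'([^:]+?)\s*:\s*(.+)')
FIXED = re.compile(r'(\d{3} +\w{6,7}) +(.*)')

def split_line(line):
    """Return (key, value) for a recognised line, or None otherwise."""
    line = line.strip()
    # Try the fixed-width answer format first, so a colon inside a value
    # is not mistaken for a header separator.
    for pattern in (FIXED, COLON):
        match = pattern.match(line)
        if match:
            return match.group(1), match.group(2)
    return None

samples = [
    "COMPANY CONFORMED NAME: NORTHQUEST CAPITAL FUND INC",
    "FILING VALUES:",                       # section heading, no value
    "001 A000000 NORTHQUEST CAPITAL FUND, INC.",
    "003 000000 N",                         # 6-character sub-key
    "<PAGE> PAGE 1",                        # page marker, ignored
]
for sample in samples:
    print(repr(sample), "->", split_line(sample))
```

Section headings such as `FILING VALUES:` and markers such as `<PAGE> PAGE 1` yield `None`, so they never reach the CSV.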

Alternatively, you can use this script to turn out.csv into one row per file:

import csv

headers = []    # column names, in order of first appearance
rows = {}       # filename -> {key: value}
filenames = []  # filenames, in order of first appearance

with open('out.csv', newline='') as src, open('flat.csv', 'w', newline='') as dst:
    infile = csv.reader(src)
    outfile = csv.writer(dst)
    next(infile)  # skip the Filename/Key/Value header row

    for filename, key, value in infile:
        if filename not in rows:
            rows[filename] = {}
            filenames.append(filename)
        if key not in headers:
            headers.append(key)
        rows[filename][key] = value

    outfile.writerow(headers)
    for filename in filenames:
        outfile.writerow([rows[filename].get(header, '') for header in headers])
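The same pivot can be sketched with `csv.DictWriter`, which fills missing cells automatically. The function name `flatten`, the leading `Filename` column, and the sample triples below are assumptions added for illustration and are not part of the answer.

```python
import csv
import io

def flatten(triples, out):
    """Write one row per file and one column per distinct key to `out`."""
    headers = ["Filename"]  # extra column (an assumption) so rows stay identifiable
    rows = {}               # filename -> row dict; dicts preserve insertion order
    for filename, key, value in triples:
        row = rows.setdefault(filename, {"Filename": filename})
        if key not in headers:
            headers.append(key)
        row[key] = value    # a repeated key keeps its last value
    # restval supplies the cell content for keys a given file never had
    writer = csv.DictWriter(out, fieldnames=headers, restval="")
    writer.writeheader()
    writer.writerows(rows.values())

# Made-up sample input in the (filename, key, value) shape of out.csv
triples = [
    ("a.txt", "FORM TYPE", "NSAR-A"),
    ("a.txt", "001 B000000", "811-10493"),
    ("b.txt", "FORM TYPE", "NSAR-B"),
]
buffer = io.StringIO()
flatten(triples, buffer)
print(buffer.getvalue())
```

Keeping the filename as the first column makes it easier to trace each row back to its source file when merging with other data.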

This post on parsing structured text into Excel with Python is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/18927396/
