gpt4 book ai didi

python - 在 Python 中解析非结构化文本

转载 作者:太空狗 更新时间:2023-10-30 01:39:59 24 4
gpt4 key购买 nike

我想解析一个包含非结构化文本的文本文件。我需要获取地址、出生日期、姓名、性别和 ID。

. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13

在上面的示例中,前 3 行是地址,只有“F”的那一行是性别,DOB 是“F”之后的行,DOB 之后的名字,名字之后的 ID,和没有。 ID下的12是索引/记录号。

但是,格式不一致。在第二组中,地址是 4 行而不是 3 行,索引/记录号。附加在姓名之后(如果此人没有 ID 字段)。

我想将文本重写为以下格式:

name, ID, address, sex, DOB

最佳答案

这是对 pyparsing 解决方案的第一次尝试 (easy-to-copy code at the pyparsing pastebin)。根据交错的评论浏览各个部分。

data = """\
. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13
"""

from pyparsing import LineEnd, oneOf, Word, nums, Combine, restOfLine, \
alphanums, Suppress, empty, originalTextFor, OneOrMore, alphas, \
Group, ZeroOrMore

NL = LineEnd().suppress()
gender = oneOf("M F")
integer = Word(nums)
date = Combine(integer + '/' + integer + '/' + integer)

# define the simple line definitions
gender_line = gender("sex") + NL
dob_line = date("DOB") + NL
name_line = restOfLine("name") + NL
id_line = Word(alphanums+"-")("ID") + NL
recnum_line = integer("recnum") + NL

# define forms of address lines
first_addr_line = Suppress('.') + empty + restOfLine + NL
# a subsequent address line is any line that is not a gender definition
subsq_addr_line = ~(gender_line) + restOfLine + NL

# a line with a name and a recnum combined, if there is no ID
name_recnum_line = originalTextFor(OneOrMore(Word(alphas+',')))("name") + \
integer("recnum") + NL

# defining the form of an overall record, either with or without an ID
record = Group((first_addr_line + ZeroOrMore(subsq_addr_line))("address") +
gender_line +
dob_line +
((name_line +
id_line +
recnum_line) |
name_recnum_line))

# parse data
records = OneOrMore(record).parseString(data)

# output the desired results (note that address is actually a list of lines)
for rec in records:
if rec.ID:
print "%(name)s, %(ID)s, %(address)s, %(sex)s, %(DOB)s" % rec
else:
print "%(name)s, , %(address)s, %(sex)s, %(DOB)s" % rec
print

# how to access the individual fields of the parsed record
for rec in records:
print rec.dump()
print rec.name, 'is', rec.sex
print

打印:

ALOMO, TERESITA CABALLES, 3412-00000-A1652TCA2, ['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS'], F, 01/16/1952
AMURAO, CALIXTO MANALO, , ['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS'], M, 10/14/1967

['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS', 'F', '01/16/1952', 'ALOMO, TERESITA CABALLES', '3412-00000-A1652TCA2', '12']
- DOB: 01/16/1952
- ID: 3412-00000-A1652TCA2
- address: ['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS']
- name: ALOMO, TERESITA CABALLES
- recnum: 12
- sex: F
ALOMO, TERESITA CABALLES is F

['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS', 'M', '10/14/1967', 'AMURAO, CALIXTO MANALO', '13']
- DOB: 10/14/1967
- address: ['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS']
- name: AMURAO, CALIXTO MANALO
- recnum: 13
- sex: M
AMURAO, CALIXTO MANALO is M

关于python - 在 Python 中解析非结构化文本,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/1419653/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com