
python - Parsing an unstructured text file with Python

Reposted. Author: 行者123. Updated: 2023-11-28 16:32:14

I have a text file, some fragments of which look like this:

Page 1 of 515                   
Closing Report for Company Name LLC

222 N 9th Street, #100 & 200, Las Vegas, NV, 89101

File number: Jackie Grant Status: Fell Thru Primary closing party: Seller
Acceptance: 01/01/2001 Closing date: 11/11/2011 Property type: Commercial Lease
MLS number: Sale price: $200,000 Commission: $1,500.00
Notes: 08/15/2000 02:30PM by Roger Lodge This property is a Commercial Lease handled by etc..

Seller: Company Name LLC
Company name: Company Name LLC
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Tomlinson, Ladainian
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: 555-555-5555 Fax:
Mobile: Email:
Lessee Agent: Blank, Arthur
Company name: Sprockets Inc.
Address: 5001 Old Man Dr, North Las Vegas, NV, 89002
Home: (575) 222-3455 Pager:
Business: Fax: 999-9990
Mobile: (702) 600-3492 Email: sprockets@yoohoo.com
Leasing Agent: Van Uytnyck, Chameleon
Company name: Company Name LLC
Address:
Home: Pager:
Business: Fax: 909-222-2223
Mobile: 595-595-5959 Email:

(should be 2 spaces here.. this is not in normal text file)


Printed on Friday, June 12, 2015
Account owner: Roger Goodell
Page 2 of 515
Report for Adrian (Allday) Peterson

242 N 9th Street, #100 & 200

File number: Soap Status: Closed/Paid Primary closing party: Buyer
Acceptance: 01/10/2010 Closing date: 01/10/2010 Property type: RRR
MLS number: Sale price: $299,000 Commission: 33.00%

Seller: SOS, Bank
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Sabel, Aaron
Address:
Home: Pager:
Business: Fax:
Mobile: Email: sia@yoohoo.com
Escrow Co: Schneider, Patty
Company name: National Football League
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: 800-2009 Fax: 800-1100
Mobile: Email:
Buyers Agent: Munchak, Mike
Company name: Commission Group
Address:
Home: Pager:
Business: Fax:
Mobile: 483374-3892 Email: donation@yoohoo.net
Listing Agent: Ricci, Christina
Company name: Other Guys
Address:
Home: Pager:
Business: Fax:
Mobile: 888-333-3333 Email: general.adama@cylon.net

Here is my code:

import re

file = open('file-path.txt','r')

# if there are more than two consecutive blank lines, then we start a new Entry
entries = []
curr = []
prev_blank = False
for line in file:
    line = line.rstrip('\n').strip()
    if (line == ''):
        if prev_blank == True:
            # end of the entry, create append the entry
            if(len(curr) > 0):
                entries.append(curr)
                print(curr)
                curr = []
                prev_blank = False
        else:
            prev_blank = True
    else:
        prev_blank = False
        # we need to parse the line
        line_list = line.split()
        str = ''
        start = False
        for item in line_list:
            if re.match(r'[a-zA-Z\s]+:.*',item):
                if len(str) > 0:
                    curr.append(str)
                str = item
                start = True
            elif start == True:
                str = str + ' ' + item

Here is the output:

['number: Jackie Grant', 'Status: Fell Thru Primary closing', 'Acceptance: 01/01/2001 Closing', 'date: 11/11/2011 Property', 'number: Sale', 'price: $200,000', 'Home:', 'Business:', 'Mobile:', 'Home:', 'Business: 555-555-5555', 'Mobile:', 'Home: (575) 222-3455', 'Business:', 'Mobile: (702) 600-3492', 'Home:', 'Business:', 'Mobile: 595-595-5959']

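My problems are as follows:

  1. First, there should be 2 records in the output, but I only get one.
  2. In the top block of text, my script cannot tell where the previous value ends and the new one begins: 'Status: Fell Thru' should be a single value, and 'Primary closing party: Buyer', 'Acceptance: 01/10/2010', 'Closing date: 01/10/2010', 'Property type: RRR', 'MLS number:', 'Sale price: $299,000', 'Commission: 33.00%' should all be captured.
  3. Once it parses correctly, I will need to parse again to separate the keys from the values (i.e. "Closing date": 01/10/2010), ideally into a list of dictionaries.

I can't think of a better approach than using a regex to pick out the keys and then grabbing the piece of text that follows.

Once that is done, I would like a CSV with a header row populated by the keys, which I can import into pandas with read_csv. I have spent several hours on this.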

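Best answer

(This is not a complete answer, but it is too long for a comment.)

  • Field names can contain spaces (e.g. MLS number)
  • Several fields can appear on one line (e.g. Home: Pager:)
  • The Notes field contains a time, which itself contains a :

This means you cannot take your approach of identifying field names with a regex. It has no way of knowing whether "MLS" is part of the previous data value or part of the following field name.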
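To make the ambiguity concrete, here is a minimal sketch that runs a naive "letters and spaces, then a colon" pattern over one line of the sample. The pattern and the helper name guess_field_names are illustrative assumptions, not the asker's exact code:

```python
import re

# Naive rule: any run of letters and spaces followed by a colon is a field name.
NAIVE_NAME = re.compile(r'[A-Za-z ]+:')

def guess_field_names(line):
    """Return what the naive pattern believes are field names on this line."""
    return [match.strip() for match in NAIVE_NAME.findall(line)]

line = 'File number: Jackie Grant Status: Fell Thru Primary closing party: Seller'
print(guess_field_names(line))
# ['File number:', 'Jackie Grant Status:', 'Fell Thru Primary closing party:']
```

The values "Jackie Grant" and "Fell Thru" are swallowed into the following field names, because nothing in the text itself marks where a value stops.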

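Some of the Home: Pager: lines refer to the Seller, some to the Buyer, the Lessee Agent or the Leasing Agent. This means the naive line-by-line approach I take below does not work either.

Here is the code I was working on. It runs against your test data but gives the wrong output for the reasons above. It is included for reference, to show the approach I was taking: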

# (old, new) pairs that turn multi-word field names into single tokens
replaces = [
    ('Closing Report for', 'Report_for:'),
    ('Report for', 'Report_for:'),
    ('File number', 'File_number'),
    ('Primary closing party', 'Primary_closing_party'),
    ('MLS number', 'MLS_number'),
    ('Sale price', 'Sale_price'),  # lower-case "price", as in the data
    ('Account owner', 'Account_owner'),
    # ...
    # etc.
]

def fix_linemash(data):
    # splits many fields on one line into several lines

    results = []
    mini_collection = []
    for token in data.split(' '):
        if ':' not in token:
            mini_collection.append(token)
        else:
            results.append(' '.join(mini_collection))
            mini_collection = [token]
    # keep the final field as well, otherwise it is silently dropped
    results.append(' '.join(mini_collection))

    return [line for line in results if line]

def process_record(data):
    # takes a collection of lines
    # fixes them, and builds a record dict
    record = {}

    for old, new in replaces:
        data = data.replace(old, new)

    for line in fix_linemash(data):
        print(line)
        if ':' not in line:
            # leading text with no field name; nothing to split on
            continue
        name, value = line.split(':', 1)
        record[name.strip()] = value.strip()

    return record


records = []
collection = []
blank_flag = False

with open('d:/lol.txt') as f:
    for line in f:
        # Read through the file collecting lines and
        # looking for double blank lines
        # every pair of blank lines, process the stored ones and reset

        line = line.strip()
        if line.startswith('Page '): continue
        if line.startswith('Printed on '): continue

        if not line and blank_flag:  # record finished
            records.append(process_record(' '.join(collection)))
            blank_flag = False
            collection = []

        elif not line:  # maybe end of record?
            blank_flag = True

        else:  # false alarm, record continues
            blank_flag = False
            collection.append(line)

# the file may not end with two blank lines; keep the last record too
if collection:
    records.append(process_record(' '.join(collection)))

for record in records:
    print(record)

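I now think it would be better to run some preprocessing tidy-up steps over the data first:

  1. Strip out the "Page n of n" and "Printed on ..." lines and anything similar.
  2. Identify all valid field names, then split up the combined lines so that every line holds exactly one field and each field starts at the beginning of a line.
  3. Run through the Seller/Buyer/Agents blocks and replace the field names with identifying prefixes, e.g. Email: -> Seller Email:.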
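The three tidy-up steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the FIELD_NAMES and PARTY_NAMES lists are copied by hand from the sample report and would need extending for the real file, the helper names split_fields and tidy are made up for this sketch, and it assumes that contact blocks contain no blank lines and that no value (for example in Notes) contains a known field name followed by a colon.

```python
import re

# Assumption: field names copied by hand from the sample report.
FIELD_NAMES = [
    'File number', 'Status', 'Primary closing party', 'Acceptance',
    'Closing date', 'Property type', 'MLS number', 'Sale price',
    'Commission', 'Notes', 'Company name', 'Address', 'Home', 'Pager',
    'Business', 'Fax', 'Mobile', 'Email', 'Account owner',
]
# Assumption: headings that open a contact block in the sample report.
PARTY_NAMES = [
    'Seller', 'Buyer', 'Lessee Agent', 'Leasing Agent', 'Escrow Co',
    'Buyers Agent', 'Listing Agent',
]

def _alternation(names):
    # Longest names first so a longer name is never shadowed by a shorter one.
    ordered = sorted(names, key=len, reverse=True)
    return '|'.join(re.escape(name) for name in ordered)

FIELD_RE = re.compile(r'\b(?:%s):' % _alternation(FIELD_NAMES + PARTY_NAMES))
PARTY_RE = re.compile(r'(%s):' % _alternation(PARTY_NAMES))
SKIP_RE = re.compile(r'Page \d+ of \d+|Printed on ')

def split_fields(line):
    """Step 2: break one physical line into one 'Name: value' piece per field."""
    starts = [m.start() for m in FIELD_RE.finditer(line)]
    if not starts:
        return [line]  # e.g. the report title or a bare address line
    ends = starts[1:] + [len(line)]
    pieces = [line[a:b].strip() for a, b in zip(starts, ends)]
    if starts[0] > 0:  # keep any text that precedes the first field name
        pieces.insert(0, line[:starts[0]].strip())
    return pieces

def tidy(lines):
    """Yield cleaned lines: page furniture removed, one field per line,
    and contact fields prefixed with the party they belong to."""
    party = None
    for raw in lines:
        line = raw.strip()
        if SKIP_RE.match(line):  # step 1: drop "Page n of n" / "Printed on ..."
            continue
        if not line:
            party = None  # assumption: a blank line ends any contact block
            yield ''
            continue
        for piece in split_fields(line):
            found = PARTY_RE.match(piece)
            if found:  # step 3: remember which party we are inside
                party = found.group(1)
                yield piece
            elif party and FIELD_RE.match(piece):
                yield party + ' ' + piece
            else:
                yield piece

print(list(tidy(['Seller: SOS, Bank', 'Mobile: Email: sia@yoohoo.com'])))
# ['Seller: SOS, Bank', 'Seller Mobile:', 'Seller Email: sia@yoohoo.com']
```

Matching against an explicit list of names is what removes the ambiguity: "Fell Thru Primary closing party:" is split correctly only because "Primary closing party" is known in advance to be a field name.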

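Then write a record parser, which should be easy: check for two blank lines, split each line at the first colon, use the left-hand part as the field name and the right-hand part as the value. Store it however you like (note that dictionaries did not preserve key order before Python 3.7).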
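A minimal sketch of such a record parser, assuming the input has already been tidied so that each line holds one "Name: value" field. The function names parse_records and write_csv are hypothetical, and the CSV step is included only because the question asks for a header row of keys:

```python
import csv
import io

def parse_records(lines):
    """Group tidied lines into dicts; two blank lines in a row end a record."""
    records, current, blanks = [], {}, 0
    for line in lines:
        line = line.strip()
        if not line:
            blanks += 1
            if blanks >= 2 and current:
                records.append(current)
                current = {}
            continue
        blanks = 0
        if ':' not in line:
            continue  # free text such as a title; ignored in this sketch
        # Split at the FIRST colon only, so "Notes: ... 02:30PM" stays intact.
        name, value = line.split(':', 1)
        current[name.strip()] = value.strip()
    if current:  # the input may not end with two blank lines
        records.append(current)
    return records

def write_csv(records, out):
    """Write the records with a header row that is the union of all keys."""
    header = []
    for record in records:
        for key in record:
            if key not in header:
                header.append(key)
    writer = csv.DictWriter(out, fieldnames=header, restval='')
    writer.writeheader()
    writer.writerows(records)

tidied = [
    'File number: Soap', 'Status: Closed/Paid', '', '',
    'File number: Jackie Grant', 'Notes: 08/15/2000 02:30PM by Roger Lodge',
]
buffer = io.StringIO()
write_csv(parse_records(tidied), buffer)
print(buffer.getvalue())
```

Writing to a real file instead of io.StringIO gives something that pandas.read_csv should be able to load directly. Fields missing from a record are left empty through restval.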

About python - Parsing an unstructured text file with Python: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30768061/
