python - Parsing an unstructured text file with Python


Reposted by: 行者123  Updated: 2023-11-28 16:32:14

I have a text file; some excerpts from it look like this:

Page 1 of 515                   
Closing Report for Company Name LLC                 

222 N 9th Street, #100 & 200, Las Vegas, NV, 89101                  

File number:    Jackie Grant    Status: Fell Thru   Primary closing party:  Seller
Acceptance: 01/01/2001  Closing date:   11/11/2011  Property type:  Commercial Lease
MLS number: Sale price: $200,000    Commission: $1,500.00   
Notes:  08/15/2000 02:30PM by Roger Lodge This property is a Commercial Lease handled by etc..  

Seller: Company Name LLC                    
Company name:   Company Name LLC                
Address:    222 N 9th Street, #100 & 200, Las Vegas, NV, 89101              
Home:   Pager:              
Business:   Fax:                
Mobile: Email:              
Buyer: Tomlinson, Ladainian                 
Address:    222 N 9th Street, #100 & 200, Las Vegas, NV, 89101              
Home:   Pager:              
Business:   555-555-5555    Fax:            
Mobile: Email:              
Lessee Agent: Blank, Arthur                 
Company name:   Sprockets Inc.              
Address:    5001 Old Man Dr, North Las Vegas, NV, 89002             
Home:   (575) 222-3455  Pager:          
Business:   Fax:    999-9990            
Mobile: (702) 600-3492  Email:  sprockets@yoohoo.com        
Leasing Agent: Van Uytnyck, Chameleon                   
Company name:   Company Name LLC                
Address:                    
Home:   Pager:              
Business:   Fax:    909-222-2223            
Mobile: 595-595-5959    Email:          

(should be 2 spaces here.. this is not in normal text file)


Printed on Friday, June 12, 2015                    
Account owner: Roger Goodell                    
Page 2 of 515                   
Report for Adrian (Allday) Peterson                     

242 N 9th Street, #100 & 200                    

File number:    Soap    Status: Closed/Paid Primary closing party:  Buyer
Acceptance: 01/10/2010  Closing date:   01/10/2010  Property type:  RRR
MLS number: Sale price: $299,000    Commission: 33.00%  

Seller: SOS, Bank                   
Address:    242 N 9th Street, #100 & 200                
Home:   Pager:              
Business:   Fax:                
Mobile: Email:              
Buyer: Sabel, Aaron                 
Address:                    
Home:   Pager:              
Business:   Fax:                
Mobile: Email:  sia@yoohoo.com          
Escrow Co: Schneider, Patty                 
Company name:   National Football League                
Address:    242 N 9th Street, #100 & 200                
Home:   Pager:              
Business:   800-2009    Fax:    800-1100        
Mobile: Email:              
Buyers Agent: Munchak, Mike                 
Company name:   Commission Group                
Address:                    
Home:   Pager:              
Business:   Fax:                
Mobile: 483374-3892 Email:  donation@yoohoo.net     
Listing Agent: Ricci, Christina                 
Company name:   Other Guys              
Address:                    
Home:   Pager:              
Business:   Fax:                
Mobile: 888-333-3333    Email:  general.adama@cylon.net

Here is my code:

import re

file = open('file-path.txt','r')

# if there are more than two consecutive blank lines, then we start a new Entry
entries = []
curr = []
prev_blank = False
for line in file:
    line = line.rstrip('\n').strip()
    if (line == ''):
        if prev_blank == True:
            # end of the entry, create append the entry
            if(len(curr) > 0):
                entries.append(curr)
                print curr
                curr = []
                prev_blank = False
        else:
            prev_blank = True
    else:
        prev_blank = False
        # we need to parse the line
        line_list = line.split()
        str = ''
        start = False
        for item in line_list:
            if re.match('[a-zA-Z\s]+:.*',item):
                if len(str) > 0:
                    curr.append(str)
                str = item
                start = True
            elif start == True:
                str = str + ' ' + item

Here is the output:

['number: Jackie Grant', 'Status: Fell Thru Primary closing', 'Acceptance: 01/01/2001 Closing', 'date: 11/11/2011 Property', 'number: Sale', 'price: $200,000', 'Home:', 'Business:', 'Mobile:', 'Home:', 'Business: 555-555-5555', 'Mobile:', 'Home: (575) 222-3455', 'Business:', 'Mobile: (702) 600-3492', 'Home:', 'Business:', 'Mobile: 595-595-5959']

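My problems are as follows:

First, there should be 2 records in the output, but I only get one.
In the topmost block of text, my script can't tell where the previous value ends and the next one begins: 'Status: Fell Thru' should be one value, and 'Primary closing party: Buyer', 'Acceptance: 01/10/2010', 'Closing date: 01/10/2010', 'Property type: RRR', 'MLS number:', 'Sale price: $299,000', 'Commission: 33.00%' should all be captured.
Once that is parsed correctly, I will need to parse again to separate the keys from the values (i.e. 'Closing date': 01/10/2010), ideally into a list of dictionaries.

I can't think of a better approach than using a regex to pick out the keys and then grabbing the text that follows each one.

When I'm done, I'd like a CSV with a header row made up of the keys, which I can import into pandas with read_csv. I've spent quite a few hours on this..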

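Best answer

(This is not a complete answer, but it is too long for a comment.)

Field names can contain spaces (e.g. MLS number)
Several fields can appear on one line (e.g. Home: Pager:)
The Notes field contains a time, which has a : in it

This means you can't identify field names with a regex the way you are trying to. The regex has no way of knowing whether "MLS" is part of the preceding data value or part of the following field name.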
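One way around that ambiguity, if the set of field names is known in advance, is to match only those names instead of guessing with a generic pattern. The sketch below is an illustration rather than part of the original answer: `FIELD_NAMES`, `FIELD_RE` and `split_fields` are hypothetical names, and the list of names is an assumption read off the sample report. A data value that happens to contain a known field name followed by a colon would still be split in the wrong place.

```python
import re

# Assumed field names, read off the sample report; a real file may need more.
FIELD_NAMES = [
    "File number", "Status", "Primary closing party", "Acceptance",
    "Closing date", "Property type", "MLS number", "Sale price",
    "Commission", "Notes",
]

# Try longer names first so a longer name wins over any shorter one it contains.
_alternation = "|".join(
    re.escape(name) for name in sorted(FIELD_NAMES, key=len, reverse=True)
)
# A field name must start the line or follow whitespace, and end with a colon.
FIELD_RE = re.compile(r"(?:^|(?<=\s))(" + _alternation + r"):")


def split_fields(line):
    """Return (name, value) pairs for every known field found in line."""
    matches = list(FIELD_RE.finditer(line))
    pairs = []
    for i, match in enumerate(matches):
        # A value runs up to the next known field name, or to the end of line.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        pairs.append((match.group(1), line[match.end():end].strip()))
    return pairs
```

Because only known names are matched, the `02:30PM` inside the Notes value is left alone, and `MLS number:` with an empty value is kept separate from `Sale price:`.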

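Some of the Home: Pager: lines refer to the seller, while others refer to the buyer, the lessee agent or the leasing agent. This means the naive line-by-line approach I take below doesn't work either.

Here is the code I was working on. It runs against your test data but gives the wrong output for the reasons above. It is included as a reference for the approach I was taking: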

# Multi-word field names are rewritten to single tokens so that
# splitting on whitespace keeps each name in one piece.
replaces = [
    ('Closing Report for', 'Report_for:'),
    ('Report for', 'Report_for:'),
    ('File number', 'File_number'),
    ('Primary closing party', 'Primary_closing_party'),
    ('MLS number', 'MLS_number'),
    ('Sale price', 'Sale_price'),   # matches the capitalisation used in the data
    ('Account owner', 'Account_owner'),
    # ...
    # etc.
]

def fix_linemash(data):
    # Splits many fields on one line into several lines.
    # Any token containing ':' is treated as the start of a new field.

    results = []
    mini_collection = []
    for token in data.split():
        if ':' not in token:
            mini_collection.append(token)
        else:
            results.append(' '.join(mini_collection))
            mini_collection = [token]

    # Flush the final field, which has no following token to trigger it.
    results.append(' '.join(mini_collection))

    return [line for line in results if line]

def process_record(data):
    # Takes the joined text of one record,
    # fixes it up, and builds a record dict.
    record = {}

    for old, new in replaces:
        data = data.replace(old, new)

    for line in fix_linemash(data):
        if ':' not in line:
            continue    # leading text that belongs to no field
        print(line)
        name, value = line.split(':', 1)
        record[name.strip()] = value.strip()

    return record


records = []
collection = []
blank_flag = False

with open('d:/lol.txt') as report:
    for line in report:
        # Read through the file collecting lines and looking for double
        # blank lines; at every pair, process the stored lines and reset.

        line = line.strip()
        if line.startswith('Page '): continue
        if line.startswith('Printed on '): continue

        if not line and blank_flag:      # record finished
            records.append(process_record(' '.join(collection)))
            blank_flag = False
            collection = []

        elif not line:  # maybe end of record?
            blank_flag = True

        else:   # false alarm, record continues
            blank_flag = False
            collection.append(line)

# The last record is not followed by a pair of blank lines.
if collection:
    records.append(process_record(' '.join(collection)))

for record in records:
    print(record)

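I now think it would be better to run some preprocessing clean-up steps over the data first:

Strip out the "Page n of n" and "Printed on ..." lines and anything similar
Identify all the valid field names, then break up the combined lines, so that every line holds exactly one field and the field starts at the beginning of the line.
Run through the Seller/Buyer/Agents blocks and replace the field names with an identifying prefix, e.g. Email: -> Seller Email:.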
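The prefixing step might look roughly like the sketch below. It is not from the original answer: `ROLES`, `CONTACT_FIELDS` and `prefix_contact_fields` are hypothetical names, both tuples are assumptions read off the sample report, and the function assumes the earlier steps have already left one field per line. It should be called with the lines of one record at a time, so a role from one record does not carry over into the next.

```python
# Assumed role headings and contact field names, read off the sample report.
ROLES = ("Seller", "Buyer", "Lessee Agent", "Leasing Agent",
         "Escrow Co", "Buyers Agent", "Listing Agent")
CONTACT_FIELDS = ("Company name", "Address", "Home", "Pager",
                  "Business", "Fax", "Mobile", "Email")


def prefix_contact_fields(lines):
    """Rename e.g. 'Email: x' to 'Seller Email: x' using the last role seen.

    Expects one field per line, for a single record.
    """
    current_role = None
    out = []
    for line in lines:
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name in ROLES:
            # A role heading starts a new contact block.
            current_role = name
            out.append(f"{name}: {value.strip()}".rstrip())
        elif sep and name in CONTACT_FIELDS and current_role:
            out.append(f"{current_role} {name}: {value.strip()}".rstrip())
        else:
            # Not a contact field, or no role seen yet: leave unchanged.
            out.append(line)
    return out
```

After this step every key is unique within a record, so the contact details of the seller and the buyer no longer overwrite each other when the lines are loaded into a dictionary.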

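Then write a record parser, which should be easy: check for two blank lines, split each line at the first colon, and use the part on the left as the field name and the part on the right as the value. Store the result however you like (note that dictionary keys were unordered in Python before version 3.7).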
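A minimal sketch of such a parser, assuming the clean-up has already produced one field per line, could look like this. `parse_records` and `records_to_csv` are hypothetical names that do not appear in the original answer; the CSV step is added only because the question asks for a file with a header row. The header order relies on dictionaries preserving insertion order, which Python guarantees from 3.7 onwards.

```python
import csv
import io


def parse_records(lines):
    """Group lines into dicts; two consecutive blank lines end a record."""
    records, current, blanks = [], {}, 0
    for raw in lines:
        line = raw.strip()
        if not line:
            blanks += 1
            if blanks == 2 and current:
                records.append(current)
                current = {}
            continue
        blanks = 0
        # Split at the first colon only, so times such as 02:30PM stay intact.
        name, sep, value = line.partition(":")
        if sep:  # lines without a colon are ignored in this sketch
            current[name.strip()] = value.strip()
    if current:  # flush the last record if the input lacks trailing blank lines
        records.append(current)
    return records


def records_to_csv(records):
    """Serialise records to CSV text with a header built from all keys seen."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    # restval fills in fields that a particular record does not have.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

The text returned by `records_to_csv` can be written to a file and loaded with pandas `read_csv`, as the question intends.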

This question, "python - Parsing an unstructured text file with Python", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/30768061/
