python - How to parse complex text files using Python?

Reposted. Author: IT老高. Updated: 2023-10-28 20:51:00

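I'm looking for a simple way of parsing complex text files into a pandas DataFrame. Below is a sample file, what I want the result to look like after parsing, and my current method.

Is there any way to make it more concise/faster/more pythonic/more readable?

I've also put this question on Code Review.

I eventually wrote a blog article to explain this to beginners.

Here is a sample file: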

Sample text

A selection of students from Riverdale High and Hogwarts took part in a quiz. This is a record of their scores.

School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7

Grade = 2
Student number, Name
0, Angela
1, Tristan
2, Aurora

Student number, Score
0, 6
1, 3
2, 9

School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna

Student number, Score
0, 8
1, 7

Grade = 2
Student number, Name
0, Harry
1, Hermione

Student number, Score
0, 5
1, 10

Grade = 3
Student number, Name
0, Fred
1, George

Student number, Score
0, 0
1, 0

This is what I want the result to look like after parsing:

                                         Name  Score
School         Grade Student number
Hogwarts       1     0                  Ginny      8
                     1                   Luna      7
               2     0                  Harry      5
                     1               Hermione     10
               3     0                   Fred      0
                     1                 George      0
Riverdale High 1     0                 Phoebe      3
                     1                 Rachel      7
               2     0                 Angela      6
                     1                Tristan      3
                     2                 Aurora      9

This is how I am parsing it at the moment:

import re
import pandas as pd


def parse(filepath):
    """
    Parse text at given filepath

    Parameters
    ----------
    filepath : str
        Filepath for file to be parsed

    Returns
    -------
    data : pd.DataFrame
        Parsed data

    """

    data = []
    with open(filepath, 'r') as file:
        line = file.readline()
        while line:
            reg_match = _RegExLib(line)

            if reg_match.school:
                school = reg_match.school.group(1)

            if reg_match.grade:
                grade = reg_match.grade.group(1)
                grade = int(grade)

            if reg_match.name_score:
                value_type = reg_match.name_score.group(1)
                line = file.readline()
                while line.strip():
                    number, value = line.strip().split(',')
                    value = value.strip()
                    dict_of_data = {
                        'School': school,
                        'Grade': grade,
                        'Student number': number,
                        value_type: value
                    }
                    data.append(dict_of_data)
                    line = file.readline()

            line = file.readline()

    data = pd.DataFrame(data)
    data.set_index(['School', 'Grade', 'Student number'], inplace=True)
    # consolidate df to remove nans
    data = data.groupby(level=data.index.names).first()
    # upgrade Score from float to integer
    data = data.apply(pd.to_numeric, errors='ignore')
    return data


class _RegExLib:
    """Set up regular expressions"""
    # use https://regexper.com to visualise these if required
    _reg_school = re.compile('School = (.*)\n')
    _reg_grade = re.compile('Grade = (.*)\n')
    _reg_name_score = re.compile('(Name|Score)')

    def __init__(self, line):
        # check whether line has a positive match with all of the regular expressions
        self.school = self._reg_school.match(line)
        self.grade = self._reg_grade.match(line)
        self.name_score = self._reg_name_score.search(line)


if __name__ == '__main__':
    filepath = 'sample.txt'
    data = parse(filepath)
    print(data)

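Best answer

Update 2019 (PEG parser):

This answer has received quite some attention, so I felt I should add another possibility, namely a parsing option. Here we can use a PEG parser (e.g. parsimonious) in combination with a NodeVisitor class: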

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import pandas as pd

grammar = Grammar(
    r"""
    schools         = (school_block / ws)+

    school_block    = school_header ws grade_block+
    grade_block     = grade_header ws name_header ws (number_name)+ ws score_header ws (number_score)+ ws?

    school_header   = ~"^School = (.*)"m
    grade_header    = ~"^Grade = (\d+)"m
    name_header     = "Student number, Name"
    score_header    = "Student number, Score"

    number_name     = index comma name ws
    number_score    = index comma score ws

    comma           = ws? "," ws?

    index           = number+
    score           = number+

    number          = ~"\d+"
    name            = ~"[A-Z]\w+"
    ws              = ~"\s*"
    """
)

# `data` is assumed to hold the full text of the sample file as one string
tree = grammar.parse(data)

class SchoolVisitor(NodeVisitor):
    output, names = ([], [])
    current_school, current_grade = None, None

    def _getName(self, idx):
        # look up the name recorded for this student number in the current grade
        for index, name in self.names:
            if index == idx:
                return name

    def generic_visit(self, node, visited_children):
        return node.text or visited_children

    def visit_school_header(self, node, children):
        self.current_school = node.match.group(1)

    def visit_grade_header(self, node, children):
        self.current_grade = node.match.group(1)
        # a new grade starts, so forget the names of the previous one
        self.names = []

    def visit_number_name(self, node, children):
        index, name = None, None
        for child in node.children:
            if child.expr.name == 'name':
                name = child.text
            elif child.expr.name == 'index':
                index = child.text

        self.names.append((index, name))

    def visit_number_score(self, node, children):
        index, score = None, None
        for child in node.children:
            if child.expr.name == 'index':
                index = child.text
            elif child.expr.name == 'score':
                score = child.text

        name = self._getName(index)

        # build the entire entry
        entry = (self.current_school, self.current_grade, index, name, score)
        self.output.append(entry)

sv = SchoolVisitor()
sv.visit(tree)

df = pd.DataFrame.from_records(sv.output, columns = ['School', 'Grade', 'Student number', 'Name', 'Score'])
print(df)

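Regex option (original answer)

Well then, while watching Lord of the Rings for the xth time, I had some time to bridge before the finale:


Broken down, the idea is to split the problem into several smaller problems:

  1. Separate each school
  2. ...each grade
  3. ...students and scores
  4. ...bind them together in a dataframe afterwards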


The school part (see a demo on regex101.com):

^
School\s*=\s*(?P<school_name>.+)
(?P<school_content>[\s\S]+?)
(?=^School|\Z)
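
To make the lazy match and the lookahead concrete, here is a minimal sketch of the school pattern applied to a shortened input. The `text` string and the variable names `schools` and `school_names` are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as above, compiled in multiline and verbose mode.
rx_school = re.compile(r'''
    ^
    School\s*=\s*(?P<school_name>.+)
    (?P<school_content>[\s\S]+?)
    (?=^School|\Z)
''', re.MULTILINE | re.VERBOSE)

# Shortened, made-up input that follows the layout of the sample file.
text = (
    "School = Riverdale High\n"
    "Grade = 1\n"
    "0, Phoebe\n"
    "\n"
    "School = Hogwarts\n"
    "Grade = 1\n"
    "0, Ginny\n"
)

# One match per school: the lazy [\s\S]+? stops right before the next
# line that starts with "School", or at the very end of the string.
schools = [(m.group('school_name'), m.group('school_content'))
           for m in rx_school.finditer(text)]
school_names = [name for name, _ in schools]
print(school_names)
```

Each `school_content` can then be searched on its own, which is what keeps grades from one school out of the next.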


The grade part (another demo on regex101.com):

^
Grade\s*=\s*(?P<grade>.+)
(?P<students>[\s\S]+?)
(?=^Grade|\Z)


The student/score part (last demo on regex101.com):

^
Student\ number,\ Name[\n\r]
(?P<student_names>(?:^\d+.+[\n\r])+)
\s*
^
Student\ number,\ Score[\n\r]
(?P<student_scores>(?:^\d+.+[\n\r])+)
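
The two named groups each capture a multi-line block, so the name rows and score rows still have to be paired up. A minimal sketch of that pairing follows, assuming both blocks list students in the same order. The helper `split_rows` and the `grade_block` sample are hypothetical, and `splitlines()` is used here as one possible alternative to slicing off the trailing newline.

```python
import re

rx_student_score = re.compile(r'''
    ^
    Student\ number,\ Name[\n\r]
    (?P<student_names>(?:^\d+.+[\n\r])+)
    \s*
    ^
    Student\ number,\ Score[\n\r]
    (?P<student_scores>(?:^\d+.+[\n\r])+)
''', re.MULTILINE | re.VERBOSE)

# Made-up content of a single grade, in the layout of the sample file.
grade_block = (
    "Student number, Name\n"
    "0, Phoebe\n"
    "1, Rachel\n"
    "\n"
    "Student number, Score\n"
    "0, 3\n"
    "1, 7\n"
)

def split_rows(block):
    """Turn 'number, value' lines into (number, value) tuples (hypothetical helper)."""
    return [tuple(part.strip() for part in line.split(',', 1))
            for line in block.splitlines() if line.strip()]

match = rx_student_score.search(grade_block)
names = split_rows(match.group('student_names'))
scores = split_rows(match.group('student_scores'))

# Pair the n-th name row with the n-th score row.
students = [(number, name, score)
            for (number, name), (_, score) in zip(names, scores)]
print(students)
```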

The rest is a generator expression which is then fed into the DataFrame constructor (along with the column names).


The code:

import pandas as pd, re

rx_school = re.compile(r'''
    ^
    School\s*=\s*(?P<school_name>.+)
    (?P<school_content>[\s\S]+?)
    (?=^School|\Z)
''', re.MULTILINE | re.VERBOSE)

rx_grade = re.compile(r'''
    ^
    Grade\s*=\s*(?P<grade>.+)
    (?P<students>[\s\S]+?)
    (?=^Grade|\Z)
''', re.MULTILINE | re.VERBOSE)

rx_student_score = re.compile(r'''
    ^
    Student\ number,\ Name[\n\r]
    (?P<student_names>(?:^\d+.+[\n\r])+)
    \s*
    ^
    Student\ number,\ Score[\n\r]
    (?P<student_scores>(?:^\d+.+[\n\r])+)
''', re.MULTILINE | re.VERBOSE)


# `string` is assumed to hold the full text of the sample file
result = ((school.group('school_name'), grade.group('grade'), student_number, name, score)
    for school in rx_school.finditer(string)
    for grade in rx_grade.finditer(school.group('school_content'))
    for student_score in rx_student_score.finditer(grade.group('students'))
    for student in zip(student_score.group('student_names')[:-1].split("\n"), student_score.group('student_scores')[:-1].split("\n"))
    for student_number in [student[0].split(", ")[0]]
    for name in [student[0].split(", ")[1]]
    for score in [student[1].split(", ")[1]]
)

df = pd.DataFrame(result, columns = ['School', 'Grade', 'Student number', 'Name', 'Score'])
print(df)


Condensed:

rx_school = re.compile(r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)', re.MULTILINE)
rx_grade = re.compile(r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)', re.MULTILINE)
rx_student_score = re.compile(r'^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)', re.MULTILINE)
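
As a quick sanity check that the condensed form behaves like the verbose one, the sketch below runs both grade patterns over a small made-up string and compares their named groups. The `sample` text and the variable names are assumptions for illustration only.

```python
import re

# Verbose form: whitespace and line breaks inside the pattern are ignored.
verbose = re.compile(r'''
    ^
    Grade\s*=\s*(?P<grade>.+)
    (?P<students>[\s\S]+?)
    (?=^Grade|\Z)
''', re.MULTILINE | re.VERBOSE)

# Condensed form: the same pattern on one line, without re.VERBOSE.
condensed = re.compile(
    r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)',
    re.MULTILINE)

# Made-up content of one school with two grades.
sample = "Grade = 1\n0, Phoebe\n\nGrade = 2\n0, Angela\n"

verbose_hits = [m.groupdict() for m in verbose.finditer(sample)]
condensed_hits = [m.groupdict() for m in condensed.finditer(sample)]
grades = [hit['grade'] for hit in condensed_hits]
print(verbose_hits == condensed_hits, grades)
```

Note that literal spaces only need escaping (`\ `) in the verbose variant, which is why the condensed student pattern can spell `Student number, Name` as is.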


This yields

            School Grade Student number      Name Score
0   Riverdale High     1              0    Phoebe     3
1   Riverdale High     1              1    Rachel     7
2   Riverdale High     2              0    Angela     6
3   Riverdale High     2              1   Tristan     3
4   Riverdale High     2              2    Aurora     9
5         Hogwarts     1              0     Ginny     8
6         Hogwarts     1              1      Luna     7
7         Hogwarts     2              0     Harry     5
8         Hogwarts     2              1  Hermione    10
9         Hogwarts     3              0      Fred     0
10        Hogwarts     3              1    George     0


As for timing, this is the result of running it ten thousand times:

import timeit
print(timeit.timeit(makedf, number=10**4))
# 11.918397722000009 s
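
`makedf` is not defined in the answer; presumably it wraps the regex pipeline shown earlier. Below is a minimal sketch of such a timing harness under that assumption. The function `make_records` is a hypothetical stand-in that returns plain tuples instead of building a DataFrame and works on a shortened `SAMPLE`, so its timings are not comparable with the figure quoted above.

```python
import re
import timeit

rx_school = re.compile(r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)', re.MULTILINE)
rx_grade = re.compile(r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)', re.MULTILINE)
rx_student_score = re.compile(r'^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)', re.MULTILINE)

# Shortened version of the sample file (assumption: same layout).
SAMPLE = """\
School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7

School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna

Student number, Score
0, 8
1, 7
"""

def make_records(text=SAMPLE):
    """Hypothetical stand-in for makedf: parse `text` into a list of row tuples."""
    records = []
    for school in rx_school.finditer(text):
        for grade in rx_grade.finditer(school.group('school_content')):
            for block in rx_student_score.finditer(grade.group('students')):
                names = block.group('student_names').splitlines()
                scores = block.group('student_scores').splitlines()
                for name_line, score_line in zip(names, scores):
                    number, name = name_line.split(', ')
                    score = score_line.split(', ')[1]
                    records.append((school.group('school_name'),
                                    grade.group('grade'), number, name, score))
    return records

records = make_records()
# timeit calls the function repeatedly and returns the total time in seconds.
elapsed = timeit.timeit(make_records, number=1000)
print(f"{elapsed:.3f} s for 1000 runs")
```

Passing the function object itself to `timeit.timeit` keeps the pattern compilation out of the measured loop, since the three `re.compile` calls run only once at import time.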

Regarding "python - How to parse complex text files using Python?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47982949/
