
Python - Reading data from a file with variable attributes and line lengths

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:20:12

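I'm trying to find the best way to parse a file in Python and create a list of namedtuples, each representing one data entity and its attributes. The data looks like this: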

UI: T020  
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality

UI: T145
RL: exhibits
ABR: EX
RIN: exhibited_by
RTN: R3.3.2
DEF: Shows or demonstrates.
HL: {isa} performs
STL: [Animal|Behavior]; [Group|Behavior]

UI: etc...

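While several attributes are shared (e.g., UI), some are not (e.g., STY). However, I can hard-code an exhaustive list of the necessary ones.
Since each grouping is separated by a blank line, I used split so that I can process each block of data individually: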

chunks = file.read().split("\n\n")  # renamed from `input` to avoid shadowing the built-in
for chunk in chunks:
    process(chunk)

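I've seen some approaches that use string find/slicing, itertools.groupby, and even regular expressions. I was thinking of using a regex like '[A-Z]*:' to find where the headers occur, but I'm not sure how to pull out multiple lines until another header is reached (such as the multi-line data following DEF in the first example entity).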
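The regex idea described above can be sketched with re.split: when the pattern contains a capturing group, the captured headers are kept in the result, so everything between two headers (including continuation lines) becomes one value. This is only a sketch under assumptions: the helper name parse_chunk is hypothetical, and headers are assumed to be 2 or 3 uppercase letters at the start of a line, as in the sample data.

```python
import re

# Assumption: a header is 2-3 uppercase letters at the start of a line,
# followed by a colon. This is inferred from the sample data only.
HEADER = re.compile(r'^([A-Z]{2,3}):[ \t]*', re.MULTILINE)

def parse_chunk(chunk):
    """Turn one blank-line-separated block into a dict (hypothetical helper)."""
    # With a capturing group, re.split returns
    # [text_before_first_header, key1, value1, key2, value2, ...].
    parts = HEADER.split(chunk.strip())
    keys = parts[1::2]
    values = parts[2::2]
    # Collapse newlines and repeated whitespace inside each value.
    return {key: ' '.join(value.split()) for key, value in zip(keys, values)}
```

A continuation line that happens to start with something like "DNA: " would be misread as a header, so this only fits data where that cannot occur.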

I'd appreciate any suggestions.

Best answer

I assume that when a string spans multiple lines, you want to replace the newlines with a single space (and remove any extra whitespace).

import re

def process_file(filename):
    reg = re.compile(r'([\w]{2,3}):\s')  # Matches line header
    tmp = ''  # Stored/cached data for multiline string
    key = None  # Current key
    data = {}

    with open(filename, 'r') as f:
        for row in f:
            row = row.rstrip()
            match = reg.match(row)

            # Matches header or is end, store the cached string in the dict:
            if (match or not row) and key:
                data[key] = tmp
                key = None
                tmp = ''

            # Empty row, next dataset
            if not row:
                # Prevent empty returns
                if data:
                    yield data
                    data = {}

                continue

            # We do have header
            if match:
                key = str(match.group(1))
                tmp = row[len(match.group(0)):]
                continue

            # No header, just append string -> here goes assumption that you want to
            # remove newlines, trailing spaces and replace them with one single space
            tmp += ' ' + row

    # Missed row?
    if key:
        data[key] = tmp

    # Missed group?
    if data:
        yield data

On each iteration, the process_file generator yields a dict of key/value pairs such as UI: T020 (each dict always contains at least one item).

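Because it is a generator that reads the file line by line, it should work well on large files: the whole file is never loaded into memory at once.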
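In the same streaming spirit, the blank-line grouping on its own can be sketched with itertools.groupby, which the question mentions. This is a minimal sketch, not part of the original answer; the name iter_chunks is hypothetical, and it accepts any iterable of lines (for example an open file object).

```python
from itertools import groupby

def iter_chunks(lines):
    """Lazily yield lists of consecutive non-blank lines (hypothetical helper)."""
    # groupby batches consecutive lines that share the same key;
    # here the key is True for blank lines and False otherwise.
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield [line.rstrip('\n') for line in group]
```

Runs of several blank lines collapse into a single separator, so no empty chunks are produced.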

Here is a small demo of process_file:

for data in process_file('data.txt'):
    print('-'*20)
    for i in data:
        print('%s:'%(i), data[i])

    print()

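And the actual output (the key order below comes from the unordered dicts of older Python versions; on Python 3.7+ the keys appear in file order):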

--------------------
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
STY: Acquired Abnormality
HL: {isa} Anatomical Abnormality
UI: T020
ABR: acab

--------------------
DEF: Shows or demonstrates.
STL: [Animal|Behavior]; [Group|Behavior]
RL: exhibits
HL: {isa} performs
RTN: R3.3.2
UI: T145
RIN: exhibited_by
ABR: EX
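To get from these dicts to the namedtuples the question asks for, one option is a namedtuple whose fields all default to None, so attributes missing from a record are simply left unset. This is a sketch under assumptions: the names FIELDS, Entity and to_entities are hypothetical, and the field list covers only the keys visible in the two sample records, so a real file may need more. The defaults argument requires Python 3.7 or later.

```python
from collections import namedtuple

# Assumption: the field list is taken from the two sample records only.
FIELDS = ['UI', 'STY', 'RL', 'ABR', 'STN', 'RTN', 'RIN', 'DEF', 'HL', 'STL']

# Every field defaults to None, so records may omit any attribute.
Entity = namedtuple('Entity', FIELDS, defaults=[None] * len(FIELDS))

def to_entities(records):
    """Convert an iterable of dicts into a list of Entity tuples."""
    # Entity(**record) raises TypeError if a record has a key not in FIELDS.
    return [Entity(**record) for record in records]
```

Passing the generator directly, as in to_entities(process_file('data.txt')), would build the full list, so memory use then grows with the number of records.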

Regarding "Python - Reading data from a file with variable attributes and line lengths", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16179020/
