Python line.split between two delimiters

Reposted. Author: 行者123. Updated: 2023-11-28 21:51:18

I have a text file containing the following data:

Schema:
Column Name Localized Name Type MaxLength
---------------------------- ---------------------------- ------ ---------
Raw Binary Binary 16384

Row 1:
Binary:
-----BEGIN-----
fdsfdsfdasadsad
fsdfafsdafsadfa
fsdafadsfadsfdsa
-----END-----


Row 2:
Binary:
-----BEGIN-----
fsdfdssd
fdsfadsfasd
fsdafdsa
-----END-----


Row 3:
Binary:
-----BEGIN-----
fsdafadsds
fsdafasdsda
fdsafadssad
-----END-----

I need to extract the data between the "-----BEGIN-----" and "-----END-----" delimiters into an array.

This is what I have tried:

data = open("test_data.txt", 'r')
result = [line.split('-----BEGIN-----') for line in data.readlines()]
print data

However, this obviously picks up everything after the "-----BEGIN-----" delimiter.

How can I add the end delimiter?

Note that the file is very large, roughly 1 GB.

Best answer

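Since the data between -----BEGIN----- and -----END----- spans several lines and you want it split into sections, just catch each block that starts with -----BEGIN----- and keep appending lines until you reach -----END-----: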

with open("file.txt") as f:
    out = []
    for line in f:
        # Start a new section at each BEGIN marker.
        if line.rstrip() == "-----BEGIN-----":
            tmp = []
            # The inner loop continues on the same file iterator.
            for line in f:
                if line.rstrip() == "-----END-----":
                    out.append(tmp)
                    break
                tmp.append(line)

The sections end up as sublists:

 [['fdsfdsfdasadsad\n', 'fsdfafsdafsadfa\n', 'fsdafadsfadsfdsa\n'],   ['fsdfdssd\n', 'fdsfadsfasd\n', 'fsdafdsa \n'], ['fsdafadsds\n', 'fsdafasdsda\n', 'fdsafadssad\n']]

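Open the file using with, and do not call readlines unless you actually want a list; you can iterate over the file object as above without holding the whole file in memory.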
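To illustrate that point for a file of around 1 GB, the same loop can be written as a generator that hands back one section at a time instead of building the full list. This is only a sketch: the name iter_blocks and its parameters are made up for illustration, and the sample list stands in for an open file object.

```python
def iter_blocks(lines, begin="-----BEGIN-----", end="-----END-----"):
    """Yield one list of stripped lines per begin/end section (hypothetical helper)."""
    block = None  # None means we are currently outside a section
    for line in lines:
        stripped = line.rstrip()
        if stripped == begin:
            block = []
        elif stripped == end and block is not None:
            yield block
            block = None
        elif block is not None:
            block.append(stripped)


# Any iterable of lines works, including an open file object.
sample = [
    "Row 1:\n", "Binary:\n", "-----BEGIN-----\n", "abc\n", "def\n", "-----END-----\n",
    "\n", "Row 2:\n", "-----BEGIN-----\n", "ghi\n", "-----END-----\n",
]
blocks = list(iter_blocks(sample))
print(blocks)
```

With a real file you would loop over iter_blocks(f) inside the with block and handle each section as it arrives, so only one section is held in memory at a time.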

Alternatively, use itertools.takewhile to get the sections:

from itertools import takewhile

with open("file.txt") as f:
    f = map(str.rstrip, f)  # lazy on Python 3; on Python 2 use itertools.imap
    out = [list(takewhile(lambda x: x != "-----END-----", f))
           for line in f if line == "-----BEGIN-----"]
print(out)

[['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa'],
['fsdfdssd', 'fdsfadsfasd', 'fsdafdsa'],
['fsdafadsds', 'fsdafasdsda', 'fdsafadssad']]

If you want a single flat list of all the lines, you can chain the sections:

from itertools import chain, takewhile

with open("file.txt") as f:
    f = map(str.rstrip, f)  # lazy on Python 3; on Python 2 use itertools.imap
    out = chain.from_iterable(takewhile(lambda x: x != "-----END-----", f)
                              for line in f if line == "-----BEGIN-----")
    # out is lazy, so consume it while the file is still open.
    print(list(out))

['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa',
'fsdfdssd', 'fdsfadsfasd', 'fsdafdsa', 'fsdafadsds', 'fsdafasdsda', 'fdsafadssad']

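A file object is its own iterator, so lines are consumed every time we iterate over it or pass it to takewhile. takewhile keeps pulling lines until it hits -----END-----, and the outer loop then carries on until it meets the next -----BEGIN----- line. If the marker lines always start with - and no other lines do, you can test just that instead of comparing the whole line, i.e. if line[0] == "-" in the filter and x[0] != "-" in the takewhile predicate. Note that line[0] raises IndexError on an empty line, so line.startswith("-") is the safer test when the file contains blank lines.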
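The consumption behaviour described above can be seen on a small in-memory iterator. This is a minimal sketch in which a list iterator stands in for the file object:

```python
from itertools import takewhile

lines = iter(["a", "-----BEGIN-----", "b", "c", "-----END-----", "d"])

# Advance the shared iterator until the BEGIN marker has been consumed.
for line in lines:
    if line == "-----BEGIN-----":
        break

# takewhile pulls from the same iterator. It also consumes the END marker,
# which is the first item to fail the predicate and is then discarded.
section = list(takewhile(lambda x: x != "-----END-----", lines))

# Whatever follows the END marker is still waiting in the iterator.
remaining = list(lines)

print(section)
print(remaining)
```

Because the END marker is swallowed by takewhile, the outer loop never sees it and simply resumes with the line after it.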

If you want to process each section, you can use a generator expression and work through the lines of each section:

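from itertools import takewhile

with open("file.txt") as f:
    f = map(str.rstrip, f)  # lazy on Python 3; on Python 2 use itertools.imap
    out = (takewhile(lambda x: x != "-----END-----", f)
           for line in f if line == "-----BEGIN-----")
    # Each section is lazy, so consume it while the file is still open.
    for sec in out:
        print(list(sec))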

['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa']
['fsdfdssd', 'fdsfadsfasd', 'fsdafdsa']
['fsdafadsds', 'fsdafasdsda', 'fdsafadssad']
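One caveat with lazy sections: they all share a single underlying iterator, so each one has to be consumed before the outer generator advances. The sketch below uses io.StringIO in place of a real file, and the helper name sections is made up for illustration.

```python
import io
from itertools import takewhile

text = (
    "-----BEGIN-----\nab\ncd\n-----END-----\n"
    "-----BEGIN-----\nef\n-----END-----\n"
)
st, end = "-----BEGIN-----", "-----END-----"


def sections(f):
    """Return a generator of lazy sections (hypothetical helper)."""
    lines = map(str.rstrip, f)
    return (takewhile(lambda x: x != end, lines) for line in lines if line == st)


# Consuming each section as soon as it is produced works as expected.
sizes = [sum(len(x) for x in sec) for sec in sections(io.StringIO(text))]

# Materialising the outer generator first exhausts the shared iterator,
# so every section comes back empty.
late = [list(sec) for sec in list(sections(io.StringIO(text)))]

print(sizes)
print(late)
```

If the sections need to outlive the loop, convert each one with list(sec) as it is produced, as the earlier list-comprehension version does.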

If you want a single string, call join:

from itertools import chain, takewhile

with open("file.txt") as f:
    f = map(str.rstrip, f)  # lazy on Python 3; on Python 2 use itertools.imap
    st, end = "-----BEGIN-----", "-----END-----"
    out = "".join(chain.from_iterable(takewhile(lambda x: x != end, f)
                                      for line in f if line == st))
print(out)

Output:

fdsfdsfdasadsadfsdfafsdafsadfafsdafadsfadsfdsafsdfdssdfdsfadsfasdfsdafdsafsdafadsdsfsdafasdsdafdsafadssad

To get a single string that keeps the -----BEGIN----- and -----END----- markers:

from itertools import takewhile

with open("out.txt") as f:
    f = map(str.rstrip, f)  # lazy on Python 3; on Python 2 use itertools.imap
    st, end = "-----BEGIN-----", "-----END-----"
    out = "".join(["{}{}{}".format(st, "".join(takewhile(lambda x: x != end, f)), end)
                   for line in f if line == st])
print(out)

Output:

-----BEGIN-----fdsfdsfdasadsadfsdfafsdafsadfafsdafadsfadsfdsa-----END----------BEGIN-----fsdfdssdfdsfadsfasdfsdafdsa-----END----------BEGIN-----fsdafadsdsfsdafasdsdafdsafadssad-----END-----

This question and answer about Python line.split between two delimiters come from a similar question on Stack Overflow: https://stackoverflow.com/questions/30545497/
