
python - How do I extract specific sections from a text file?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:39:12

I have a text file in a format similar to the following (steps.txt):

This is the first line of the file.
here we tell you to make a tea.

step 1

Pour more than enough water for a cup of tea into a regular pot, and bring it to a boil.

step
2

This will prevent the steeping water from dropping in temperature as soon as it is poured in.

step 3


When using tea bags, the measuring has already been done for you - generally it's one tea bag per cup.


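I am trying to collect the steps in a dictionary, e.g. steps_dic['step 1'] = 'Pour more than enough water for a cup of tea into a regular pot, and bring it to a boil.' and so on. **Sometimes the step number is on the next line.** I read the file and wrote a wrapper around the iterator in Python so that I can parse the lines and check hasnext().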

    def step_check(line,prev):
        if line:
            self.reg1 = re.match(r'^step(\d|\s\d)',line)
            if self.reg1:
                self._reg1 = self.reg1.group()
                # print("in reg1: {} ".format(self._reg1))
        if line and prev:
            self.only_step = re.match(r'^step$',prev)
            if self.only_step:
                self._only_step = self.only_step.group()
                # print("int only step : {} ".format(self._only_step))
            self.only_digit = re.match(r'\d', line)
            if self.only_digit:
                self._only_digit = self.only_digit.group()
                # print("in only digit: {} ".format(self._only_digit))

        if self._reg1:
            self.step = self._reg1
            # print("Returning.. {} ".format(self.step))
            return self.step
        if self._only_step:
            if self._only_digit:
                # print("Only Step : {} ".format(self._only_step))
                # print ("Only Digit: {} ".format(self._only_digit))
                self.step =self._only_step+" "+self._only_digit
                # print("Returning.. {} ".format(self.step))
                return self.step
        else:
            # print("Returning.. {} ".format(self.step))
            return self.step

    with open(file_name, 'r', encoding='utf-8') as f:
        self.steps_dict = dict()
        self.lines = hn_wrapper(f.readlines())  # wrapper code not included
        self.prev,self.line = None,self.lines.next()
        self.first_line = self.line
        self.prev, self.line = self.line, self.lines.next()
        try:
            while(self.lines.hasnext()):
                self.prev,self.line = self.line,self.lines.next()

                print (self.line)
                self.step_name = self.step_check(self.line,self.prev)
                if self.step_name:
                    self.steps_dict[self.step_name]=''
                    self.prev, self.line = self.line, self.lines.next()
                    while(not self.step_check(self.line,self.prev)):
                        self.steps_dict[self.step_name] = self.steps_dict[self.step_name]+ self.line + "\n"
                        self.prev,self.line = self.line,self.lines.next()

I only get step_dic['step 1'] = ... and step_dic['step 3'] = ..., but step 2 is missed. I also need to extract step_dic['step 2']. I can't work out how to look ahead in the text buffer.

Best answer

You can read the whole file into memory and then run

re.findall(r'^step\s*(\d+)\s*(.*?)\s*(?=^step\s*\d|\Z)', text, re.DOTALL | re.MULTILINE)

See the regex demo.

Details

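  • ^ - start of a line (because of re.MULTILINE)
  • step - the literal word step
  • \s* - zero or more whitespace characters (line breaks included)
  • (\d+) - Group 1: one or more digits
  • \s* - zero or more whitespace characters
  • (.*?) - Group 2: any zero or more characters, as few as possible (re.DOTALL lets . match line breaks)
  • \s* - zero or more whitespace characters
  • (?=^step\s*\d|\Z) - a positive lookahead: immediately to the right there must be either
    • ^step\s*\d - the start of a line, step, zero or more whitespace characters, and a digit
    • | - or
    • \Z - the end of the whole string.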
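The breakdown above can also be written into the pattern itself. The following is a minimal sketch, assuming the same three flags plus re.VERBOSE, which makes the regex engine ignore unescaped whitespace and # comments inside the pattern. The names STEP_PATTERN and sample are illustrative and not part of the original answer.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE
# so that each piece can carry its own comment.
STEP_PATTERN = re.compile(
    r"""
    ^step            # the word step at the start of a line (re.MULTILINE)
    \s*              # optional whitespace, possibly a line break
    (\d+)            # group 1: the step number
    \s*
    (.*?)            # group 2: the step text, as short as possible (re.DOTALL)
    \s*
    (?=              # lookahead: stop right before
        ^step\s*\d   #   the next step header
        | \Z         #   or the end of the string
    )
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE,
)

# Shortened sample; the second header has its number on the next line.
sample = "intro\n\nstep 1\n\nBoil water.\n\nstep \n2\n\nWarm the pot.\n"
print(STEP_PATTERN.findall(sample))
```

Because the lookahead does not consume the next header, each match ends just before the following step, so a header split across two lines is still found as the start of the next match.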

Quick Python demo:

import re
text = "This is the first line of the file.\nhere we tell you to make a tea.\n\nstep 1\n\nPour more than enough water for a cup of tea into a regular pot, and bring it to a boil.\n\nstep \n2\n\nThis will prevent the steeping water from dropping in temperature as soon as it is poured in.\n\nstep 3 \n\n\nWhen using tea bags, the measuring has already been done for you - generally it's one tea bag per cup."
results = re.findall(r'^step\s*(\d+)\s*(.*?)\s*(?=^step\s*\d|\Z)', text, re.DOTALL | re.MULTILINE)
print(dict([("step{}".format(x),y) for x,y in results]))

Output:

{'step2': 'This will prevent the steeping water from dropping in temperature as soon as it is poured in.', 'step1': 'Pour more than enough water for a cup of tea into a regular pot, and bring it to a boil.', 'step3': "When using tea bags, the measuring has already been done for you - generally it's one tea bag per cup."}
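The demo uses an inline string and keys without a space. To get keys in the 'step 1' form the question asks for, and to read from a file, something like the sketch below may work. The helper names parse_steps and parse_steps_file are hypothetical, and UTF-8 is assumed because the question opens the file with that encoding.

```python
import re

# The pattern from the answer, compiled once.
STEP_RE = re.compile(
    r'^step\s*(\d+)\s*(.*?)\s*(?=^step\s*\d|\Z)',
    re.DOTALL | re.MULTILINE,
)


def parse_steps(text):
    """Return {'step N': body} for every step section found in text."""
    return {"step {}".format(num): body for num, body in STEP_RE.findall(text)}


def parse_steps_file(path, encoding="utf-8"):
    """Read the whole file into memory, then parse it (hypothetical helper)."""
    with open(path, "r", encoding=encoding) as f:
        return parse_steps(f.read())


# Shortened stand-in for the file contents; step 2 has its number on the next line.
demo = (
    "intro\n\nstep 1\n\nBoil water.\n\n"
    "step \n2\n\nWarm the pot.\n\n"
    "step 3 \n\n\nAdd one tea bag per cup."
)
steps_dic = parse_steps(demo)
print(steps_dic["step 2"])
```

Calling parse_steps_file with the path of the question's text file should return the same kind of dictionary, provided the file fits in memory.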

Regarding "python - How do I extract specific sections from a text file?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54752247/
