
python - Tag all English words in a text file with @s:eng, only on lines starting with *CHI:

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:02:57

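I am trying to write a Python script that tags all English words, only on lines starting with *CHI:, by appending "@s:eng" to the end of each word, but the code does not seem to work. Currently, the code looks like this: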

import re

with open("transcript 0623.cha", encoding='utf8') as f:
    text = f.read()

new_text = re.sub("A-Za-z", "A-Za-z@s:eng", text)
with open("transcript 0623_out.cha", "w", encoding='utf8') as result:
    result.write(new_text)

Can you suggest how I can improve the code?

Sample content of transcript 0623 is as follows:

@Begin
@Languages: zho , eng
@Participants: TEA Teacher , CHI Child
@ID: zho,|change_me_later|TEA|||||Teacher|||
@ID: zho,|change_me_later|CHI|||||Child|||
@Transcriber: CKX
@Activities: Storytelling
@Comment: child used the malay word sayang
*TEA: ok , 来 , 开始 .
*CHI: 呃 , the boy@s .
*TEA: 嗯 .
*CHI: have a frog@s .
*TEA: ok .
*TEA: ok do you know what is boy in chinese ?
*TEA: can you help me tell the story in chinese ?
*TEA: ok then do you know what is a frog in chinese ?
*TEA: ok , come .
*TEA: go to the next page .
*CHI: when the boy sleeping , then the frog come out@s .
*TEA: ok .
*TEA: 还有 吗 ?
*CHI: the cat also sleeping@s .
*TEA: ok .
*TEA: do you know what is cat in chinese ?
*TEA: 嗯 , what is it ?
*CHI: 猫 .
*TEA: ok .
*TEA: so can you use your chinese for cat to help me tell the story ?
*TEA: 嗯 ?
*CHI: 猫 睡觉 .
*TEA: 啊 , 很 好 .
*TEA: 还有 吗 ?
*CHI: frog come out@s .
*TEA: ok .
*TEA: 很 好 .
*TEA: 还有 吗 ?
*CHI: next one@s .
*TEA: ok .
*CHI: the boy wake up@s .
*CHI: and , the frog is gone@s .
*TEA: 嗯 .
*CHI: then , maybe , the frog went out the window@s .
*TEA: 嗯 , ok .
*CHI: the boy is looking for the frog@s .
*TEA: 嗯 .
*CHI: the cat is looking for the frog@s .
*TEA: ok what is cat in chinese again ?
*CHI: what@s ?
*TEA: what is cat in chinese again ?
*CHI: 猫 .
*TEA: 嗯 .
*TEA: ok can you use the chinese word for cat to tell me the story again ?
*TEA: 嗯 ?
*CHI: 猫 looking for the@s .
*TEA: 啊 .
*CHI: for the@s .
*TEA: 嗯 .
*CHI: frog@s .
*TEA: ok .
*TEA: very good .
*TEA: anything else ?
*TEA: ok .
*CHI: the@s 猫 go in@s .
*CHI: and put the bottle in here@s .
*TEA: 嗯 .
*CHI: the boy has do this@s .
*TEA: 嗯 .
*CHI: the cat fall down@s .
*TEA: ok what is cat in chinese again ?
*CHI: 猫 fall down@s .
*TEA: 嗯 .
*CHI: and get the bottle@s .
*CHI: get the bottle@s .
*TEA: ok .
*TEA: very good .
*TEA: ok anything else ?
*TEA: anything else ?
*TEA: ok .
*CHI: the boy go and sayang the cat@s .
*TEA: 嗯 .
*TEA: what is cat in chinese ?
*CHI: the , the boy go and sayang the@s 猫 .
*TEA: 啊 , ok .
*TEA: very good .
*CHI: and then the bottle break@s .
*TEA: ok .
*TEA: very good .
*TEA: anything else ?
*TEA: come .
*TEA: ok this whole thing is together .
*CHI: the boy is calling for the frog@s .
*TEA: 嗯 .
*CHI: the cat is looking underneath the table@s .
*TEA: ok what is cat in chinese again ?
*CHI: the@s 猫 looking for the frog underneath@s .
*TEA: 嗯 , ok .
*CHI: they looking inside the hole if the frog is here@s .
*TEA: 嗯 .
*TEA: anything else ?
*CHI: then the boy is here@s .
*TEA: 啊 , ok very good .
*TEA: anything else ?
*CHI: the boy fall down into the water@s .
*CHI: and the cat also@s .
*CHI: and then the log break@s .
*TEA: 嗯 .
*TEA: do you know what is water in chinese ?
*TEA: what is it ?
*CHI: 水 .
*TEA: ok can you tell me the story again with the word , with the , with
the chinese word for water ?
*TEA: 嗯 ?
*CHI: the boy fall down@s .
*CHI: and the@s 猫 too@s .
*CHI: and both of them fall in the@s 水 .
*TEA: ok , very good .
*CHI: and then they all get wet@s .
*TEA: 嗯 .
*TEA: ok .
*CHI: they found some water on the log@s .
*TEA: 嗯 .
*CHI: they found so many frogs@s .
*TEA: 嗯 .
*CHI: and is this the frog that they have@s ?
*TEA: 嗯 .
*TEA: ok .
*CHI: then they say bye bye .
*TEA: 嗯 .
*TEA: you know how to say bye bye in chinese ?
*CHI: 再见 .
*TEA: ok .
*TEA: can you repeat this part again in chinese ?
*CHI: and then the boy and the cat and the frog
say@s 再见 .
*TEA: ok what is cat in chinese again ?
*CHI: 猫 .
*TEA: 啊 .
*TEA: can you repeat the whole thing ?
*CHI: the boy and the@s 猫 and the , and the frog@s .
*TEA: 嗯 .
*CHI: say@s 再见 .
*TEA: ok .
*TEA: very good .
*TEA: thank you for telling me the story ok ?
@End

Best answer

Your regular expression is incorrect:

new_text = re.sub("A-Za-z", "A-Za-z@s:eng", text)

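The search pattern is looking for the literal sequence "uppercase A, hyphen, uppercase Z, lowercase a, hyphen, lowercase z". If you only want to check lines that start with "*CHI:", then "*CHI:" should be part of your search pattern.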
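To make the difference concrete, here is a small check that is not part of the original answer. The sample strings and the name chi_line are made up for illustration. It compares the literal pattern "A-Za-z" with the character class "[A-Za-z]+" and shows an escaped pattern for the *CHI: prefix.

```python
import re

sample = "*CHI: the frog@s ."

# Without brackets, the pattern is the literal six-character text "A-Za-z"
print(re.findall("A-Za-z", sample))           # []
print(re.findall("A-Za-z", "range A-Za-z"))   # ['A-Za-z']

# With brackets it is a character class; + means one or more letters
print(re.findall("[A-Za-z]+", sample))        # ['CHI', 'the', 'frog', 's']

# The leading * is escaped so it is read as a literal asterisk
chi_line = re.compile(r"\*CHI:")
print(bool(chi_line.match(sample)))           # True
print(bool(chi_line.match("*TEA: ok .")))     # False
```

Note that the character class also finds "CHI" and the "s" of "@s", which is why the speaker prefix and existing markers need separate handling.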

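The replacement is literal as well: every match is simply replaced with the text "A-Za-z@s:eng". You need to capture the parts of the text you want to keep, reuse them, and append "@s:eng" to the end of each word.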
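As a small illustration of capturing and reusing text, the sketch below works on a single utterance taken from the sample, without the speaker prefix. It is not the full solution: on its own it would tag every line of the file, not only the *CHI: lines.

```python
import re

line = "frog come out@s ."

# ([A-Za-z]+) captures the word; (?:@s)? also consumes an existing "@s" marker.
# \1 in the replacement re-inserts the captured word before "@s:eng".
tagged = re.sub(r"([A-Za-z]+)(?:@s)?", r"\1@s:eng", line)
print(tagged)  # frog@s:eng come@s:eng out@s:eng .
```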

You can use the following:

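import re

i_path = "transcript 0623.cha"
o_path = "transcript 0623_out.cha"
mark_pattern = re.compile("\\*CHI:.*")
word_pattern = re.compile("([A-Za-z]+)")

with open(i_path, encoding='utf8') as i_file, open(o_path, "w", encoding='utf8') as o_file:
    for line in i_file:
        # Split into possible words
        parts = line.split()

        # Blank lines and lines not starting with *CHI: are copied unchanged
        if not parts or mark_pattern.match(parts[0]) is None:
            o_file.write(line)
            continue

        # Got a CHI line
        new_line = line
        for word in parts[1:]:
            matches = word_pattern.match(word)
            if matches:
                # Match the word only as a whole whitespace-delimited token,
                # so an already tagged "the@s:eng" is not tagged a second time
                old = rf"(?<!\S){re.escape(word)}(?!\S)"
                new = f"{matches.group(1)}@s:eng"
                new_line = re.sub(old, new, new_line, count=1)
        o_file.write(new_line)

Explanation:

  • mark_pattern = re.compile("\\*CHI:.*")
    • A pattern that matches lines starting with "*CHI:". The leading * must be escaped because * is a special character in regular expressions.
    • The re documentation says that "using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program."
  • word_pattern = re.compile("([A-Za-z]+)")
    • A pattern that matches a word. Use [] to indicate a set of characters, then + to match 1 or more repetitions of the preceding pattern.
  • for line in i_file
    • Processing the file line by line is easier (and more memory-efficient). You can easily debug the regex search and replace for each line. It may be possible to do it all at once with read()/readlines(), but here I prefer readability.
  • parts = line.split()
    • To find the words, split the line into candidate words.
  • .match(..)
    • First, check whether the first word (parts[0]) matches the "*CHI:" pattern. If it does not, or if the line is blank and has no parts at all, write the line to the output file as is. If it does, continue processing word by word.
    • For each candidate word, check whether it matches the word pattern. If so, use re.sub to replace that word in the line with word@s:eng. Repeat this match-then-replace for every word, accumulating the replacements in new_line. Because new is built from matches.group(1) only, an existing @s in the original line is replaced (the one in "frog@s" becomes "frog@s:eng").
    • The old pattern matches the word only as a whole whitespace-delimited token, and count=1 replaces one occurrence at a time. A word that appears twice on a line is therefore tagged once per occurrence, and a token that is already tagged is not matched again.
    • I used f-strings for old and new. If you are not on Python 3.6+, you can use regular string concatenation/formatting instead.

Results:

I: *CHI: have a frog@s .
O: *CHI: have@s:eng a@s:eng frog@s:eng .

I: *CHI: the cat is looking underneath the table@s .
O: *CHI: the@s:eng cat@s:eng is@s:eng looking@s:eng underneath@s:eng the@s:eng table@s:eng .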

(punctuation is ignored)
I: *CHI: what@s ?
O: *CHI: what@s:eng ?

(non-English words in the line are ignored)
I: *CHI: 猫 fall down@s .
O: *CHI: 猫 fall@s:eng down@s:eng .

(lines that do not start with *CHI: are left unchanged)
I: *TEA: ok this whole thing is together .
O: *TEA: ok this whole thing is together .
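The same idea can also be packaged as a function that is easy to test on individual lines. The sketch below is an alternative, not the answer's code: the names TOKEN and tag_chi_line are hypothetical, and it assumes that words are separated by whitespace, as they are in the sample transcript.

```python
import re

# A whole whitespace-delimited token: letters, then an optional "@s" or "@s:eng"
TOKEN = re.compile(r"(?<!\S)([A-Za-z]+)(?:@s(?::eng)?)?(?!\S)")


def tag_chi_line(line: str) -> str:
    """Tag English words with @s:eng if the line starts with *CHI:."""
    if not line.startswith("*CHI:"):
        return line
    # Keep the 5-character speaker prefix out of the substitution
    prefix, rest = line[:5], line[5:]
    return prefix + TOKEN.sub(r"\1@s:eng", rest)


print(tag_chi_line("*CHI: the@s 猫 go in@s ."))
# *CHI: the@s:eng 猫 go@s:eng in@s:eng .
```

Because the pattern also accepts an existing "@s:eng" suffix, running the function twice on a line gives the same result as running it once. Tokens that contain other characters, such as apostrophes or digits, are left untagged. It also leaves wrapped continuation lines untouched, such as the "say@s 再见 ." line in the sample, since they do not start with *CHI:.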

Regarding "python - Tag all English words in a text file with @s:eng, only on lines starting with *CHI:", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59123951/
