
Python regex - remove punctuation but keep <unk> as is

Reposted. Author: 行者123. Updated: 2023-12-04 07:46:57


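Please suggest a way to remove punctuation, except the punctuation that is part of <unk> or <UNK>.
For example, from: the asbestos fiber <unk> <unk| is < unusually <unk once it enters the <<unk>$% with
produce: the asbestos fiber <unk> unk is unusually unk once it enters the unk with
I tried the following, but it does not give the expected result.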

import re
import string

text = "the asbestos fiber <unk> <unk| is < unusually <unk once it enters the <<unk>$% with "

replacement = " "
# Attempted pattern: punctuation or whitespace not preceded by "<unk" and not followed by "unk>"
pattern: str = '(?<!<unk)[%s%s]+(?!unk>)' % (re.escape(string.punctuation), r"\s")

re.sub(pattern=pattern, repl=replacement, string=text, flags=re.IGNORECASE).lower().strip()
Result: the asbestos fiber <unk> unk| is unusually unk once it enters the <unk> with

Best answer

You can use the following regex to find the matches and replace each one with a space:

(?:(?!<unk>)[\W_](?<!<unk>))+
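Details:
  • (?: - start of a non-capturing group:
      • (?!<unk>) - negative lookahead: the text at the current position must not start with <unk>, so the < that opens a <unk> token is never consumed
      • [\W_] - any non-word character or an underscore (punctuation, symbols, and whitespace)
      • (?<!<unk>) - negative lookbehind: the character just matched by [\W_] must not be the > that closes a <unk> token
  • )+ - end of the group, repeated one or more times.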
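To see what the two lookarounds actually exclude, it can help to list the matched runs rather than replace them. The following is only a small sketch: the shortened sample string and the names PATTERN, sample, and matched_runs are assumptions made for illustration, not part of the original answer.

```python
import re

# Same pattern as in the answer, compiled case-insensitively so <UNK> is protected too.
PATTERN = re.compile(r'(?:(?!<unk>)[\W_](?<!<unk>))+', flags=re.I)

# Hypothetical, shortened sample containing intact and broken <unk> tokens.
sample = "a <unk> <unk| b <<UNK>$% c"

# Each element is one run of characters that re.sub would replace with a space.
matched_runs = [m.group() for m in PATTERN.finditer(sample)]
print(matched_runs)
# => [' ', ' <', '| ', ' <', '$% ']

cleaned = PATTERN.sub(" ", sample)
print(cleaned)
# => a <unk> unk b <UNK> c
```

The < and > of an intact <unk> or <UNK> never appear in a matched run, while the stray < in "<unk|" and the extra < in "<<UNK>" do.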

Python demo:
    import re

    text = "the asbestos fiber <unk> <unk| is < unusually <unk once it enters the <<unk>$% with "
    replacement = " "
    # Match runs of non-word characters or underscores, skipping intact <unk> tokens
    pattern: str = r'(?:(?!<unk>)[\W_](?<!<unk>))+'
    print(re.sub(pattern, replacement, text, flags=re.I))
    # => the asbestos fiber <unk> unk is unusually unk once it enters the <unk> with
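The question's attempt also lowercased and stripped the result, which the demo above leaves out (its output keeps a trailing space). A minimal sketch that combines both steps might look like this; the function name strip_punctuation_keep_unk is hypothetical and chosen only for this example.

```python
import re

# Pattern from the answer: runs of non-word characters or underscores,
# excluding the < and > of an intact <unk> token (case-insensitive).
UNK_SAFE_PUNCT = re.compile(r'(?:(?!<unk>)[\W_](?<!<unk>))+', flags=re.I)


def strip_punctuation_keep_unk(text: str) -> str:
    """Replace punctuation/whitespace runs with one space, keep <unk>, then lowercase and trim."""
    return UNK_SAFE_PUNCT.sub(" ", text).lower().strip()


text = "the asbestos fiber <unk> <unk| is < unusually <unk once it enters the <<unk>$% with "
result = strip_punctuation_keep_unk(text)
print(result)
# => the asbestos fiber <unk> unk is unusually unk once it enters the <unk> with
```

Because of .lower(), an uppercase <UNK> in the input comes out as <unk>; drop that call if the original casing should be preserved.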

For more on "Python regex - remove punctuation but keep <unk> as is", see the similar question on Stack Overflow: https://stackoverflow.com/questions/67161529/
