
regex - Parsing malformed bracketed text in Perl

Reposted. Author: 行者123. Last updated: 2023-12-04 06:07:25

I have a string of text split into phrases, with each phrase enclosed in square brackets:

[pX textX/labelX] [pY textY/labelY] [pZ textZ/labelZ] [textA/labelA]

Sometimes a chunk does not start with a p tag (like the last one above).

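My problem is that I need to capture every chunk. That is fine in the normal case, but sometimes the input is malformed: some chunks may have only one bracket, or none at all. So it might look like this: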
 [pX textX/labelX] pY textY/labelY] textZ/labelZ

But it should look like this:
 [pX textX/labelX] [pY textY/labelY] [textZ/labelZ]

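The problem does not involve nested brackets. I am new to regex, and after digging through many different people's regex solutions more deeply than I ever have before, downloading cheat sheets, and getting a regex tool (Expresso), I still don't know how to do this. Any ideas? Maybe a regex won't work. But then how is this problem solved? I imagine it is not a very unusual problem.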

Edit

Here is a concrete example:
$data= "[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC";

Here is a great, compact solution from @FailedDev:
    while ($data =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
        # matched text = $&
    }

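But I think two points need to be added to highlight the problem:
  • Some chunks have no brackets at all.
  • ,/PUNC w#hm/CC_PRP_MP3] is really two separate chunks that need to be split apart.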

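However, since this case is fixed (a punctuation token followed by a text/label pattern that has only a closing square bracket), I hard-coded it into the solution, as follows: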
    my @stuff;
    while ($data =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
        my $chunk = $&;    # copy the match before running another regex
        # match a one-character PUNC token (e.g. ",/PUNC") followed by a "phrase]"
        if ($chunk =~ m{^(\S/PUNC) (.*\])}) {
            push @stuff, $1;    # the PUNC token before the space
            push @stuff, $2;    # the chunk after the space
        }
        else {
            push @stuff, $chunk;
        }
    }
    print "$_\n" for @stuff;

Trying it on the example I added in the edit, this works well except for one problem: the final ./PUNC is dropped, so the output is:
    [VP sysmH/VBD_MS3]
    [PP ll#/IN_DET Axryn/NNS_MP]
    ,/PUNC
    w#hm/CC_PRP_MP3]
    [NP AEDA'/NN]
    ,/PUNC
    [PP b#/IN m/NN_FS]
    [NP >HyAnA/NN]

How can I keep the last chunk?

Best answer

You can use this:

    /(?:\[[^[]*?]|[^[ ].*?]|\[[^[ ]*)/

Assuming your string looks like this:
    [pX textX/labelX] pY textY/labelY]  pY textY/labelY]  pY textY/labelY]  [pY textY/labelY] [3940-823490-2 [30-94823049 [32904823498]

It does not work for input such as: pY [[[textY/labelY]

Perl-specific solution:
    while ($subject =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    # matched text = $&
    }
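
To see what this loop captures, here is a minimal sketch that collects the matches into an array. The helper name `find_chunks` and the sample input are illustrative assumptions, not part of the original answer. The input has one well-formed chunk, one chunk missing its `[`, and one missing its `]`.

```perl
use strict;
use warnings;

# Pattern from the answer above, compiled once so it can be reused.
my $chunk_re = qr/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/;

# Hypothetical helper: return every chunk the pattern finds in a string.
sub find_chunks {
    my ($subject) = @_;
    # The pattern has no capture groups, so /g in list context
    # returns the full text of each match.
    my @chunks = $subject =~ /$chunk_re/g;
    return @chunks;
}

# Illustrative input: well-formed, missing "[", missing "]".
my @found = find_chunks('[pX textX/labelX] pY textY/labelY] [textZ/labelZ');
print "$_\n" for @found;
# Prints:
#   [pX textX/labelX]
#   pY textY/labelY]
#   [textZ/labelZ
```

A chunk with no brackets at all (for example a trailing `textZ/labelZ`) is not matched by any of the three alternatives, which is the gap the updates below try to close.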

Update:
    /(?:\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s+[^[]+?(?:\s+|$))/

This works for your updated string, but you should trim whitespace from the results if needed.
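
The trimming step could look like the sketch below. The helper name `trimmed_chunks` is an assumption for illustration, and the pattern is the updated one with `]` escaped as in the Perl-specific version. The sample string is single-quoted here (via `q{}`) because in a double-quoted Perl string `$Arkp` would be interpolated as a variable, which is a likely explanation for the `m/NN_FS` seen in the question's output.

```perl
use strict;
use warnings;

# Sample from the question; q{} keeps "$Arkp" as literal text.
my $data = q{[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC};

# Hypothetical helper: apply the updated pattern, then trim each match.
sub trimmed_chunks {
    my ($subject) = @_;
    my @chunks = $subject =~ /(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*|\s+[^[]+?(?:\s+|$))/g;
    # The last alternative captures surrounding whitespace; strip it.
    s/^\s+|\s+$//g for @chunks;
    return @chunks;
}

print "$_\n" for trimmed_chunks($data);
# Prints:
#   [VP sysmH/VBD_MS3]
#   [PP ll#/IN_DET Axryn/NNS_MP]
#   ,/PUNC
#   w#hm/CC_PRP_MP3]
#   [NP AEDA'/NN]
#   ,/PUNC
#   [PP b#/IN m$Arkp/NN_FS]
#   [NP >HyAnA/NN]
#   ./PUNC
```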

Update 2:
    /(\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s*[^[]+?(?:\s+|$))/

I suggest opening a separate question, because your original question is completely different from the latest one.
    "
    (               # Match the regular expression below and capture its match into backreference number 1
                    # Match either the regular expression below (attempting the next alternative only if this one fails)
        \[          # Match the character “[” literally
        [^[]        # Match any character that is NOT a “[”
            *?      # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
        ]           # Match the character “]” literally
      |             # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
        [^[ ]       # Match a single character NOT present in the list “[ ”
        .           # Match any single character that is not a line break character
            *?      # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
        ]           # Match the character “]” literally
      |             # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
        \[          # Match the character “[” literally
        [^[ ]       # Match a single character NOT present in the list “[ ”
            *       # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      |             # Or match regular expression number 4 below (the entire group fails if this one fails to match)
        \s          # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
            *       # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
        [^[]        # Match any character that is NOT a “[”
            +?      # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
        (?:         # Match the regular expression below
                    # Match either the regular expression below (attempting the next alternative only if this one fails)
            \s      # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
                +   # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
          |         # Or match regular expression number 2 below (the entire group fails if this one fails to match)
            $       # Assert position at the end of the string (or before the line break at the end of the string, if any)
        )
    )
    "

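Since the question also asks how this might be solved if a single regex is not enough, here is a token-based sketch that rebuilds well-formed chunks. It is an alternative technique, not the method from the answer above. The helper name `rebracket` is hypothetical, and the sketch assumes that word tokens always have the form text/label, that phrase tags (pX, VP, ...) never contain a slash, and that brackets are never nested.

```perl
use strict;
use warnings;

# Hypothetical helper: regroup whitespace-separated tokens into
# well-formed "[...]" chunks.
sub rebracket {
    my ($text) = @_;
    my (@chunks, @current);
    my $in_chunk = 0;

    # Close the chunk being built, if any, and start a fresh one.
    my $flush = sub {
        push @chunks, '[' . join(' ', @current) . ']' if @current;
        @current = ();
    };

    for my $token (split ' ', $text) {
        my $opens  = ($token =~ s/^\[//);   # token carried an opening bracket
        my $closes = ($token =~ s/\]$//);   # token carried a closing bracket
        my $is_tag = ($token !~ m{/});      # no slash: assumed to be a phrase tag

        # An opening bracket or a phrase tag starts a new chunk.
        if ($opens || $is_tag) {
            $flush->();
            $in_chunk = 1;
        }
        push @current, $token if length $token;

        # A closing bracket ends the chunk; a word token outside any
        # chunk is treated as a chunk of its own.
        if ($closes || !$in_chunk) {
            $flush->();
            $in_chunk = 0;
        }
    }
    $flush->();    # keep a trailing chunk that was never closed
    return @chunks;
}

print join(' ', rebracket('[pX textX/labelX] pY textY/labelY] textZ/labelZ')), "\n";
# Prints: [pX textX/labelX] [pY textY/labelY] [textZ/labelZ]
```

One limitation of this sketch: a chunk that has an opening bracket but no closing one absorbs any bracket-less word tokens that follow it, because nothing in the input marks where it ends.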
Regarding regex - parsing malformed bracketed text in Perl, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8188075/
