php - Regular expression to match suspicious words in a string

Reposted. Author: 搜寻专家. Updated: 2023-10-31 20:34:58

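I am developing a "word filter" class in PHP which, among other things, needs to catch deliberately misspelled words. These words are entered by the user as part of a sentence. Here is a simple example of a sentence a user might enter: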

I want a coke, sex, drugs and rock'n'roll

The example above is a common phrase, written correctly. My class will find the suspicious words sex and drugs, and everything will be fine.

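But I expect users will try to evade detection by writing the words differently. In fact, there are many ways to write the same word so that certain readers can still understand it. For example, the word sex can be written as s3x or 5ex or 53x or s e x or s 3 x or s33x or 5533xxx or ss 33 xxx, and so on.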

I know the basics of regular expressions and tried the following pattern:

/(\b[\w][\w .'-]+[\w]\b)/g

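because

  • \b word boundary
  • [\w] the word can start with a letter or a digit...
  • [\w .'-] ...followed by any letter, digit, space, dot, quote or dash...
  • + ...one or more times...
  • [\w] ...and ends with a letter or a digit.
  • \b word boundary

It partially works.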

If the example phrase is written as I want a coke, 5 3 x, druuu95 and r0ck'n'r011 I get 3 matches:

  • I want a coke
  • 5 3 x
  • druuu95 and r0ck'n'r011
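
The three matches can be reproduced with a short PHP sketch. It assumes preg_match_all() is used for global matching, because PHP's PCRE functions reject the g modifier shown in the pattern above; the variable names are illustrative only.

```php
<?php
// Pattern from the question with the trailing "g" dropped: PHP has no "g"
// modifier, preg_match_all() already returns every match.
$pattern = '/(\b[\w][\w .\'-]+[\w]\b)/';

$sentence = "I want a coke, 5 3 x, druuu95 and r0ck'n'r011";

preg_match_all($pattern, $sentence, $found);
$matches = $found[0]; // full-pattern matches

print_r($matches);
// The greedy middle class [\w .'-]+ swallows spaces, so whole runs of
// words are returned as one match:
// "I want a coke", "5 3 x", "druuu95 and r0ck'n'r011"
```

The comma is the only character in the sentence that the middle character class does not accept, which is why the sentence splits at the commas and nowhere else.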

What I need is 8 matches:

  • I
  • want
  • a
  • coke
  • 5 3 x
  • druuu95
  • and
  • r0ck'n'r011

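In short, I need a regular expression that returns each word of the sentence, even if the word starts with a digit, contains a variable number of digits, spaces, dots, dashes and quotes, and ends with a letter or a digit.

Any help would be appreciated.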
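
One way to picture the requirement stated in the question is the sketch below. It is not the accepted answer's approach, only a heuristic under a stated assumption: a run of single characters separated by spaces, dots, quotes or dashes is treated as one spaced-out word, and everything else is split on ordinary word boundaries. The pattern and variable names are hypothetical.

```php
<?php
// Alternative 1: single characters chained by separators, e.g. "5 3 x".
// Alternative 2: an ordinary word, allowing inner quotes, dots or dashes.
// Both alternatives are an assumption made for illustration.
$tokenPattern = '/\b\w\b(?:[ .\'-]+\w\b)+|\w+(?:[\'.-]\w+)*/';

$sentence = "I want a coke, 5 3 x, druuu95 and r0ck'n'r011";

preg_match_all($tokenPattern, $sentence, $found);
$tokens = $found[0];

print_r($tokens);
// "I", "want", "a", "coke", "5 3 x", "druuu95", "and", "r0ck'n'r011"
```

The heuristic has obvious gaps: two adjacent one-letter words such as "I a" would be merged into a single token, and a variant like ss 33 xxx is still split into three tokens.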

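Best answer

Description

Good words are usually 2 or more letters long (with the exception of I and a) and do not contain digits. This expression is not flawless, but it does help illustrate why this kind of language matching is exceptionally difficult: it is an arms race between creative people trying to express themselves without being detected and a development team trying to catch them.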

(?:\s+|\A)[#'"[({]?(?!(?:[a-z]{2}\s+){3})(?:[a-zA-Z'-]{2,}|[ia]|i[nst]|o[fnr])[?!.,;:'")}\]]?(?=(?:\s|\Z))|((?:[a-z]{2}\s+){3}|.*?\b)

[Image: regular expression visualization]

This regular expression will do the following:

  • find all acceptable words
  • find everything else and store it in Capture Group 1
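
The two behaviours listed above can be tried out with a sketch like the following. The ~ delimiter, the i flag, the function name classifyFragments and the decision to discard empty or whitespace-only matches are all assumptions made for this illustration; the answer itself only gives the bare pattern.

```php
<?php
// The answer's pattern in a nowdoc, so quotes and backslashes need no
// escaping. Delimiter "~" and flag "i" are assumptions.
$pattern = <<<'REGEX'
~(?:\s+|\A)[#'"[({]?(?!(?:[a-z]{2}\s+){3})(?:[a-zA-Z'-]{2,}|[ia]|i[nst]|o[fnr])[?!.,;:'")}\]]?(?=(?:\s|\Z))|((?:[a-z]{2}\s+){3}|.*?\b)~i
REGEX;

// Hypothetical helper: splits matches into "accepted" (first alternative)
// and "rejected" (anything that filled capture group 1).
function classifyFragments(string $pattern, string $sentence): array
{
    $accepted = [];
    $rejected = [];
    preg_match_all($pattern, $sentence, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $whole = trim($match[0]);
        if ($whole === '') {
            continue; // skip empty and whitespace-only matches
        }
        if (($match[1] ?? '') !== '') {
            $rejected[] = $whole; // came from capture group 1
        } else {
            $accepted[] = $whole; // an acceptable word
        }
    }
    return ['accepted' => $accepted, 'rejected' => $rejected];
}

$result = classifyFragments($pattern, "I want a coke, 5 3 x, druuu95 and r0ck'n'r011");
print_r($result);
```

Because the second alternative ends in a lazy .*?\b, group 1 tends to receive small fragments, including stray punctuation, rather than whole obfuscated words. The sketch only separates the two alternatives and does not decide which fragments are actually offensive.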

Example

Live demo

https://regex101.com/r/cL2bN1/1

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
(?:                      group, but do not capture:
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                           or more times (matching the most amount
                           possible))
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  \A                       the beginning of the string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
[#'"[({]?                any character of: '#', ''', '"', '[', '(',
                         '{' (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
(?!                      look ahead to see if there is not:
----------------------------------------------------------------------
  (?:                      group, but do not capture (3 times):
----------------------------------------------------------------------
    [a-z]{2}                 any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
    \s+                      whitespace (\n, \r, \t, \f, and " ")
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  ){3}                     end of grouping
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
(?:                      group, but do not capture:
----------------------------------------------------------------------
  [a-zA-Z'-]{2,}           any character of: 'a' to 'z', 'A' to
                           'Z', ''', '-' (at least 2 times
                           (matching the most amount possible))
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  [ia]                     any character of: 'i', 'a'
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  i                        'i'
----------------------------------------------------------------------
  [nst]                    any character of: 'n', 's', 't'
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  o                        'o'
----------------------------------------------------------------------
  [fnr]                    any character of: 'f', 'n', 'r'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
[?!.,;:'")}\]]?          any character of: '?', '!', '.', ',', ';',
                         ':', ''', '"', ')', '}', '\]' (optional
                         (matching the most amount possible))
----------------------------------------------------------------------
(?=                      look ahead to see if there is:
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    \Z                       before an optional \n, and the end of
                             the string
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
(                        group and capture to \1:
----------------------------------------------------------------------
  (?:                      group, but do not capture (3 times):
----------------------------------------------------------------------
    [a-z]{2}                 any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
    \s+                      whitespace (\n, \r, \t, \f, and " ")
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  ){3}                     end of grouping
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w)
                           and something that is not a word char
----------------------------------------------------------------------
)                        end of \1
----------------------------------------------------------------------
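
The sub-pattern (?:[a-z]{2}\s+){3}, which appears both in the negative look-ahead and in capture group 1, is easier to judge in isolation. The sketch below runs it alone against three sample strings; the i flag, the sample strings and the variable names are assumptions made for this illustration.

```php
<?php
// Three two-letter chunks, each followed by whitespace. In the full
// pattern this is meant to catch words chopped into pairs of letters.
$chunkRun = '/(?:[a-z]{2}\s+){3}/i';

$samples = ["se xd ru gs", "I want a coke", "it is ok to go"];

$flagged = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 when the sub-pattern occurs anywhere.
    $flagged[$sample] = preg_match($chunkRun, $sample) === 1;
}

var_dump($flagged);
// "se xd ru gs"    => true  (chopped-up text)
// "I want a coke"  => false
// "it is ok to go" => true  (ordinary two-letter words are flagged too)
```

The last sample shows one reason the answer calls the expression "not flawless": any innocent run of three two-letter words triggers the same rule.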

Regarding php - Regular expression to match suspicious words in a string, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38235660/
