
regex - Elisp mechanism for converting PCRE regexps to Emacs regexps

Reposted · Author: 行者123 · Updated: 2023-12-03 10:40:46

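I admit a significant bias toward liking PCRE regexps much better than Emacs ones, if for no other reason than that when I type a '(' I pretty much always want a grouping operator. And, of course, \w and similar are so much more convenient than the other equivalents.

But it would be crazy to expect to change the internals of Emacs, of course. Still, I'd think it should be possible to convert from a PCRE expression to an Emacs expression and do all the needed conversions, so that I can write: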

(defun my-super-regexp-function ...
  (re-search-forward (pcre-convert "__\\w: \\d+")))

(or similar).

Does anyone know of an elisp library that can do this?
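To make the complaint about '(' concrete, here is a small sketch using only built-in functions. The constant names are hypothetical and exist only for this illustration. It shows that a bare parenthesis is a literal character in an Emacs regexp, while grouping needs a backslash that must itself be doubled inside a Lisp string.

```elisp
;; A bare "(" is a literal parenthesis in an Emacs regexp.
(defconst my-literal-parens "(ab)+"
  "Matches the literal text \"(ab\" followed by one or more \")\".")

;; Grouping is written \( ... \), and each backslash is doubled
;; because the regexp lives inside a Lisp string.
(defconst my-grouped "\\(ab\\)+"
  "Matches one or more repetitions of \"ab\".")

(string-match-p my-literal-parens "(ab)")   ; => 0
(string-match-p my-literal-parens "abab")   ; => nil
(string-match-p my-grouped "abab")          ; => 0
```

A PCRE user would expect the first pattern to behave like the second, which is the kind of translation the question asks for.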

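Edit: Selecting a response from the answers below...

Wow, I love coming back from 4 days of vacation to find a slew of interesting answers to sort through! I love the work that went into the solutions of both types.

In the end, it looks like both the exec-a-script and the straight elisp versions of the solution would work, but from a pure speed and "correctness" standpoint the elisp version is certainly the one people would prefer (myself included).

Best answer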

https://github.com/joddie/pcre2el is the up-to-date version of this answer.

pcre2el or rxt (RegeXp Translator or RegeXp Tools) is a utility for working with regular expressions in Emacs, based on a recursive-descent parser for regexp syntax. In addition to converting (a subset of) PCRE syntax into its Emacs equivalent, it can do the following:

  • convert Emacs syntax to PCRE
  • convert either syntax to rx, an S-expression based regexp syntax
  • untangle complex regexps by showing the parse tree in rx form and highlighting the corresponding chunks of code
  • show the complete list of strings (productions) matching a regexp, provided the list is finite
  • provide live font-locking of regexp syntax (so far only for Elisp buffers – other modes on the TODO list)
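For readers who have not met rx, the S-expression syntax mentioned in the list: the sketch below uses only Emacs' built-in `rx` macro, not pcre2el itself, and the constant name is hypothetical. It spells out by hand a pattern close to the question's "__\w: \d+".

```elisp
(require 'rx)

;; Hypothetical name; the pattern mirrors the question's "__\w: \d+".
(defconst my-label-regexp
  (rx "__" wordchar ": " (one-or-more digit))
  "Emacs regexp string built from an rx form.")

(string-match-p my-label-regexp "__a: 42")   ; => 0
(string-match-p my-label-regexp "__a: x")    ; => nil
```

The `rx` form expands to an ordinary Emacs regexp string, so it can be passed to `re-search-forward` and friends like any other regexp.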


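The text of the original answer follows...

Here's a quick and ugly Emacs Lisp solution (EDIT: now located more permanently here). It is based mostly on the description in the pcrepattern man page and works token by token, converting only the following constructions:
  • parenthesis grouping ( .. )
  • alternation |
  • numerical repeats {M,N}
  • string quoting \Q .. \E
  • simple character escapes: \a, \c, \e, \f, \n, \r, \t, \x, and \ + octal digits
  • character classes: \d, \D, \h, \H, \s, \S, \v, \V
  • \w and \W are left as they are (using Emacs' own idea of word and non-word characters)

It doesn't do anything with more complicated PCRE assertions, but it does try to convert escapes inside character classes. When a character class includes something like \D, this is done by converting the class into a non-capturing group with alternation.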
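As a worked example of that last point, consider the PCRE class [0-3\D], meaning "a digit from 0 to 3, or any non-digit". Emacs bracket expressions cannot contain \D, so the class is split into an ordinary bracket expression plus an alternative. The regexp below is what the conversion should produce for this input, as far as I can tell from reading the code; the constant name is hypothetical.

```elisp
;; PCRE:   [0-3\D]
;; Emacs:  \(?:[0-3]\|[^0-9]\)
;; \(?: ... \) is a non-capturing ("shy") group, \| is alternation.
(defconst my-class-with-nondigit "\\(?:[0-3]\\|[^0-9]\\)"
  "Hand-written Emacs equivalent of the PCRE class [0-3\\D].")

(string-match-p my-class-with-nondigit "2")   ; => 0   (digit in 0-3)
(string-match-p my-class-with-nondigit "x")   ; => 0   (non-digit)
(string-match-p my-class-with-nondigit "7")   ; => nil (digit outside 0-3)
```

Using a shy group keeps the numbering of any capturing groups in the rest of the pattern unchanged.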

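It passes the tests I wrote for it, but there are certainly bugs, and the token-by-token scanning approach is probably slow. In other words, no warranty. But perhaps it will do the simpler part of the job for some purposes. Interested parties are invited to improve it ;-)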
    (eval-when-compile (require 'cl))

    (defvar pcre-horizontal-whitespace-chars
      (mapconcat 'char-to-string
                 '(#x0009 #x0020 #x00A0 #x1680 #x180E #x2000 #x2001 #x2002 #x2003
                   #x2004 #x2005 #x2006 #x2007 #x2008 #x2009 #x200A #x202F
                   #x205F #x3000)
                 ""))

    (defvar pcre-vertical-whitespace-chars
      (mapconcat 'char-to-string
                 '(#x000A #x000B #x000C #x000D #x0085 #x2028 #x2029) ""))

    (defvar pcre-whitespace-chars
      (mapconcat 'char-to-string '(9 10 12 13 32) ""))

    (defvar pcre-horizontal-whitespace
      (concat "[" pcre-horizontal-whitespace-chars "]"))

    (defvar pcre-non-horizontal-whitespace
      (concat "[^" pcre-horizontal-whitespace-chars "]"))

    (defvar pcre-vertical-whitespace
      (concat "[" pcre-vertical-whitespace-chars "]"))

    (defvar pcre-non-vertical-whitespace
      (concat "[^" pcre-vertical-whitespace-chars "]"))

    (defvar pcre-whitespace (concat "[" pcre-whitespace-chars "]"))

    (defvar pcre-non-whitespace (concat "[^" pcre-whitespace-chars "]"))

    (eval-when-compile
      (defmacro pcre-token-case (&rest cases)
        "Consume a token at point and evaluate corresponding forms.

    CASES is a list of `cond'-like clauses, (REGEXP FORMS
    ...). Considering CASES in order, if the text at point matches
    REGEXP then moves point over the matched string and returns the
    value of FORMS. Returns `nil' if none of the CASES matches."
        (declare (debug (&rest (sexp &rest form))))
        `(cond
          ,@(mapcar
             (lambda (case)
               (let ((token (car case))
                     (action (cdr case)))
                 `((looking-at ,token)
                   (goto-char (match-end 0))
                   ,@action)))
             cases)
          (t nil))))

    (defun pcre-to-elisp (pcre)
      "Convert PCRE, a regexp in PCRE notation, into Elisp string form."
      (with-temp-buffer
        (insert pcre)
        (goto-char (point-min))
        (let ((capture-count 0) (accum '())
              (case-fold-search nil))
          (while (not (eobp))
            (let ((translated
                   (or
                    ;; Handle tokens that are treated the same in
                    ;; character classes
                    (pcre-re-or-class-token-to-elisp)

                    ;; Other tokens
                    (pcre-token-case
                     ("|" "\\|")
                     ;; Count opening parentheses so that backreferences
                     ;; can be told apart from octal escapes below.
                     ("(" (setq capture-count (1+ capture-count)) "\\(")
                     (")" "\\)")
                     ("{" "\\{")
                     ("}" "\\}")

                     ;; Character class
                     ("\\[" (pcre-char-class-to-elisp))

                     ;; Backslash + digits => backreference or octal char?
                     ("\\\\\\([0-9]+\\)"
                      (let* ((digits (match-string 1))
                             (dec (string-to-number digits)))
                        ;; from "man pcrepattern": If the number is
                        ;; less than 10, or if there have been at
                        ;; least that many previous capturing left
                        ;; parentheses in the expression, the entire
                        ;; sequence is taken as a back reference.
                        (cond ((< dec 10) (concat "\\" digits))
                              ((>= capture-count dec)
                               (error "backreference \\%s can't be used in Emacs regexps"
                                      digits))
                              (t
                               ;; from "man pcrepattern": if the
                               ;; decimal number is greater than 9 and
                               ;; there have not been that many
                               ;; capturing subpatterns, PCRE re-reads
                               ;; up to three octal digits following
                               ;; the backslash, and uses them to
                               ;; generate a data character. Any
                               ;; subsequent digits stand for
                               ;; themselves.
                               (goto-char (match-beginning 1))
                               (re-search-forward "[0-7]\\{0,3\\}")
                               (char-to-string (string-to-number (match-string 0) 8))))))

                     ;; Regexp quoting.
                     ("\\\\Q"
                      (let ((beginning (point)))
                        (search-forward "\\E")
                        (regexp-quote (buffer-substring beginning (match-beginning 0)))))

                     ;; Various character classes
                     ("\\\\d" "[0-9]")
                     ("\\\\D" "[^0-9]")
                     ("\\\\h" pcre-horizontal-whitespace)
                     ("\\\\H" pcre-non-horizontal-whitespace)
                     ("\\\\s" pcre-whitespace)
                     ("\\\\S" pcre-non-whitespace)
                     ("\\\\v" pcre-vertical-whitespace)
                     ("\\\\V" pcre-non-vertical-whitespace)

                     ;; Use Emacs' native notion of word characters
                     ("\\\\[Ww]" (match-string 0))

                     ;; Any other escaped character
                     ("\\\\\\(.\\)" (regexp-quote (match-string 1)))

                     ;; Any normal character.  "." does not match a
                     ;; newline, so match it explicitly; otherwise a
                     ;; newline in PCRE would make this loop spin forever.
                     (".\\|\n" (match-string 0))))))
              (push translated accum)))
          (apply 'concat (reverse accum)))))

    (defun pcre-re-or-class-token-to-elisp ()
      "Consume the PCRE token at point and return its Elisp equivalent.

    Handles only tokens which have the same meaning in character
    classes as outside them."
      (pcre-token-case
       ("\\\\a" (char-to-string #x07))  ; bell
       ("\\\\c\\(.\\)"                  ; control character
        (char-to-string
         (- (string-to-char (upcase (match-string 1))) 64)))
       ("\\\\e" (char-to-string #x1b))  ; escape
       ("\\\\f" (char-to-string #x0c))  ; formfeed
       ("\\\\n" (char-to-string #x0a))  ; linefeed
       ("\\\\r" (char-to-string #x0d))  ; carriage return
       ("\\\\t" (char-to-string #x09))  ; tab
       ("\\\\x\\([A-Za-z0-9]\\{2\\}\\)"
        (char-to-string (string-to-number (match-string 1) 16)))
       ("\\\\x{\\([A-Za-z0-9]*\\)}"
        (char-to-string (string-to-number (match-string 1) 16)))))

    (defun pcre-char-class-to-elisp ()
      "Consume the remaining PCRE character class at point and return its Elisp equivalent.

    Point should be after the opening \"[\" when this is called, and
    will be just after the closing \"]\" when it returns."
      (let ((accum '("["))
            (pcre-char-class-alternatives '())
            (negated nil))
        (when (looking-at "\\^")
          (setq negated t)
          (push "^" accum)
          (forward-char))
        (when (looking-at "\\]") (push "]" accum) (forward-char))

        (while (not (looking-at "\\]"))
          ;; Guard against an unterminated class, which would otherwise
          ;; leave point stuck at end of buffer and loop forever.
          (when (eobp)
            (error "Unterminated character class"))
          (let ((translated
                 (or
                  (pcre-re-or-class-token-to-elisp)
                  (pcre-token-case
                   ;; Backslash + digits => always an octal char
                   ("\\\\\\([0-7]\\{1,3\\}\\)"
                    (char-to-string (string-to-number (match-string 1) 8)))

                   ;; Various character classes. To implement negative char classes,
                   ;; we cons them onto the list `pcre-char-class-alternatives' and
                   ;; transform the char class into a shy group with alternation
                   ("\\\\d" "0-9")
                   ("\\\\D" (push (if negated "[0-9]" "[^0-9]")
                                  pcre-char-class-alternatives) "")
                   ("\\\\h" pcre-horizontal-whitespace-chars)
                   ("\\\\H" (push (if negated
                                      pcre-horizontal-whitespace
                                    pcre-non-horizontal-whitespace)
                                  pcre-char-class-alternatives) "")
                   ("\\\\s" pcre-whitespace-chars)
                   ("\\\\S" (push (if negated
                                      pcre-whitespace
                                    pcre-non-whitespace)
                                  pcre-char-class-alternatives) "")
                   ("\\\\v" pcre-vertical-whitespace-chars)
                   ("\\\\V" (push (if negated
                                      pcre-vertical-whitespace
                                    pcre-non-vertical-whitespace)
                                  pcre-char-class-alternatives) "")
                   ("\\\\w" (push (if negated "\\W" "\\w")
                                  pcre-char-class-alternatives) "")
                   ("\\\\W" (push (if negated "\\w" "\\W")
                                  pcre-char-class-alternatives) "")

                   ;; Leave POSIX syntax unchanged
                   ("\\[:[a-z]*:\\]" (match-string 0))

                   ;; Ignore other escapes
                   ("\\\\\\(.\\)" (match-string 0))

                   ;; Copy everything else (including a newline, which
                   ;; "." alone would not match)
                   (".\\|\n" (match-string 0))))))
            (push translated accum)))
        (push "]" accum)
        (forward-char)
        (let ((class
               (apply 'concat (reverse accum))))
          (when (or (equal class "[]")
                    (equal class "[^]"))
            (setq class ""))
          (if (not pcre-char-class-alternatives)
              class
            (concat "\\(?:"
                    class "\\|"
                    (mapconcat 'identity
                               pcre-char-class-alternatives
                               "\\|")
                    "\\)")))))

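The token-by-token idea behind the listing above can be seen in miniature in the following sketch. It is a deliberately reduced, hypothetical function (`my-pcre-lite-to-elisp` is not part of the answer or of pcre2el) that translates only `( ) | { }` and `\d`, and copies everything else through unchanged. It is meant to show the scanning technique, not to replace the full converter.

```elisp
;; Hypothetical, heavily reduced converter: only ( ) | { } and \d
;; are translated; every other token is copied as-is.
(defun my-pcre-lite-to-elisp (pcre)
  "Translate a tiny subset of the PCRE syntax in PCRE to an Emacs regexp."
  (with-temp-buffer
    (insert pcre)
    (goto-char (point-min))
    (let ((case-fold-search nil)        ; keep \d distinct from \D
          (accum '()))
      (while (not (eobp))
        (push
         (cond
          ;; Operators in PCRE that need a backslash in Emacs.
          ((looking-at "[(){}|]")
           (goto-char (match-end 0))
           (concat "\\" (match-string 0)))
          ;; \d has no Emacs equivalent, so spell it out.
          ((looking-at "\\\\d")
           (goto-char (match-end 0))
           "[0-9]")
          ;; Any other backslash escape is copied through unchanged.
          ((looking-at "\\\\\\(?:.\\|\n\\)")
           (goto-char (match-end 0))
           (match-string 0))
          ;; Ordinary character (including a newline).
          (t
           (forward-char)
           (string (char-before))))
         accum))
      (apply #'concat (nreverse accum)))))

(my-pcre-lite-to-elisp "(a|b){2}\\d+")   ; => "\\(a\\|b\\)\\{2\\}[0-9]+"
(string-match-p (my-pcre-lite-to-elisp "__\\w: \\d+") "__a: 42")   ; => 0
```

Each branch both consumes a token and returns its translation, so point always advances and the loop terminates at the end of the buffer. The full converter follows the same pattern through its `pcre-token-case` macro.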
The original question, "Elisp mechanism for converting PCRE regexps to Emacs regexps", is on Stack Overflow: https://stackoverflow.com/questions/9118183/
