python - Python 中的多行匹配-6ren

python - Python 中的多行匹配

转载作者：太空狗更新时间：2023-10-29 21:55:21

26

4

我已经阅读了所有我能找到的文章，甚至理解了其中的一些，但作为一个 Python 新手，我仍然有点迷茫，希望得到帮助:)

我正在编写一个脚本来从特定于应用程序的日志文件中解析感兴趣的项目，每一行都以我可以匹配的时间戳开头，我可以定义两件事来识别我想要捕获的内容，一些是部分的content 和一个字符串，它将终止我想要提取的内容。

我的问题是多行的，在大多数情况下，每个日志行都以换行符终止，但有些条目包含 SQL，其中可能有新行，因此会在日志中创建新行。

所以，在一个简单的情况下，我可能会这样:

[8/21/13 11:30:33:557 PDT] 00000488 SystemOut     O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item  where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,'  $AAAA  ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc  (execution took 2083 milliseconds)

这一切都显示为一行，我可以与之匹配:

re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2}).*(milliseconds)')

然而，在某些情况下，SQL 中可能会有换行符，因此我仍想捕获它(并可能用空格替换换行符)。我目前正在一次一行地阅读文件，这显然行不通，所以...

我需要一次处理整个文件吗？它们的大小通常为 20mb。如何读取整个文件并遍历它以查找单行或多行 block ？
我该如何编写一个多行正则表达式来匹配一行中的所有内容或分布在多行中的内容？

我的总体目标是对其进行参数化，以便我可以使用它来提取匹配起始字符串(始终是一行的开头)、结束字符串(我想要捕获到的位置)和值的不同模式的日志条目这是它们之间的标识符。

在此先感谢您的帮助!

克里斯。

import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
lines = []
print "--- START ----"
lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')

lines = []
with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line):
            if lineEndsWith.match(line) :
                print 'Full Line Found'
                print line
                print "- Record Separator -"
            else:
                print 'Partial Line Found'
                print line
                print "- Record Separator -"

print "--- DONE ----"

下一步，对于我的部分行，我将继续阅读，直到找到 lineEndsWith 并将这些行组合成一个 block 。

我不是专家，所以随时欢迎提出建议!

更新 - 所以我让它工作了，感谢所有帮助指导事情的回应，我意识到它并不漂亮，我需要清理我的 if/elif 困惑并使其更有效率，但它正在工作!感谢所有帮助。

import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"

print "--- START ----"

lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')

lines = []

multiLine = False

with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line) and lineEndsWith.match(line):
            lines.append(line.replace("\n", " "))
        elif lineStartsWith.match(line) and lineContains.match(line) and not multiLine:
            #Found the start of a multi-line entry
            multiLineString = line
            multiLine = True
        elif multiLine and not lineEndsWith.match(line):
            multiLineString = multiLineString + line
        elif multiLine and lineEndsWith.match(line):
            multiLineString = multiLineString + line
            multiLineString = multiLineString.replace("\n", " ")
            lines.append(multiLineString)
            multiLine = False

for line in lines:
    print line

最佳答案

Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?

这里有两个选项。

您可以逐 block 读取文件，确保将每个 block 末尾的任何“剩余”位附加到下一个 block 的开头，然后搜索每个 block 。当然，您必须通过查看您的数据格式以及您的正则表达式可以匹配的内容来弄清楚什么算作“剩余”，理论上可以将多个 block 都算作剩余……

或者你可以 mmap文件。 mmap 的作用类似于字节(或类似于 Python 2.x 中的 str)，并由操作系统根据需要处理分页 block 的进出。除非您尝试处理绝对巨大的文件(32 位为千兆字节，64 位甚至更多)，否则这是微不足道且高效的:

with open('bigfile', 'rb') as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        for match in compiled_re.finditer(m):
            do_stuff(match)

在旧版本的 Python 中，mmap 不是上下文管理器，因此您需要将其包裹在 contextlib.closing 中(或者只使用显式 close 如果你愿意的话)。

我将如何编写一个多行正则表达式来匹配一行中的整个内容或它分布在多行中？

您可以使用 DOTALL标志，它使 . 匹配换行符。您可以改为使用 MULTILINE标记并放入适当的 $ 和/或 ^ 字符，但这会使简单的情况变得更加困难，而且很少需要这样做。这是一个 DOTALL 示例(使用更简单的正则表达式使其更明显):

>>> s1 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut     O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item  where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,'  $AAAA  ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc  (execution took 2083 milliseconds)"""
>>> s2 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut     O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item  where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and 
    (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,'  $AAAA  ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc  (execution took 2083 milliseconds)"""
>>> r = re.compile(r'\[(.*?)\].*?milliseconds\)', re.DOTALL)
>>> r.findall(s1)
['8/21/13 11:30:33:557 PDF']
>>> r.findall(s2)
['8/21/13 11:30:33:557 PDF']

如您所见，第二个 .*? 与空格一样容易匹配换行符。

如果您只是想将换行符视为空格，则两者都不需要； '\s' 已经捕获换行符。

例如:

>>> s1 = 'abc def\nghi\n'
>>> s2 = 'abc\ndef\nghi\n'
>>> r = re.compile(r'abc\s+def')
>>> r.findall(s1)
['abc def']
>>> r.findall(s2)
['abc\ndef']

关于python - Python 中的多行匹配，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/18494497/

26

4

0

文章推荐： python - 在 cython 中创建小数组需要大量时间

文章推荐： c# - 创建动画按钮

文章推荐： python - 放入 Python 异常消息的详细信息量的约定？

awk - 如果行与“foo”匹配，线上方与“bar”匹配，线下方与“baz”匹配，则删除行？
使用sed和/或awk，仅在行包含字符串“ foo”并且行之前和之后的行分别包含字符串“ bar”和“ baz”时，我才希望删除行。因此，对于此输入： blah blah foo blah bar
c# - 如何按 X% 匹配 2 个字符串(即 >90% 匹配)
例如: S1: "some filename contains few words.txt" S2:“一些文件名包含几个单词 - draft.txt” S3:“一些文件名包含几个单词 - 另一个 dr
R 合并数据帧，允许不精确的 ID 匹配(例如，附加字符 1234 匹配 ab1234)
我正在尝试处理一些非常困惑的数据。我需要通过样本 ID 合并两个包含不同类型数据的大数据框。问题是一张表的样本 ID 有许多不同的格式，但大多数都包含用于匹配其 ID 中某处所需的 ID 字符串，例如
css - 匹配 col-md 时显示 div，匹配 col-sm 时不显示
我想在匹配特定屏幕尺寸时显示特定图像。在这种情况下，对于 Bootstrap ，我使用 col-xx-## 作为我的选择。但似乎它并没有真正按照我认为应该的方式工作。基本思路，我想显示一种全屏图像，
apache - mod_rewrite 问题 : RewriteCond %{REQUEST_FILENAME} ! -f 匹配，即使 REQUEST_FILENAME 不应(完全)匹配
出于某种原因，这条规则 RewriteCond %{REQUEST_FILENAME} !-f RewriteCond %{REQUEST_FILENAME} !-d RewriteRule ^(.*
F# 匹配 ->
我想做类似的东西(Nemerle 语法) def something = match(STT) | 1 with st= "Summ" | 2 with st= "AVG" =>
JavaScript 匹配
假设这是我的代码 var str="abc=1234587;abc=19855284;abc=1234587;abc=19855284;abc=1234587;abc=19855284;abc=123
JavaScript 匹配
我怎样才能得到这个字符串的数字:'(31.5393701, -82.46235569999999)' 我已经在尝试了，但这离解决方案还很远:) text.match(/$(\d+),(\d+)$/
JavaScript 匹配
如何去除输出中的逗号 (,)？有没有更好的方法从字符串或句子中搜索 url。 alert(" http://www.cnn.com df".match(/https?:\/\/([-\w\.]+
Python - 匹配
a = ('one', 'two') b = ('ten', 'ten') z = [('four', 'five', 'six'), ('one', 'two', 'twenty')] 我正在尝试
vba - 循环遍历行和列时的索引/匹配
我已经编写了以下代码，我希望用它来查找从第 21 列到另一张表中最后一行的值，并根据这张表中 A 列和另一张表中 B 列中的值将它们返回到这张表床单。当我使用下面的代码时，我得到一个工作表错误。你能
Excel 匹配 IF 语句未正确评估
我在以下结构中有两列 A B 1 49 4922039670 我已经能够评估 =LEN(A1)如2 , =LEFT(B1,2)如49 , 和 =LEFT(B1,LEN(A1)
基于行首的 Vim 匹配
我有一个文件，其中一行可以以 + 开头, -或 * .在其中一些行之间可以有以字母或数字(一般文本)开头的行(也包含这些字符，但不在第 1 列中!)。知道这一点，设置匹配和突出显示机制的最简单方法是
正则表达式:匹配，但如果在评论中则不匹配
我有一个数据字段文件，其中可能包含注释，如下所示: id, data, data, data 101 a, b, c 102 d, e, f 103 g, h, i // has to do with
匹配 url 的正则表达式模式
我有以下模式:/^\/(?P.+)$/匹配:/url . 我的问题是它也匹配 /url/page ，如何忽略/在这个正则表达式中？该模式应该: 模式匹配:/url 模式不匹配:/url/page 提
r - R中多维度的聚类/匹配
我有一个非常庞大且复杂的数据集，其中包含许多对公司的观察。公司的一些观察是多余的，我需要制作一个键来将多余的观察映射到一个单独的观察。然而，判断他们是否真的代表同一家公司的唯一方法是通过各种变量的相似
xpath 匹配 - 查找值不在值集中的标签是否存在
我有以下 XML A B C 我想查找 if not(exists(//Record/subRecord
javascript - 匹配/不匹配的正则表达式上没有出现警报框？
我制作了一个正则表达式来验证潜在的比特币地址，现在当我单击报价按钮时，我希望根据正则表达式检查表单中输入的值，但它不起作用。 https://jsfiddle.net/arkqdc8a/5/ var
sql - 检查支架是否平衡/匹配
我有一些 MS Word 文档，我已将其全部内容转移到 SQL 表中。内容包含多个方括号和大括号，例如 [{a} as at [b],] {c,} {d,} etc 我需要进行检查以确保括号平衡/匹
JavaScript Unicode 匹配
我正在使用 Node.js 从 XML 文件读取数据。但是当我尝试将文件中的数据与文字进行比较时，它不匹配，即使它看起来相同: const parser: xml2js.Parser = new

首页

博学

6Ren·AI

商城

python - Python 中的多行匹配