python - How to parse this custom log file in Python

Reposted by: 行者123 Updated: 2023-11-28 19:43:25

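I'm using Python logging to generate log files during processing, and I'm trying to read those log files into a list/dictionary, which will then be converted to JSON and loaded into a NoSQL database for processing.

The files are generated in the following format.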

2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:46:56,645 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/
Traceback (most recent call last):
  File "<ipython-input-16-132cda1c011d>", line 10, in <module>
    if numFilesDownloaded == 0:
NameError: name 'numFilesDownloaded' is not defined
2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:49:32,160 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:49:39,329 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:53:30,706 - __main__ - INFO - Starting to Wait for Files

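Note: there is actually a \n break before each new date you see, but I can't seem to represent it here.

Basically I'm trying to read this text file and produce a JSON object like the following: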

{
    'Date': '2015-05-22 16:46:46,985',
    'Type': 'INFO',
    'Message':'Starting to Wait for Files'
}
...

{
    'Date': '2015-05-22 16:48:48,180',
    'Type': 'ERROR',
    'Message':'Failed: Waiting for files the Files from Cloud Storage:  gs://folder/anotherfolder/ Traceback (most recent call last):
               File "<ipython-input-16-132cda1c011d>", line 10, in <module> if numFilesDownloaded == 0: NameError: name 'numFilesDownloaded' is not defined '
}

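The problem I'm having:

I can add each line to a list or dictionary, etc., but the error message sometimes spans multiple lines, so I end up splitting it incorrectly.

Tried:

I tried using code like the below to split lines only on valid dates, but I can't seem to get the error messages that span multiple lines. I also tried regular expressions and think they are a possible solution, but I can't seem to find the right regex to use. I don't really know how they work, so I tried a bunch of copy-paste without any success.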

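import itertools as it

listNew = []
with open(filename, 'r') as f:
    # key is True for lines starting with '2015', False for continuation lines
    for key, group in it.groupby(f, lambda line: line.startswith('2015')):
        if key:
            # only the dated lines are kept; traceback lines (key False) are dropped
            for line in group:
                listNew.append(line)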

I tried some crazy regex too, but had no luck either:

logList = re.split(r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])', fileData)
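One likely reason the re.split attempt above misbehaves is that re.split also returns the text of every capturing group, and the date itself is consumed as the separator. A minimal sketch of an alternative, assuming the timestamp format shown in the sample log and Python 3.7 or later (needed for splitting on an empty match), is to split on a zero-width lookahead so each record keeps its own date. The names RECORD_START and split_records are made up for illustration.

```python
import re

# Zero-width lookahead: split *before* each line that starts with a timestamp,
# so the timestamp stays attached to its record.
RECORD_START = re.compile(r"(?m)^(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} - )")

def split_records(file_data):
    """Return one string per log record, traceback lines included."""
    # The split at position 0 produces an empty first chunk; drop it.
    return [chunk for chunk in RECORD_START.split(file_data) if chunk.strip()]

sample = (
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files\n"
    "Traceback (most recent call last):\n"
    "NameError: name 'numFilesDownloaded' is not defined\n"
    "2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files\n"
)
records = split_records(sample)
```

Here the traceback lines stay inside the ERROR record because none of them begins with a timestamp.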

Any help is much appreciated... thanks

Edit:

Posted a solution below for anyone else who runs into the same problem.

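Best answer

Using @Joran Beasley's answer, I came up with the following solution, which seems to work:

Key points:

My log files always follow the same structure: {Date} - {Type} - {Message}, so I used string slicing and splitting to break the items up the way I needed them. For example, the {Date} is always 23 characters and I only want the first 19 characters.
Using line.startswith("2015") is crazy because the date will eventually change, so I created a new function that uses some regex to match the date format I'm expecting. Once again, my log dates follow a specific pattern, so I could get specific.
The file is read into the first function, "generateDicts()", which then calls the "matchDate()" function to see whether the line being processed matches the {Date} format I'm looking for.
A new dictionary is created every time a valid {Date} format is found, and everything is processed until the next valid {Date} is encountered.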
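To make the slicing in the first point concrete, here is a small sketch on a single line from the sample log. It assumes the fields are separated by " - " exactly as in that sample; the variable names are only for illustration, and the datetime parsing is an optional extra that the original solution does not do.

```python
from datetime import datetime

line = "2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files\n"

stamp = line[:23]   # full timestamp, including the ",985" milliseconds
date = line[:19]    # timestamp without the milliseconds

# Split on " - " at most 3 times so any " - " inside the message survives.
_, logger_name, level, message = line.rstrip("\n").split(" - ", 3)

# Optional: turn the 23-character timestamp into a datetime object.
parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
```

Splitting on the three-character separator rather than a bare "-" avoids cutting through the hyphens inside the date.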

The function to split up the log file.

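def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if line.startswith(matchDate(line)):
            # A dated line starts a new record; emit the previous one first.
            if currentDict:
                yield currentDict
            # Fields are "date - logger - level - message"; split at most 3 times
            # so any " - " inside the message is left alone.
            parts = line.split(" - ", 3)
            currentDict = {"date": line[:19], "type": parts[2], "text": parts[-1]}
        elif currentDict:
            # Continuation line (e.g. a traceback): append to the current message.
            currentDict["text"] += line
    if currentDict:
        yield currentDict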

with open("/Users/stevenlevey/Documents/out_folder/out_loyaltybox/log_CardsReport_20150522164636.logs") as f:
    listNew= list(generateDicts(f))

The function that checks whether the line being processed starts with a {Date} matching the format I'm looking for.

import re

def matchDate(line):
    # Return the leading "YYYY-MM-DD HH:MM:SS" timestamp, or "NONE" if the line has none.
    matched = re.match(r'\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d', line)
    if matched:
        return matched.group()
    return "NONE"
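Once the records are dictionaries, the JSON step mentioned in the question could look roughly like this. It is only a sketch: the sample dictionaries are hand-written stand-ins using the "date", "type" and "text" keys, to_json_lines is a hypothetical helper, and how the result gets loaded into a particular NoSQL database is left out because that depends on the database.

```python
import json

def to_json_lines(records):
    """Serialize each record dict as one JSON document per line."""
    return "\n".join(json.dumps(record) for record in records)

# Hand-written stand-ins for the dictionaries produced while parsing the log.
records = [
    {"date": "2015-05-22 16:46:46", "type": "INFO",
     "text": "Starting to Wait for Files\n"},
    {"date": "2015-05-22 16:48:48", "type": "ERROR",
     "text": "Failed: Waiting for files\n"
             "Traceback (most recent call last):\n"
             "NameError: name 'numFilesDownloaded' is not defined\n"},
]
json_lines = to_json_lines(records)
```

json.dumps escapes the newlines inside a multi-line message, so each record still occupies exactly one line of output.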

Regarding python - How to parse this custom log file in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30627810/