
python - Parsing a wget log file in Python

Reprinted. Author: 太空宇宙. Updated: 2023-11-03 18:04:55

I have a wget log file and would like to parse it so that I can extract the relevant pieces of information from each log entry, e.g. the IP address, timestamp, URL, etc.

A sample log file is printed below. The number of lines and the level of detail are not identical for every entry; what is consistent is the notation of each line.

I am able to extract the individual lines, but what I really want is a multidimensional array (or something similar):

import re

# Read the whole log file into a single string.
f = open('c:/r1/log.txt', 'r').read()

# Match the lines that start a new entry, e.g. "--2014-11-22 10:51:31--  http://..."
split_log = re.findall('--[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}.*', f)

print(split_log)
print(len(split_log))

for element in split_log:
    print(element)


####### Start log file example

2014-11-22 10:51:31 (96.9 KB/s) - `C:/r1/www.itb.ie/AboutITB/index.html' saved [13302]

--2014-11-22 10:51:31-- http://www.itb.ie/CurrentStudents/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/CurrentStudents/index.html'

0K .......... ....... 109K=0.2s

2014-11-22 10:51:31 (109 KB/s) - `C:/r1/www.itb.ie/CurrentStudents/index.html' saved [17429]

--2014-11-22 10:51:32-- h ttp://www.itb.ie/Vacancies/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Vacancies/index.html'

0K .......... .......... .. 118K=0.2s

2014-11-22 10:51:32 (118 KB/s) - `C:/r1/www.itb.ie/Vacancies/index.html' saved [23010]

--2014-11-22 10:51:32-- h ttp://www.itb.ie/Location/howtogetthere.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Location/howtogetthere.html'

0K .......... ....... 111K=0.2s

Best Answer

Here's how to extract the data you want and store it in a list of tuples.

The regexes I use here are not perfect, but they work well enough on your sample data. I modified your original regex to use the more readable \d instead of the equivalent [0-9]. I also used raw strings, which generally make working with regexes easier.
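As a quick standalone check (using a made-up sample string, not part of the original answer), \d matches the same ASCII digits as [0-9], and the raw-string prefix keeps backslashes literal in the pattern:

```python
import re

s = '--2014-11-22 10:51:31--'

# \d and [0-9] match the same ASCII digits, so the two patterns
# find identical results; r'...' avoids doubling backslashes.
assert re.findall(r'\d{4}', s) == re.findall('[0-9]{4}', s)
print(re.findall(r'\d{4}-\d{2}-\d{2}', s))  # ['2014-11-22']
```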

I've embedded your log data in my code as a triple-quoted string so I don't have to worry about file handling. I noticed that some of the URLs in your log file contain spaces, e.g.

h ttp://www.itb.ie/Vacancies/index.html

but I assume those spaces are an artifact of copy & paste and are not actually present in the real log data. If that's not the case, your program will need to do extra work to cope with the extraneous spaces.
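If the spaces did turn out to be real, one minimal sketch (assuming the only corruption is inserted whitespace, as in the sample; clean_url is a hypothetical helper, not part of the original answer) is to strip all whitespace from each captured URL:

```python
# Hypothetical helper: remove stray whitespace from a captured URL.
# Assumes the only corruption is inserted spaces, as in the sample log.
def clean_url(url):
    return ''.join(url.split())

print(clean_url('h ttp://www.itb.ie/Vacancies/index.html'))
# http://www.itb.ie/Vacancies/index.html
```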

I also modified the IP addresses in the log data so they are not all identical, just to make sure that each IP found by findall is correctly associated with the right timestamp and URL.

#! /usr/bin/env python

import re

log_lines = '''

2014-11-22 10:51:31 (96.9 KB/s) - `C:/r1/www.itb.ie/AboutITB/index.html' saved [13302]

--2014-11-22 10:51:31-- http://www.itb.ie/CurrentStudents/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/CurrentStudents/index.html'

0K .......... ....... 109K=0.2s

2014-11-22 10:51:31 (109 KB/s) - `C:/r1/www.itb.ie/CurrentStudents/index.html' saved [17429]

--2014-11-22 10:51:32-- http://www.itb.ie/Vacancies/index.html
Connecting to www.itb.ie|193.1.36.25|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Vacancies/index.html'

0K .......... .......... .. 118K=0.2s

2014-11-22 10:51:32 (118 KB/s) - `C:/r1/www.itb.ie/Vacancies/index.html' saved [23010]

--2014-11-22 10:51:32-- http://www.itb.ie/Location/howtogetthere.html
Connecting to www.itb.ie|193.1.36.26|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Location/howtogetthere.html'

0K .......... ....... 111K=0.2s
'''

# One regex captures (timestamp, url) from each "--timestamp--  url" line;
# a second captures the IP address from each "Connecting to host|ip|:port" line.
time_and_url_pat = re.compile(r'--(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})--\s+(.*)')
ip_pat = re.compile(r'Connecting to.*\|(.*?)\|')

time_and_url_list = time_and_url_pat.findall(log_lines)
print('\ntime and url\n', time_and_url_list)

ip_list = ip_pat.findall(log_lines)
print('\nip\n', ip_list)

# Pair each (timestamp, url) with its IP and flatten into a triple.
all_data = [(t, u, i) for (t, u), i in zip(time_and_url_list, ip_list)]
print('\nall\n', all_data, '\n')

for t in all_data:
    print(t)

Output

time and url
[('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html')]

ip
['193.1.36.24', '193.1.36.25', '193.1.36.26']

all
[('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html', '193.1.36.24'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html', '193.1.36.26')]

('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html', '193.1.36.24')
('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25')
('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html', '193.1.36.26')

The last part of this code uses a list comprehension to reorganize the data in time_and_url_list and ip_list into a single list of tuples, using the zip built-in function to process the two lists in parallel. If that part is a little hard to follow, let me know and I'll try to explain it further.
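To make the zip pairing concrete, here is a tiny self-contained illustration (using two entries taken from the output above) of how zip lines up the two lists and how the comprehension unpacks each (timestamp, url) pair and appends the matching IP:

```python
time_and_url_list = [('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html'),
                     ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html')]
ip_list = ['193.1.36.24', '193.1.36.25']

# zip yields ((t, u), i) pairs; the comprehension flattens each to (t, u, i).
all_data = [(t, u, i) for (t, u), i in zip(time_and_url_list, ip_list)]

print(all_data[1])
# ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25')
```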

Regarding python - Parsing a wget log file in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27076980/
