
python - Convert a space-aligned text file to a Pandas DataFrame

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:32:53

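I'm new to Pandas. I have a log text file, and I'm trying to extract some data points from it. The code below gives me the data I need, but not in the format I want: I'd like a Pandas DataFrame with two columns.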

import os
from collections import Counter
import pandas as pd
#print(os.getcwd())
infile = "myfile.txt"

important = []
keep_phrases = ["Host",
                "User-Agent"
                ]

with open(infile) as f:
    f = f.readlines()

for line in f:
    for phrase in keep_phrases:
        if phrase in line:
            important.append(line)
            break
#print(type(important))
print(important)
#Counter(important)
pd.DataFrame(important)

This doesn't give me two-column output. I want Host and User-Agent together in one row.

A sample of the text file:

   15 SessionOpen  c aa.bb.cc.ddd 62667 :8080
15 SessionClose c pipe
15 ReqStart c aa.bb.cc.ddd 62667 442374415
15 RxURL c /61665002001003_001/CH4_08_02_24_61665002001003_001_16x9_1500000_Seg1-Frag666
15 RxHeader c Host: ll.abrstream.channel4.com
15 RxHeader c Connection: keep-alive
15 RxHeader c User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
15 RxHeader c X-Requested-With: ShockwaveFlash/21.0.0.216
15 RxHeader c Accept: */*
15 RxHeader c Referer: http://www.channel4.com/programmes/the-tiny-tots-talent-agency/on-demand/61665-002
15 RxHeader c Accept-Encoding: gzip, deflate, sdch
15 RxHeader c Accept-Language: en-US,en;q=0.8
15 ReqEnd c 442374415 1461870946.496117592 1461870947.112555504 0.000315428 0.001363039 0.615074873
15 SessionOpen c aa1.bb1.cc1.ddd1 59409 :8080
15 SessionClose c pipe
15 ReqStart c aa1.bb1.cc1.ddd1 59409 442374416
15 RxURL c /gpsApi.php
15 RxHeader c Content-Length: 0
15 RxHeader c Host: map.yanue.net
15 RxHeader c Connection: Keep-Alive
15 RxHeader c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)
15 ReqEnd c 442374416 1461870950.580444574 1461870951.139206648 0.000064135 0.001196861 0.557565212
15 SessionOpen c aa1.bb1.cc1.ddd1 52179 :8080
15 SessionClose c pipe
15 ReqStart c aa1.bb1.cc1.ddd1 52179 442374417
15 RxURL c /gpsApi.php
15 RxHeader c Content-Length: 0
15 RxHeader c Host: map.yanue.net
15 RxHeader c Connection: Keep-Alive
15 RxHeader c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)
15 ReqEnd c 442374417 1461870951.776547432 1461870952.448071241 0.000062943 0.001109123 0.670414686
18 SessionOpen c aa.bb.cc.ddd 62670 :8080
18 SessionClose c pipe
18 ReqStart c aa.bb.cc.ddd 62670 442374418
18 RxURL c /61665002001003_001/CH4_08_02_24_61665002001003_001_16x9_1500000_Seg1-Frag667
18 RxHeader c Host: ll.abrstream.channel4.com
18 RxHeader c Connection: keep-alive
18 RxHeader c User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
18 RxHeader c X-Requested-With: ShockwaveFlash/21.0.0.216
18 RxHeader c Accept: */*
18 RxHeader c Referer: http://www.channel4.com/programmes/the-tiny-tots-talent-agency/on-demand/61665-002
18 RxHeader c Accept-Encoding: gzip, deflate, sdch
18 RxHeader c Accept-Language: en-US,en;q=0.8
18 ReqEnd c 442374418 1461870951.920178175 1461870952.507097483 0.001731873 0.001337051 0.585582256
15 SessionOpen c aa1.bb1.cc1.ddd1 48034 :8080
15 SessionClose c pipe

Best answer

You can create the DataFrame by building a list of lists and passing it to the DataFrame constructor.

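Loop over each line of the file, as you started to do, and split each line into columns. You can use re.split to build the list of columns, limiting the maximum number of splits so that the last column is treated as a single element. Alternatively, if you know the elements are always aligned the same way, you can use slicing to build that list.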

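import re

# infile is the path defined in the question, e.g. "myfile.txt"
df_list = []
with open(infile) as f:
    for line in f:
        # remove whitespace at the start and the newline at the end
        line = line.strip()
        # split on runs of whitespace; maxsplit=4 leaves everything
        # after the fourth split as a single final element
        columns = re.split(r'\s+', line, maxsplit=4)
        df_list.append(columns)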
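The slicing alternative mentioned above might look like the following sketch. The column offsets are an assumption inferred from the first sample line, which still shows its original padding (a 5-character right-aligned ID, a 12-character left-aligned tag, a one-character marker, then free text). Check them against the real file before relying on them. The helper name `split_fixed_width` is hypothetical.

```python
def split_fixed_width(line):
    """Split one log line into four columns by character position.

    Assumed layout (inferred from the padded sample line, not confirmed):
    characters 0-4 ID, 6-17 tag, 19 marker, 21 onward free-form data.
    """
    line = line.rstrip("\n")
    return [
        line[0:5].strip(),   # numeric ID, right-aligned
        line[6:18].strip(),  # tag, e.g. "SessionOpen" or "RxHeader"
        line[19:20],         # single-character marker, e.g. "c"
        line[21:],           # rest of the line, kept as one element
    ]


sample = "   15 SessionOpen  c aa.bb.cc.ddd 62667 :8080\n"
print(split_fixed_width(sample))
```

Slicing only works while the padding is intact. If the file has had its runs of spaces collapsed, as most of the pasted sample has, the re.split approach is the safer choice.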

You can then create the DataFrame using the approach from this answer.

df = pd.DataFrame(df_list)
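The question asks for Host and User-Agent side by side, one row per request. One way to get there is to collect headers per leading ID and emit a row when the request ends. This is a sketch, not part of the original answer: it assumes each request's header lines sit between a `ReqStart` and a `ReqEnd` line that share the same leading ID, as they do in the sample. The function name `host_user_agent_rows` and the column names are illustrative.

```python
import re

import pandas as pd


def host_user_agent_rows(lines):
    """Return one {"Host", "User-Agent"} dict per ReqStart..ReqEnd block."""
    open_requests = {}  # leading ID -> headers collected so far
    rows = []
    for line in lines:
        # ID, tag, marker, then the remainder as one element
        parts = re.split(r"\s+", line.strip(), maxsplit=3)
        if len(parts) < 3:
            continue  # blank or malformed line
        req_id, tag = parts[0], parts[1]
        data = parts[3] if len(parts) > 3 else ""
        if tag == "ReqStart":
            open_requests[req_id] = {"Host": None, "User-Agent": None}
        elif tag == "RxHeader" and req_id in open_requests:
            # split "Name: value" on the first colon only
            name, _, value = data.partition(":")
            if name in ("Host", "User-Agent"):
                open_requests[req_id][name] = value.strip()
        elif tag == "ReqEnd" and req_id in open_requests:
            rows.append(open_requests.pop(req_id))
    return rows


sample = """\
15 ReqStart c aa1.bb1.cc1.ddd1 59409 442374416
15 RxURL c /gpsApi.php
15 RxHeader c Content-Length: 0
15 RxHeader c Host: map.yanue.net
15 RxHeader c Connection: Keep-Alive
15 RxHeader c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)
15 ReqEnd c 442374416 1461870950.580444574 1461870951.139206648 0.000064135 0.001196861 0.557565212
""".splitlines()

df = pd.DataFrame(host_user_agent_rows(sample), columns=["Host", "User-Agent"])
print(df)
```

To run it on the real log, pass the open file object instead of `sample`. A request that is missing one of the two headers yields `None` in that column rather than being dropped.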

For python - Convert a space-aligned text file to a Pandas DataFrame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37448773/
