
python - How to get unique results for the same name with multiple values in Python

Reposted. Author: 行者123. Updated: 2023-12-04 10:43:41

I have a large CSV file, and I compare the URLs in it against the entries in my txt file.

How can I get unique results when the same name has multiple values? Also, is there a faster way to compare the two files? The CSV file is at least 1 GB.

file1.csv

[01/Nov/2019:09:54:26 +0900] ","","102.12.14.22","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","164.16.37.75","52.222.194.116","200","CONNECT","http://www.google.com:443","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","192.10.77.95","21.323.12.96","200","CONNECT","http://www.wakers.com/sg/wew/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","197.99.94.32","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","157.87.34.72","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"

file2.txt
1 www.amazon.com shop
1 wakers.com shop

Script:
import csv
with open("file1.csv", 'r') as f:
    reader = csv.reader(f)
    for k in reader:
        ko = set()
        srcip = k[2]
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split('?')[0]
        ko.add((war,srcip))
        for to in ko:
            with open("file2.txt", "r") as f:
                all_val = set()
                for i in f:
                    val = i.strip().split(" ")[1]
                    if val in to[0]:
                        all_val.add(to)
                for ki in all_val:
                    print(ki)

My output:
('www.amazon.com', '102.12.14.22')
('www.amazon.com', '167.27.14.62')
('www.wakers.com', '192.10.77.95')
('www.amazon.com', '167.27.14.62')
('www.amazon.com', '197.99.94.32')
('www.amazon.com', '157.87.34.72')

When the URL is the same, how do I collect all of its values with duplicates removed?

How can I get a result like this?
amazon.com    102.12.14.22
              167.27.14.62
              197.99.94.32
              157.87.34.72
wakers.com    192.10.77.95

Best answer

Short answer: you can't do this directly. Well, you can, but performance will be poor.
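For reference, the direct approach can be sketched roughly as follows. This is only a sketch under assumptions, not the answerer's code: the helper names (load_domains, host_of, group_ips_by_domain) are made up for illustration, the column positions (2 for the source IP, 6 for the URL) and the substring match are taken from the question's script, and file2.txt is read once rather than once per CSV row.

```python
import csv
from collections import defaultdict

def load_domains(lines):
    """Return the second whitespace-separated field of each line of file2.txt."""
    domains = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            domains.append(parts[1])
    return domains

def host_of(url):
    """Strip scheme, path, query string and port, like the question's script."""
    return url.split("//")[-1].split("/")[0].split("?")[0].split(":")[0]

def group_ips_by_domain(csv_lines, domains):
    """Map each listed domain to the unique source IPs seen for it, in one pass."""
    grouped = defaultdict(dict)  # dict keys serve as an insertion-ordered set
    for row in csv.reader(csv_lines):
        if len(row) < 7:
            continue  # skip rows that lack the expected columns
        src_ip, host = row[2], host_of(row[6])
        for domain in domains:
            if domain in host:  # same substring test as in the question
                grouped[domain][src_ip] = None
                break
    return {domain: list(ips) for domain, ips in grouped.items()}

def report(csv_path, domains_path):
    """Print one line per domain, followed by its unique source IPs."""
    with open(domains_path) as f:
        domains = load_domains(f)
    with open(csv_path, newline="") as f:
        grouped = group_ips_by_domain(f, domains)
    for domain, ips in grouped.items():
        print(domain, *ips)
```

Calling report("file1.csv", "file2.txt") on the sample data should list four unique IPs for www.amazon.com and one for wakers.com. The keys are spelled as in file2.txt, so the first one appears as www.amazon.com rather than amazon.com. The whole CSV is still scanned once, so a 1 GB file will take a while.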

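CSV is a fine storage format, but for this kind of work you may want to keep everything in a separate, custom data file. You could first parse the file so that it holds only unique IDs instead of long strings (for example amazon = 0, wakers = 1, and so on), which improves performance and makes comparisons cheaper.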
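The ID idea can be sketched as below. IdTable and encode_pairs are hypothetical helpers written for this illustration, not part of any library. Each distinct string gets a small integer the first time it is seen, so later comparisons and set lookups work on integers.

```python
class IdTable:
    """Assigns a small integer ID to each distinct string (hypothetical helper)."""

    def __init__(self):
        self._ids = {}    # string -> ID
        self._names = []  # ID -> string

    def id_for(self, name):
        """Return the ID for name, allocating the next free one if it is new."""
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name_for(self, ident):
        """Return the original string for a previously assigned ID."""
        return self._names[ident]

def encode_pairs(pairs, hosts, ips):
    """Turn (host, ip) string pairs into a set of unique (host_id, ip_id) pairs."""
    return {(hosts.id_for(host), ips.id_for(ip)) for host, ip in pairs}
```

For example, with fresh tables, the pairs ("amazon", "102.12.14.22") and ("wakers", "192.10.77.95") are encoded as (0, 0) and (1, 1), and a repeated pair collapses into the existing entry. Whether this pays off depends on how often the same data is compared; for a single pass over the CSV, building the tables costs about as much as the comparison itself.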

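In fact, for a CSV that keeps changing, it may also be worth memory-mapping the file or building a database from the CSV (make your changes in the database and dump back to CSV only when needed).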
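A minimal sketch of the database route, using the sqlite3 module from the standard library. The table name requests, the column names, and the helpers build_db and unique_ips are assumptions made for this illustration; the column positions and the substring match again follow the question's script.

```python
import csv
import sqlite3

def host_of(url):
    """Strip scheme, path, query string and port from a URL."""
    return url.split("//")[-1].split("/")[0].split("?")[0].split(":")[0]

def build_db(csv_lines, db_path=":memory:"):
    """Load (source IP, host) pairs from the CSV rows into an SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS requests (src_ip TEXT, host TEXT)")
    rows = (
        (row[2], host_of(row[6]))
        for row in csv.reader(csv_lines)
        if len(row) >= 7  # skip rows that lack the expected columns
    )
    conn.executemany("INSERT INTO requests VALUES (?, ?)", rows)
    conn.commit()
    return conn

def unique_ips(conn, domain):
    """Return the distinct source IPs whose host contains domain, sorted as text."""
    cur = conn.execute(
        "SELECT DISTINCT src_ip FROM requests "
        "WHERE instr(host, ?) > 0 ORDER BY src_ip",
        (domain,),
    )
    return [row[0] for row in cur]
```

Passing a file path as db_path instead of ":memory:" keeps the database on disk, so the 1 GB CSV only has to be parsed once and later lookups run against the database. The substring test scans the whole table; if exact host matches are enough, an index on host with a host = ? query would likely be faster.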

See "How do quickly search through a .csv file in Python" for a more complete answer.

Regarding "python - How to get unique results for the same name with multiple values in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59819281/
