
Efficient queries over large multiple lists in Python


I'm trying to create an example of how to manipulate a huge database made of CSV tables using only Python.

I'd like to find a way to emulate efficient indexed queries on tables spread across several list() objects.

The example below takes 24 seconds on a 3.2 GHz Core i5:

#!/usr/bin/env python
import csv
MAINDIR = "../"
pf = open (MAINDIR+"atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open (MAINDIR+"atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]
for i in rankings[:10]:
    player = filter(lambda x: x[0]==i[2],players)[0]
    print "%s(%s),(%s) Points: %s"%(player[2],player[5],player[3],i[3])

For this dataset.

A more efficient or more Pythonic approach would be much appreciated.

Best Answer

You can use itertools.islice instead of reading all the rows, and use itertools.ifilter:

import csv
from itertools import islice,ifilter

MAINDIR = "../"
with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    # only get the first ten rows using islice
    for i in islice(rankings, None, 10):
        # ifilter won't create a list, it yields values on the fly
        player = next(ifilter(lambda x: x[0] == i[2], players), "")

Not quite sure what filter(lambda x: x[0]==i[2],players)[0] is doing; you seem to be searching the whole players list for every ranking row and keeping only the first match. It may well be worth either sorting the list once by the first element and searching it with bisect (sketched below), or building a dict with the first element as key and the row as value and then simply doing lookups.
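For the bisect alternative, here is a minimal sketch (hypothetical, not part of the original answer; find_player is an illustrative helper, and the file is the same atp_players.csv):

import csv
from bisect import bisect_left

MAINDIR = "../"
with open(MAINDIR + "atp_players.csv") as pf:
    # one-off O(n log n) sort by the first column (the player id)
    players = sorted(csv.reader(pf), key=lambda row: row[0])
keys = [row[0] for row in players]  # parallel key list for bisect to search

def find_player(player_id):
    # O(log n) binary search per lookup
    i = bisect_left(keys, player_id)
    if i < len(keys) and keys[i] == player_id:
        return players[i]
    return None

The dict version, which is what gets benchmarked below, looks like this: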

import csv
from itertools import islice,ifilter
from collections import OrderedDict

MAINDIR = "../"
with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = OrderedDict((row[0], row) for row in csv.reader(pf))
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        # now constant work getting a row, as opposed to O(n)
        player = players.get(i[2])

What default value to use, or whether you need one at all, is up to you.
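For example, to fall back to a placeholder row rather than None for an unknown id (the placeholder below is purely illustrative and just mirrors the players.csv column layout):

UNKNOWN = ["?", "", "Unknown", "", "", ""]  # hypothetical fallback row
player = players.get(i[2], UNKNOWN)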

If you have repeated elements at the start of rows and only want to keep the first occurrence of each:

with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        # skip rows whose key we have already seen
        if key in players:
            continue
        players[key] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = players.get(i[2])

Output:

Djokovic(SRB),(R) Points: 11360
Federer(SUI),(R) Points: 9625
Nadal(ESP),(L) Points: 6585
Wawrinka(SUI),(R) Points: 5120
Nishikori(JPN),(R) Points: 5025
Murray(GBR),(R) Points: 4675
Berdych(CZE),(R) Points: 4600
Raonic(CAN),(R) Points: 4440
Cilic(CRO),(R) Points: 4150
Ferrer(ESP),(R) Points: 4045

Timing the code for ten players shows ifilter to be the fastest, but as we increase the number of rankings we will see the dict win, and how badly your code scales:

In [33]: %%timeit
MAINDIR = "tennis_atp-master/"
pf = open("/tennis_atp-master/atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open("/tennis_atp-master/atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]
for i in rankings[:10]:
    player = filter(lambda x: x[0]==i[2],players)[0]
....:
10 loops, best of 3: 123 ms per loop

In [34]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    # only get the first ten rows using islice
    for i in islice(rankings, None, 10):
        # ifilter won't create a list, it yields values on the fly
        player = next(ifilter(lambda x: x[0] == i[2], players), "")
....:
10 loops, best of 3: 43.6 ms per loop

In [35]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[row[0]] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = players.get(i[2])
        pass
....:
10 loops, best of 3: 50.7 ms per loop

Now, with 100 players, you can see the dict is as fast as it was for 10; the one-off cost of building the dict has been amortised by the constant-time lookups:

In [38]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    # only get the first hundred rows using islice
    for i in islice(rankings, None, 100):
        # ifilter won't create a list, it yields values on the fly
        player = next(ifilter(lambda x: x[0] == i[2], players), "")
....:
10 loops, best of 3: 120 ms per loop

In [39]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[row[0]] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 100):
        player = players.get(i[2])
        pass
....:
10 loops, best of 3: 50.7 ms per loop

In [40]: %%timeit
MAINDIR = "tennis_atp-master/"
pf = open("/tennis_atp-master/atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open("/tennis_atp-master/atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]
for i in rankings[:100]:
    player = filter(lambda x: x[0]==i[2],players)[0]
....:
1 loops, best of 3: 806 ms per loop

For 250 players:

# your code
1 loops, best of 3: 1.86 s per loop

# dict
10 loops, best of 3: 50.7 ms per loop

# ifilter
10 loops, best of 3: 483 ms per loop

A final test, looping over the whole rankings file:

# your code

1 loops, best of 3: 2min 40s per loop

# dict
10 loops, best of 3: 67 ms per loop

# ifilter
1 loops, best of 3: 1min 3s per loop

So you can see that as we loop over more and more rankings, the dict option is by far the most efficient at runtime, and it scales extremely well.

Regarding efficient queries over large multiple lists in Python, there is a similar question on Stack Overflow: https://stackoverflow.com/questions/29711646/
