
python - psycopg2 equivalent of R's dbWriteTable command, and getting more performance from Python code

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 11:06:46

I just ran a timing test: load an array of data from a csv, create a database on Postgres, and then write the table to it with both Python and R.

I was surprised that the times were very similar:

First the Python code (as an example):

import timeit
tic = timeit.default_timer()
tic4 = timeit.default_timer()
import xlrd as xl
import psycopg2 as pq
import os
import pandas as pd
import numpy as np
import csv
from pprint import pprint as pp

perf_dir = '/myhomedir'
toc4=timeit.default_timer()

# Create the database
tic1= timeit.default_timer()
os.system('dropdb ptest')
os.system('createdb ptest')
# connect to the database
cn = pq.connect('dbname=ptest user=me')
cr = cn.cursor()
toc1=timeit.default_timer()

# Create the tables:
### load csvs
tic2=timeit.default_timer()
id_files = ('di1','di2','di.c')
id_files = [i+r'.csv' for i in id_files]
id1 = csv.reader(open(os.path.join(perf_dir,id_files[0])),delimiter='\t')
id1 = [i for i in id1]

id2 = csv.reader(open(os.path.join(perf_dir,id_files[1])))
id2 = [i for i in id2]

id3 = csv.reader(open(os.path.join(perf_dir,id_files[2])),delimiter='\t')
id3 = [i for i in id3]
id3 = [i[1:3] for i in id3]
toc2=timeit.default_timer()

# create tables and fill
### id1 fund classifications
tic3=timeit.default_timer()
cr.execute('CREATE TABLE id1 (%s varchar, %s int PRIMARY KEY, %s int, %s int, %s varchar)' % tuple(id1[0]))
FLDS = 'INSERT INTO id1 (%s,%s,%s,%s,%s) VALUES' % tuple(id1[0])
SQL = FLDS + ' (%s,%s,%s,%s,%s)'
for i in range(1, len(id1)):
    data = tuple(id1[i])
    cr.execute(SQL, data)

### id2 portfolio group classifications - reference only
cr.execute('CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)' % tuple(id2[0]))
SQL = 'INSERT INTO id2 (%s,%s,%s) VALUES' % tuple(id2[0]) + ' (%s,%s,%s)'
for i in range(1, len(id2)):
    data = tuple(id2[i])
    cr.execute(SQL, data)

### id3 value variable classifications
cr.execute('CREATE TABLE id3 (%s varchar,%s varchar)' % tuple(id3[0]))
SQL = 'INSERT INTO id3 VALUES(%s,%s)'
for i in range(1, len(id3)):
    data = tuple(id3[i])
    cr.execute(SQL, data)

cn.commit()

# Timing block - will be commented out in final code
toc3=timeit.default_timer()
toc = timeit.default_timer()
time = (toc - tic)
time1 = toc1 - tic1
time2 = toc2 - tic2
time3 = toc3 - tic3
time4 = toc4 - tic4
print('Overall time: %s' % time)
print('dB create & connect time: %s' % time1)
print('Load id csvs time: %s' % time2)
print('Create tables and write to db time: %s' % time3)
print('Time to import libraries: %s' % time4)

And the R code (much cleaner):

tic = proc.time()
library(RPostgreSQL)

tic1 = proc.time()
system('dropdb ptest1')
system('createdb ptest1')
drv = dbDriver("PostgreSQL")
con = dbConnect(drv, dbname='ptest1')
toc1 = proc.time()
time1 = toc1 - tic1

tic2 = proc.time()
id.1 = read.csv('/myhomedir/di1.csv',stringsAsFactors=F,sep='\t')
id.2 = read.csv('/myhomedir/di2.csv',stringsAsFactors=F)
id.3 = read.csv('/myhomedir/di.c.csv',stringsAsFactors=F,sep='\t')
id.3 = id.3[,-1]
toc2 = proc.time()
time2 = toc2 - tic2

tic3 = proc.time()
dbWriteTable(con,'id1',id.1)
dbWriteTable(con,'id2',id.2)
dbWriteTable(con,'id3',id.3)
toc3 = proc.time()
time3 = toc3 - tic3

toc = proc.time()
time = toc - tic
tyme = rbind(time1,time2,time3,time)
tyme = data.frame(Function=c('Create & Connect to DB',"Load CSV's for save","Write Table to DB",'Overall Time'),tyme)

I was surprised the two timings were so close. (I had read a lot of commentary saying that R is slow and Python is very fast.)

For Python:

Overall time: 2.48381304741
dB create & connect time: 1.96832108498
Load id csvs time: 0.000378847122192
Create tables and write to db time: 0.35303401947
Time to import libraries: 0.162075042725

For R:

                          Function user.self sys.self elapsed user.child sys.child
time1   Create & Connect to DB         0.112    0.016   1.943       0.06     0.004
time2   Load CSV's for save            0.008    0.000   0.006       0.00     0.000
time3   Write Table to DB              0.096    0.004   0.349       0.00     0.000
time    Overall Time                   0.376    0.028   2.463       0.06     0.004

I wonder whether this has to do with the fact that my Python version INSERTs rows into the table one at a time.

Hence the main question: is there an equivalent in Python of the dbWriteTable block in the R code, and would it speed things up?

A secondary question: are there any obvious errors in the code that might be slowing it down?

Happy to provide sample csvs if that would help.

I'm not looking to start an R vs. Python flame war; I just want to know how to make my code faster.

Thanks

Best Answer

Try this:

id2 = csv.reader(open(os.path.join(perf_dir, id_files[1])))
h = tuple(next(id2))

create = '''CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)''' % h
insert = '''INSERT INTO id2 (%s, %s, %s) VALUES (%%s, %%s, %%s)''' % h

...

cr.executemany(insert, id2)

That sends the SQL command only once for all the rows. Beyond the overhead of issuing extra SQL commands, note that the combined size in bytes of the SQL strings themselves probably dwarfs the size of the real data under test.
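The header-consumption and placeholder-escaping trick can be sketched without a live database (the column names and data below are made up for illustration; with psycopg2, `cr.executemany(insert, id2)` would then bind each remaining csv row in one call):

```python
import csv
import io

# Hypothetical stand-in for di2.csv: a header row followed by data rows
buf = io.StringIO("name,id,grp\nfoo,1,10\nbar,2,10\n")
id2 = csv.reader(buf)
h = tuple(next(id2))  # consume the header once; id2 now yields only data rows

# %s fills in the column names; %%s survives the formatting as a literal
# %s placeholder for psycopg2 to bind values against later.
create = 'CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)' % h
insert = 'INSERT INTO id2 (%s, %s, %s) VALUES (%%s, %%s, %%s)' % h

print(create)
print(insert)
# With a live connection: cr.executemany(insert, id2)
```

Note that only the placeholder string is built up front; the rows themselves are never concatenated into SQL text.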

Using id2 as a generator, instead of materialising it into a list, should also avoid loading the whole dataset into memory.
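For the closest analogue to R's dbWriteTable, psycopg2 also exposes Postgres's COPY protocol via cursor.copy_from, which streams a whole file-like object to the server in a single round trip. A minimal sketch, assuming made-up data and the id1 table from the question (the actual copy call needs a live connection, so it is shown commented out):

```python
import csv
import io

# Hypothetical tab-separated data standing in for di1.csv
src = io.StringIO("name\tid\ta\tb\tc\nfoo\t1\t2\t3\tx\n")
reader = csv.reader(src, delimiter='\t')
header = next(reader)  # skip the header row; COPY takes data rows only

# Re-serialise the data rows into a tab-separated buffer for COPY
buf = io.StringIO()
for row in reader:
    buf.write('\t'.join(row) + '\n')
buf.seek(0)

print(buf.getvalue())
# With a live connection and an existing id1 table:
#   cr.copy_from(buf, 'id1', sep='\t')
#   cn.commit()
```

COPY is what database bulk loaders typically use under the hood, so for larger tables it usually beats both per-row execute and executemany.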

The original question and answers for "psycopg2 equivalent of R dbWriteTable command and getting more performance from python code" are on Stack Overflow: https://stackoverflow.com/questions/17159841/
