python - R package teradatasql dbGetQuery takes forever to return on larger data, while Python works extremely fast


I have a database table from which I am trying to fetch 5+ million rows of two columns.

The following Python code works perfectly and fast (it takes about 3 minutes to retrieve all 5+ million rows via the query and write them to CSV):

import pandas as pd
import teradatasql

hostname = "myhostname.domain.com"
username = "myusername"
password = "mypassword"

# Open an encrypted connection and pull the full result set into a DataFrame
with teradatasql.connect(host=hostname, user=username, password=password, encryptdata=True) as conn:
    df = pd.read_sql("SELECT COL1, COL2 FROM MY_TABLE", conn)

# mypath: output file path defined elsewhere
df.to_csv(mypath, sep='\t', index=False)

The following R code using the teradatasql package works for small values of the explicitly supplied row count n. However, when n gets large enough (and it really isn't that large), or when I ask it to retrieve the full 5+ million row dataset, it either takes an enormous amount of time or effectively never returns.

Any idea what is going on?

library(teradatasql)

dbconn <- DBI::dbConnect(
  teradatasql::TeradataDriver(),
  host = 'myhostname.domain.com',
  user = 'myusername', password = 'mypassword'
)

dbExecute(dbconn, "SELECT COL1, COL2 FROM MY_TABLE")
[1] 5348946

system.time(dbGetQuery(dbconn, "SELECT COL1, COL2 FROM MY_TABLE", n = 10))
   user  system elapsed
  0.084   0.016   1.496

system.time(dbGetQuery(dbconn, "SELECT COL1, COL2 FROM MY_TABLE", n = 100))
   user  system elapsed
  0.104   0.024   1.548

system.time(dbGetQuery(dbconn, "SELECT COL1, COL2 FROM MY_TABLE", n = 1000))
   user  system elapsed
  0.488   0.036   1.826

system.time(dbGetQuery(dbconn, "SELECT COL1, COL2 FROM MY_TABLE", n = 10000))
   user  system elapsed
  7.484   0.100   9.413

system.time(dbGetQuery(dbconn, "SELECT COL1, COL2 FROM MY_TABLE", n = 100000))
   user  system elapsed
767.824   4.648 782.518

system.time(dbGetQuery(dbconn, "SELECT COL1, COL2 FROM MY_TABLE", n = 5348946))
< DOES NOT RETURN IN HOURS >

Here is some version information for reference:

> packageVersion('teradatasql')
[1] ‘17.0.0.2’
> version
               _
platform       x86_64-pc-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          3
minor          6.1
year           2019
month          07
day            05
svn rev        76782
language       R
version.string R version 3.6.1 (2019-07-05)
nickname       Action of the Toes

Best Answer

The teradatasql driver is very slow at constructing a large data.frame in memory from the fetched result set rows.

For good fetch performance, you want to limit the number of rows fetched at a time from the result set.

res <- DBI::dbSendQuery(con, "select * from mytable")  # con: an open DBI connection
repeat {
  df <- DBI::dbFetch(res, n = 100)  # fetch at most 100 rows per round trip
  if (nrow(df) == 0) { break }      # an empty chunk means the result set is exhausted
}
DBI::dbClearResult(res)             # release the result set
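
The loop above fetches and discards each chunk. To assemble the full result and write it out like the Python version does, one option is to collect the chunks in a list and bind them once at the end. A minimal sketch, reusing the question's dbconn connection and mypath output path; the 100-row batch size follows the benchmark below:

res <- DBI::dbSendQuery(dbconn, "SELECT COL1, COL2 FROM MY_TABLE")
chunks <- list()
repeat {
  df <- DBI::dbFetch(res, n = 100)      # fetch the next batch
  if (nrow(df) == 0) break              # empty batch: result set exhausted
  chunks[[length(chunks) + 1]] <- df
}
DBI::dbClearResult(res)

# A single rbind at the end avoids repeatedly reallocating a growing data.frame
full <- do.call(rbind, chunks)
write.table(full, mypath, sep = "\t", row.names = FALSE, quote = FALSE)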

Below are the results of some informal performance tests that fetched rows from a two-column table consisting of an integer column and a varchar(100) column. Performance was best when fetching 100 rows at a time.

Fetched 100000 total rows (10 rows at a time) in 28.6985738277435 seconds, throughput = 3484.49371039225 rows/sec
Fetched 100000 total rows (50 rows at a time) in 23.4930009841919 seconds, throughput = 4256.58689016736 rows/sec
Fetched 100000 total rows (100 rows at a time) in 22.7485280036926 seconds, throughput = 4395.8888233897 rows/sec
Fetched 100000 total rows (500 rows at a time) in 24.1652879714966 seconds, throughput = 4138.16711466265 rows/sec
Fetched 100000 total rows (1000 rows at a time) in 25.222993850708 seconds, throughput = 3964.63641833672 rows/sec
Fetched 100000 total rows (2000 rows at a time) in 27.1710178852081 seconds, throughput = 3680.3921156903 rows/sec
Fetched 100000 total rows (5000 rows at a time) in 34.9067471027374 seconds, throughput = 2864.77567519197 rows/sec
Fetched 100000 total rows (10000 rows at a time) in 45.7679090499878 seconds, throughput = 2184.9370459721 rows/sec
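
The answer does not include the test harness that produced these numbers; a rough sketch of how one might reproduce a similar benchmark (assuming the question's dbconn connection and MY_TABLE, capping each run at 100,000 rows, and timing with Sys.time) is:

# Try a range of batch sizes and report throughput for each
for (batch in c(10, 50, 100, 500, 1000, 2000, 5000, 10000)) {
  res <- DBI::dbSendQuery(dbconn, "SELECT COL1, COL2 FROM MY_TABLE")
  total <- 0
  start <- Sys.time()
  repeat {
    df <- DBI::dbFetch(res, n = batch)
    total <- total + nrow(df)
    if (nrow(df) == 0 || total >= 100000) break  # stop at the row cap
  }
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  DBI::dbClearResult(res)
  cat(sprintf("Fetched %d total rows (%d rows at a time) in %.2f seconds, throughput = %.1f rows/sec\n",
              total, batch, elapsed, total / elapsed))
}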

The original question, "R package teradatasql dbGetQuery takes forever to return on larger data, while Python works extremely fast," can be found on Stack Overflow: https://stackoverflow.com/questions/62924646/
