pycharm - Connecting to AWS Athena over JDBC with PyCharm - fetchSize issue

Reposted by: 行者123. Updated: 2023-12-02 18:44:15

I have connected to AWS Athena using the PyCharm Pro edition. The connection succeeds, but whenever I run a query I get:

The requested fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try again. Refer to the Athena documentation for valid fetchSize values.

I downloaded the Athena JDBC driver from the AWS Athena JDBC documentation.

What could be the problem?

Best answer

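One thing to consider regarding fetch size, JDBC, and AWS Athena: there appears to be a semi-documented but well-known limit of 1000 rows per fetch. I know the popular PyAthenaJDBC library sets this as its default fetch size. So that may be part of your problem.

I can reproduce the fetch size error when I try to fetch more than 1000 rows at a time.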

from pyathenajdbc import connect

conn = connect(s3_staging_dir='s3://SOMEBUCKET/',
               region_name='us-east-1')
cur = conn.cursor()
cur.execute('SELECT * FROM SOMEDATABASE.big_table LIMIT 5000')
results = cur.fetchall()
print(len(results))
# Note: The cursor class actually has a setter method to
#       keep users from setting illegal fetch sizes.
#       Writing to the private attribute bypasses that check.
cur._arraysize = 1001  # Set array size one greater than the default
cur.execute('SELECT * FROM athena_test.big_table LIMIT 5000')
results = cur.fetchall()  # Generates the error shown below

java.sql.SQLExceptionPyRaisable: java.sql.SQLException: The requested fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try again. Refer to the Athena documentation for valid fetchSize values.
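
To stay under that limit with a generic DB-API 2.0 cursor, one option is to pull rows in batches with `fetchmany`. The sketch below is only an illustration, not PyAthenaJDBC's own implementation. `iter_rows` and `MAX_FETCH_SIZE` are hypothetical names, an in-memory `sqlite3` database stands in for Athena so the snippet runs without AWS credentials, and it assumes the driver honors the batch size passed to `fetchmany`.

```python
import sqlite3

MAX_FETCH_SIZE = 1000  # Per-fetch row limit quoted in the answer (assumed here)


def iter_rows(cursor, batch_size=MAX_FETCH_SIZE):
    """Yield rows from an executed DB-API cursor, batch_size rows per fetch."""
    if not 1 <= batch_size <= MAX_FETCH_SIZE:
        raise ValueError('batch_size must be between 1 and %d' % MAX_FETCH_SIZE)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # An empty list means the result set is exhausted
            break
        for row in batch:
            yield row


# Stand-in for an Athena connection: a throwaway in-memory SQLite table.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE big_table (n INTEGER)')
cur.executemany('INSERT INTO big_table VALUES (?)',
                [(i,) for i in range(2500)])
cur.execute('SELECT n FROM big_table')

total = sum(1 for _ in iter_rows(cur))
print(total)  # 2500 rows, read in three fetches of at most 1000 rows
```

With a real Athena cursor, only the connection setup would change. The generator keeps memory use bounded because it never holds more than one batch at a time.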

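Potential solutions include:

Run the query in the web GUI, then download the result set manually.
Develop the query in your editor/IDE of choice (DataGrip, the Athena web GUI, etc.) and pass the query string to Athena through the Python SDK. You can then wait for the query to finish and fetch the result set.
Execute the query yourself and paginate through the results.
If you are calling SQL from Python (which I infer from the PyCharm tag), you can use a library like PyAthenaJDBC, which handles the page sizing for you (see the example above).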
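
The pagination option can be sketched with the boto3 Athena client, whose `get_query_results` operation returns at most 1000 rows per call and has a matching paginator. This is a minimal sketch under stated assumptions: `fetch_all_rows` is a hypothetical helper, `client` is expected to be `boto3.client('athena')`, and `execution_id` must belong to a query that has already succeeded.

```python
def fetch_all_rows(client, execution_id, page_size=1000):
    """Collect every row of a finished Athena query, one page at a time.

    client is assumed to be boto3.client('athena'); page_size may not
    exceed 1000, the maximum that get_query_results accepts per call.
    """
    if not 1 <= page_size <= 1000:
        raise ValueError('page_size must be between 1 and 1000')
    paginator = client.get_paginator('get_query_results')
    pages = paginator.paginate(
        QueryExecutionId=execution_id,
        PaginationConfig={'PageSize': page_size})
    rows = []
    for page in pages:
        for row in page['ResultSet']['Rows']:
            # Each cell is a dict such as {'VarCharValue': 'abc'};
            # a NULL cell has no 'VarCharValue' key, so .get() yields None.
            rows.append([cell.get('VarCharValue') for cell in row['Data']])
    return rows


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# rows = fetch_all_rows(boto3.client('athena'), 'SOME-QUERY-EXECUTION-ID')
# header, data = rows[0], rows[1:]  # for SELECT queries the first row holds column names
```

Every value comes back as a string (or `None`), so type conversion is left to the caller.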

For many of my Python scripts, I use a workflow similar to the following.

import boto3
import time

sql = 'SELECT * from athena_test.big_table'

database = 'SOMEDATABASE'
bucket_name = 'SOMEBUCKET'
output_path = '/home/zerodf/temp/somedata.csv'

client = boto3.client('athena')
config = {'OutputLocation': 's3://' + bucket_name + '/',
          'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}}

# Submit the query; Athena writes the result set to the S3 output location.
execution_results = client.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={'Database': database},
    ResultConfiguration=config)

execution_id = execution_results['QueryExecutionId']
# Athena names the result file after the query execution ID.
remote_file = execution_id + '.csv'

# Poll until the query reaches a terminal state.
while True:
    query_execution_results = client.get_query_execution(
        QueryExecutionId=execution_id)
    state = query_execution_results['QueryExecution']['Status']['State']
    if state == 'SUCCEEDED':
        break
    if state in ('FAILED', 'CANCELLED'):
        # Without this check the loop would never exit for a failed query.
        raise RuntimeError('Athena query ended in state ' + state)
    time.sleep(60)

# Download the CSV result file from the staging bucket.
s3 = boto3.resource('s3')
s3.Bucket(bucket_name).download_file(remote_file, output_path)
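
The waiting step in this workflow could also be pulled into a small helper with a shorter poll interval and a timeout, so a stuck query does not block the script forever. This is a rough sketch, not part of boto3: `wait_for_query`, its parameters, and the default intervals are all hypothetical, and `client` is assumed to be `boto3.client('athena')`.

```python
import time

# Terminal states reported by Athena's get_query_execution.
TERMINAL_STATES = ('SUCCEEDED', 'FAILED', 'CANCELLED')


def wait_for_query(client, execution_id, poll_seconds=5,
                   timeout_seconds=600, sleep=time.sleep):
    """Poll Athena until the query reaches a terminal state and return it.

    Raises TimeoutError if the query is still queued or running after
    roughly timeout_seconds. The sleep argument exists only so tests can
    replace time.sleep with a no-op.
    """
    waited = 0
    while True:
        response = client.get_query_execution(QueryExecutionId=execution_id)
        state = response['QueryExecution']['Status']['State']
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError('query %s still %s after %d seconds'
                               % (execution_id, state, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# state = wait_for_query(boto3.client('athena'), execution_id)
# if state != 'SUCCEEDED':
#     raise RuntimeError('Athena query ended in state ' + state)
```

Returning the state instead of raising on failure leaves the caller free to inspect the failure reason before deciding what to do.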