
python - Efficiently write a Pandas dataframe to Google BigQuery

Reposted. Author: 太空狗. Last updated: 2023-10-29 20:26:58

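I'm trying to upload a pandas.DataFrame to Google BigQuery using the documented pandas.DataFrame.to_gbq() function. The problem is that to_gbq() takes 2.3 minutes, while uploading the same data directly to Google Cloud Storage takes less than a minute. I'm planning to upload a bunch of dataframes (~32), each of a similar size, so I want to know which alternative is faster.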

This is the script I'm using:

dataframe.to_gbq('my_dataset.my_table',
                 'my_project_id',
                 chunksize=None,  # I have tried several chunk sizes; it runs faster as one big chunk (at least for me)
                 if_exists='append',
                 verbose=False
                 )

dataframe.to_csv(str(month) + '_file.csv')  # the file size is 37.3 MB; this takes almost 2 seconds
# manually upload the file through the GCS GUI
print(dataframe.shape)
# Output: (363364, 21)

My question is: which is faster?

  1. Uploading the DataFrame with the pandas.DataFrame.to_gbq() function
  2. Saving the DataFrame as a CSV and then uploading it as a file to BigQuery with the Python API
  3. Saving the DataFrame as a CSV, uploading the file to Google Cloud Storage, and then reading it from BigQuery
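
One way to compare the three alternatives on equal terms is to wrap each in a zero-argument callable and average the wall-clock time over a few runs. The following is a minimal sketch using only the standard library; the names average_seconds and compare are hypothetical helpers, not part of pandas or any Google client library.

```python
import time


def average_seconds(func, repeats=3):
    """Return the mean wall-clock time of calling func() `repeats` times."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def compare(candidates, repeats=3):
    """Time every labelled callable and return (label, seconds) pairs, fastest first."""
    timings = {
        label: average_seconds(func, repeats)
        for label, func in candidates.items()
    }
    return sorted(timings.items(), key=lambda item: item[1])
```

Each alternative would be passed as something like {"alternative 1": lambda: dataframe.to_gbq(...)}. Because every run appends to the destination table, it is probably worth pointing the runs at a scratch table.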

Update:

Alternative 1 seems faster than alternative 2 (using pd.DataFrame.to_csv() and load_data_from_file() took 17.9 seconds more on average over 3 loops):

from google.cloud import bigquery


def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job_config = bigquery.LoadJobConfig()
        # 'text/csv' is a MIME type, not a BigQuery source format; use the enum.
        job_config.source_format = bigquery.SourceFormat.CSV
        job_config.autodetect = True
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)

    job.result()  # Waits for job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))
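
load_data_from_file() goes through a CSV file on disk. Recent versions of google-cloud-bigquery also offer Client.load_table_from_dataframe(), which submits the DataFrame as a single load job without an intermediate CSV. It was not timed in this thread, so the sketch below is only a possible fourth candidate to benchmark. It assumes google-cloud-bigquery and pyarrow are installed; upload_dataframe and the table ID are hypothetical.

```python
def upload_dataframe(client, dataframe, table_id):
    """Load a DataFrame into BigQuery with one load job and return the row count.

    `client` is expected to be a google.cloud.bigquery.Client and `table_id`
    a 'project.dataset.table' string (both are assumptions of this sketch).
    """
    job = client.load_table_from_dataframe(dataframe, table_id)
    job.result()  # block until the load job finishes
    return job.output_rows


def main():
    # Imported here so the helper above can be read without the libraries installed.
    import pandas as pd
    from google.cloud import bigquery

    client = bigquery.Client()
    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    # Hypothetical destination table.
    rows = upload_dataframe(client, df, "my_project_id.my_dataset.my_table")
    print("Loaded {} rows".format(rows))
```

main() is not called automatically, since it needs credentials and a real project.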

Best answer

I compared alternatives 1 and 3 in Datalab using the following code:

from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
from pandas import DataFrame
import time

# Dataframe to write
# Tuples keep the column order fixed; sets are unordered.
my_data = [(1, 2, 3)]
for i in range(0, 100000):
    my_data.append((1, 2, 3))
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

# Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable',
                               Context.default().project_id,
                               chunksize=10000,
                               if_exists='append',
                               verbose=False
                               )
end = time.time()
print("time alternative 1 " + str(end - start))

# Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create or overwrite the existing table if it exists
# Assumption: the posted snippet never defined `table`; the bq.Table(...) and
# table.create(...) lines below follow the Datalab import/export tutorial.
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table.create(schema=table_schema, overwrite=True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))
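
Outside Datalab, the idea behind alternative 3 (write a CSV, upload it to Cloud Storage, load it into BigQuery from there) can be sketched with the google-cloud-storage and google-cloud-bigquery clients. This is an illustration under assumptions, not the code that produced the timings in this answer: load_csv_via_gcs is a hypothetical helper, and the bucket, object, file, and table names are placeholders.

```python
def load_csv_via_gcs(storage_client, bigquery_client, job_config,
                     csv_path, bucket_name, blob_name, table_id):
    """Upload a local CSV to GCS, then load it into BigQuery from its gs:// URI."""
    blob = storage_client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(csv_path)

    uri = "gs://{}/{}".format(bucket_name, blob_name)
    job = bigquery_client.load_table_from_uri(
        uri, table_id, job_config=job_config)
    job.result()  # wait for the load job to finish
    return job.output_rows


def main():
    # Imported here so the helper can be read without the client libraries installed.
    from google.cloud import bigquery, storage

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row written by DataFrame.to_csv()
        autodetect=True,
    )
    rows = load_csv_via_gcs(
        storage.Client(), bigquery.Client(), job_config,
        csv_path="1_file.csv",                         # hypothetical local file
        bucket_name="my-example-bucket",               # hypothetical bucket
        blob_name="uploads/1_file.csv",                # hypothetical object name
        table_id="my_project_id.my_dataset.my_table",  # hypothetical table
    )
    print("Loaded {} rows".format(rows))
```

The clients are passed in as arguments so the same helper can be timed repeatedly without re-authenticating on every call.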

Here are the results for n = {10000, 100000, 1000000}:

n        alternative_1  alternative_3
10000    30.72s         8.14s
100000   162.43s        70.64s
1000000  1473.57s       688.59s

Judging from the results, alternative 3 is faster than alternative 1.
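
To put a number on the gap, the ratio alternative_1 / alternative_3 can be computed from the timings reported above. It comes out at roughly 3.8x for n = 10000, 2.3x for n = 100000, and 2.1x for n = 1000000. The short sketch below only redoes that arithmetic on the reported figures; the variable and function names are illustrative.

```python
# Timings in seconds, copied from the results above: n -> (alternative_1, alternative_3)
results = {
    10000: (30.72, 8.14),
    100000: (162.43, 70.64),
    1000000: (1473.57, 688.59),
}


def speedups(table):
    """Return how many times faster alternative 3 was than alternative 1 for each n."""
    return {n: alt1 / alt3 for n, (alt1, alt3) in table.items()}


ratios = speedups(results)
for n, ratio in ratios.items():
    print("n={}: alternative 3 is {:.1f}x faster".format(n, ratio))
```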

For more on python - Efficiently write a Pandas dataframe to Google BigQuery, see the similar question on Stack Overflow: https://stackoverflow.com/questions/48886761/
