gpt4 book ai didi

postgresql - 将 pandas 数据框批量插入 PostgreSQL 表的最佳方法

转载 作者:行者123 更新时间:2023-11-29 12:02:36 25 4
gpt4 key购买 nike

我需要将多个 excel 文件上传到 postgresql 表,但它们可以在多个寄存器中相互重叠,因此我需要注意 IntegrityErrors。我遵循两种方法:

cursor.copy_from:最快的方法,但我不知道如何捕获和控制所有由于重复寄存器引起的完整性错误

streamCSV = StringIO()
streamCSV.write(invoicing_info.to_csv(index=None, header=None, sep=';'))
streamCSV.seek(0)

with conn.cursor() as c:
c.copy_from(streamCSV, "staging.table_name", columns=dataframe.columns, sep=';')
conn.commit()

cursor.execute:我可以计算并处理每个异常,但它非常慢的。

data = invoicing_info.to_dict(orient='records')

with cursor as c:
for entry in data:
try:
c.execute(DLL_INSERT, entry)
successful_inserts += 1
connection.commit()
print('Successful insert. Operation number {}'.format(successful_inserts))
except psycopg2.IntegrityError as duplicate:
duplicate_registers += 1
connection.rollback()
print('Duplicate entry. Operation number {}'.format(duplicate_registers))

在例程结束时,我需要确定以下信息:

print("Initial shape: {}".format(invoicing_info.shape))
print("Successful inserts: {}".format(successful_inserts))
print("Duplicate entries: {}".format(duplicate_registers))

如何修改第一种方法来控制所有异常?如何优化第二种方法?

最佳答案

当您在不同的 Excel 工作表中有重复的 ID 时,您必须自己回答如何决定信任哪个 Excel 工作表中的数据?

当您使用多个表时,将使用方法从冲突对中获取至少一行,您始终可以执行以下操作:

  • 为每个 Excel 工作表创建临时表
  • 将数据上传到 Excel 工作表的每个表格(就像您现在批量上传一样)
  • 以某种方式从 select picking distinct on(id) 插入一个:

INSERT INTO staging.table_name(id, col1, col2 ...)
SELECT DISTINCT ON(id)
id, col1, col2
FROM
(
SELECT id, col1, col2 ...
FROM staging.temp_table_for_excel_sheet1
UNION
SELECT id, col1, col2 ...
FROM staging.temp_table_for_excel_sheet2
UNION
SELECT id, col1, col2 ...
FROM staging.temp_table_for_excel_sheet3
) as data

使用这样的插入 postgreSQL 将从非唯一 ID 集中取出随机行。

如果您想信任第一条记录,您可以添加一些顺序:

INSERT INTO staging.table_name(id, col1, col2 ...)
SELECT DISTINCT ON(id)
id, ordering_column col1, col2
FROM
(
SELECT id, 1 as ordering_column, col1, col2 ...
FROM staging.temp_table_for_excel_sheet1
UNION
SELECT id, 2 as ordering_column, col1, col2 ...
FROM staging.temp_table_for_excel_sheet2
UNION
SELECT id, 3 as ordering_column, col1, col2 ...
FROM staging.temp_table_for_excel_sheet3
) as data
ORDER BY ordering_column

对于对象的初始计数:

SELECT sum(count)
FROM
(
SELECT count(*) as count FROM temp_table_for_excel_sheet1
UNION
SELECT count(*) as count FROM temp_table_for_excel_sheet2
UNION
SELECT count(*) as count FROM temp_table_for_excel_sheet3
) as data

完成此批量插入后,您可以运行 select count(*) FROM staging.table_name 以获得插入记录总数的结果

对于重复计数,您可以运行:

SELECT sum(count)
FROM
(
SELECT count(*) as count
FROM temp_table_for_excel_sheet2 WHERE id in (select id FROM temp_table_for_excel_sheet1 )

UNION

SELECT count(*) as count
FROM temp_table_for_excel_sheet3 WHERE id in (select id FROM temp_table_for_excel_sheet1 )
)

UNION

SELECT count(*) as count
FROM temp_table_for_excel_sheet3 WHERE id in (select id FROM temp_table_for_excel_sheet2 )
) as data

关于postgresql - 将 pandas 数据框批量插入 PostgreSQL 表的最佳方法,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/48752716/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com