I have this table:
我有一张桌子:
create table tab3(
id int not null auto_increment,
phrase text,
link_1 int,
link_2 int,
primary key (id),
foreign key (link_1) references tab1 (id),
foreign key (link_2) references tab2 (id));
I am inserting around 400k rows into this table with Python.
this is the insert statement:
我用Python在这个表中插入了大约400k行。以下是INSERT语句:
INSERT INTO tab3(phrase, link_1, link_2)
VALUES(
%s,
(select id from tab1 where tab1.col1 = %s),
(select id from tab2 where tab2.col2 = %s));
I have an index on both tables tab1.col1 and tab2.col2. but the insertion is taking a long time around 5mins/1000 row
我在两个表tab1.col1和tab2.col2上都有一个索引。但是插入大约需要5分钟/1000行的时间
I've tried many different techniques from the official docs of MySQL such as:
我从MySQL的官方文档中尝试了许多不同的技术,例如:
- using cursor.execute(stmt,param)
- using cursor.executemany(stmt, params)
- multiple processes ( billiard https://pypi.org/project/billiard/ )
- blocking the commit until all chunk of data is inserted and then commit changes
- encapsulating the insert stmt inside one transaction ( with START TRANSACTION )
But None of the above gave a good improvement.
但以上这些都没有带来很好的改善。
更多回答
You've set up your foreign key on tab3.link1 to reference tab1.id. But then you use a SELECT statement to get the id of the row which has the same link1 as the value you provide. Why not simplify this, and make link1 the primary key of tab1, and set up tab3.link1 as a foreign key of tab1.link1? Would make your insertion process simpler and likely faster.
您已经在tab3.Link1上设置了外键来引用tab1.id。但是,您可以使用SELECT语句获取与您提供的值具有相同Link1的行的ID。为什么不简化一下,将Link1设置为tab1的主键,并将tab3.Link1设置为tab1.Link1的外键?将使您的插入过程更简单,也可能更快。
@NickODell phrase column is text, which is not really suitable to be a primary key
@NickODell短语列是文本,不太适合作为主键
But the link_1 column is an int, no?
但是link_1列是一个整型,不是吗?
Which database do you use (MariaDB != MySQL) and which Python connector?
您使用哪个数据库(MariaDB!=MySQL)和哪个Python连接器?
@NickODell, thanks it was a typo I fix it ... tab3.link_1 refrence tab1.id and ta3.link_2 reference tab2.id
@Nickodell,谢谢,这是个打字错误,我把它修好了……Tab3.link_1引用tab1.id,ta3.link_2引用tab2.id
Since MySQL doesn't support native batch execution MySQL Connector/Python tries to rewrite the statement to a multiple value statement, e.g.
由于MySQL不支持本机批处理执行,MySQL连接器/Python尝试将语句重写为多值语句,例如
data= [(1,),(2,),(3,), ...(1000,)]
cursor.executemany("INSERT INTO t1 (a) VALUES(%s)", data)
will produce the statement:
将产生这样的声明:
INSERT INTO t1 (a) VALUES (1),(2),(3),...(1000)
With a subselect the statement will look like:
使用子选择,语句将如下所示:
INSERT INTO tab3(phrase, link_1, link_2)
VALUES("foo1", (select id from tab1 where tab1.col1 = int1_1),
(select id from tab2 where tab2.col2 = int2_1)),
VALUES("foo2", (select id from tab1 where tab1.col1 = int1_2),
(select id from tab2 where tab2.col2 = int2_2)),
...
VALUES("foo1000", (select id from tab1 where tab1.col1 = int1_1000),
(select id from tab2 where tab2.col2 = int2_1000)),
If you will run this statement with EXPLAIN
you will notice, that MySQL must execute 2000 subqueries, which is far away from optimal.
如果您将使用EXPLAIN运行此语句,您将注意到,MySQL必须执行2000个子查询,这远远不是最优的。
So instead create a temporary table and fill your data into temporary table:
因此,请创建一个临时表并将数据填充到临时表中:
cursor.execute("CREATE TEMPORARY TABLE lookup(phrase varchar(6000), col1 int unsigned, col2 int unsigned, unique index(col1), unique index(col2))")
cursor.executemany("INSERT INTO lookup VALUES (%s,%s,%s)", data);
Depending on the memory on the server you should also consider to use memory storage engine. Afterwards just insert the data from the temporary table:
根据服务器上的内存,您还应该考虑使用内存存储引擎。之后,只需插入临时表中的数据:
cursor.execute("INSERT INTO tab3 SELECT lookup.phrase, lookup.col1, lookup.col2 from lookup, tab1, tab2 where lookup.col2=tab2.col1 and lookup.col1=tab1.col1")
Some comments about your table definition:
关于您的表定义的一些备注:
- if the maximum size of phrase is 6000, then better use VARCHAR instead of TEXT.
- auto_increment columns should be defined as UNSIGNED.
更多回答
thanks for the answer. the insert statement in tab3 involves joining 3 tables which will introduce new delay
谢谢你的回答。表3中的INSERT语句涉及连接3个表,这将引入新的延迟
我是一名优秀的程序员,十分优秀!