
mysql - Python script updating MySQL reports too many affected rows

Reposted · Author: 行者123 · Updated: 2023-11-29 05:58:29

Good morning,

I'm writing a Python script to update MySQL: it pulls a value from one table and uses it to update another. The updates run in batches; the batch size is configurable and is currently 50K rows per update. Replication is checked between updates, with a short pause (0.01 s) before each check. I'm hitting two problems:

  1. Performance is acceptable at first and for many iterations, but the time to process each update grows without bound until the system appears to stop processing updates entirely. We start at roughly 2-3 seconds per update; on the last attempt this grew past 4 minutes, after about 4-6 hours of iterations. The MySQL server still responds, but the free buffers sit at 1 (not sure if that's related) and no further updates are processed, even though they are recorded in the script's log file.

  2. The affected-row count collected through a global variable is far larger than the number of rows actually updated in the system. My check is to take the number of iterations in the log file and multiply by 50K, which should equal the affected rows; the actual count is much lower.
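As an aside on the accounting in problem 2: the script below overwrites a single global with each batch's rowcount, so reconciling the total means counting log lines and multiplying. A running total per batch is easier to check against a final COUNT(*). A minimal sketch, with a stand-in cursor so it runs without a database (the FakeCursor and run_batches names are illustrative, not part of the original script):

```python
# Sum per-batch rowcount instead of overwriting a global, then compare
# the running total against a direct COUNT(*) after the run. The cursor
# here is a stand-in so the sketch is self-contained.

class FakeCursor(object):
    """Stand-in for a DB cursor: each execute() reports one batch's changed rows."""
    def __init__(self, batch_counts):
        self._batches = iter(batch_counts)
        self.rowcount = -1

    def execute(self, sql):
        self.rowcount = next(self._batches, 0)

def run_batches(cursor, update_sql):
    """Run the batched UPDATE until a batch changes no rows; return the total."""
    total = 0
    while True:
        cursor.execute(update_sql)
        if cursor.rowcount <= 0:
            break
        total += cursor.rowcount
    return total

# Three batches of 50K, 50K, 20K:
total = run_batches(FakeCursor([50000, 50000, 20000]), "UPDATE ...")
# total == 120000
```

Comparing that running total against `SELECT COUNT(*) ... WHERE orderedPanelOid IS NOT NULL` after the run would show directly whether the bookkeeping or the UPDATE itself is off.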

The script:

#!/usr/bin/python2.7

# Script Process:
# Py Script to populate all result.orderedPanelOid items from the ordered_panel_result_rel table.
# The UPDATEs are batched in 50K chunks, with replication checks between each batch.
# The largest tenant has about 1.3B result rows and can process a batch in 9-11 seconds.

import mysql.connector,time

#Core Logins
hcmaster_username = 'user_name'
hcmaster_password = 'pw'
hcmaster_hostname = 'host'
hcmaster_database = 'db'

#Tenant Logins
username = 'user_name'
password = 'pw'
hostname_master = 'master_host'
hostname_slave = 'slave_host'

#Define Class
class MySQLCursorDict(mysql.connector.cursor.MySQLCursor):
    def _row_to_python(self, rowdata, desc=None):
        row = super(MySQLCursorDict, self)._row_to_python(rowdata, desc)
        if row:
            return dict(zip(self.column_names, row))
        return None

#Define Record Count
def get_record_count(db_connection, query):
    count_cursor = db_connection.cursor()
    count_cursor.execute(query % "SELECT COUNT(*)")
    return count_cursor.fetchall()[0][0]

#Define Check Replication Delay
def check_replication_delay(mysql_connection):
    id_cursor = mysql_connection.cursor(cursor_class=MySQLCursorDict)
    id_cursor.execute("show slave status")
    try:
        slave_status = id_cursor.fetchall()[0]
        print "%s delay is %d" % (hostname_slave, int(slave_status['Seconds_Behind_Master']))
        return int(slave_status['Seconds_Behind_Master'])
    except:
        return 0

#Define UPDATE sequence
def update_sequence():

    # Check Replication
    time.sleep(.01)
    while check_replication_delay(slave) > 30:
        print "Delay is over 30 seconds on RO Slave. Sleeping for 15 seconds"
        time.sleep(15)

    # Run Query to UPDATE in batches.
    # The inner join is required so that only rows in result with a corresponding
    # rel row are attempted; without it the script loops endlessly because there
    # is no rel value to update result with.
    tenant_cursor.execute(
        "UPDATE %s.result rm inner join (select rm1.id,rm1.OrderedPanelOid,rr1.ResultId,rr1.OrderedPanelId FROM %s.result rm1 inner join %s.ordered_panel_result_rel rr1 on rm1.id = rr1.ResultId WHERE rm1.OrderedPanelOid IS NULL ORDER BY rm1.id LIMIT 50000) as SOURCE ON rm.id = SOURCE.id SET rm.OrderedPanelOid = SOURCE.orderedPanelId WHERE rm.id = SOURCE.ResultId" % (database_encoded, database_encoded, database_encoded))

    # Get affected rows
    global rowsaffected
    rowsaffected = tenant_cursor.rowcount
    print rowsaffected

    # Commit the UPDATEs
    tenant.commit()

#Collect All Databases
try:
    # Production Core Connection
    db = mysql.connector.connect(user=hcmaster_username, password=hcmaster_password,
                                 host=hcmaster_hostname, database=hcmaster_database,
                                 ssl_ca='/file/location')

    # Establish hcmaster Cursor Connection
    hc_cursor = db.cursor()

    # Get Database List -- Change this out when we go full tenant list!
    query = "select databaseName from tenant_environment where databaseName IN ('db1','db2');"
    hc_cursor.execute(query)
    databases = hc_cursor.fetchall()
    tenant_list = list(zip(*databases)[0])
    tenant_listStr = str(tenant_list)

    print "Databases to execute upon: " + tenant_listStr

except Exception as err:
    e = str(err)
    print "Something has gone wrong in collecting databases! " + e

for database in databases:
    databaseStr = str(database)
    database_encoded = databaseStr.strip()[3:-3]
    print "Trying " + database_encoded

    # Production Connections
    tenant = mysql.connector.connect(user=username, password=password, host=hostname_master,
                                     database=database_encoded, ssl_ca='/file/location')
    slave = mysql.connector.connect(user=username, password=password, host=hostname_slave,
                                    database=database_encoded, ssl_ca='/file/location')

    # Establish Cursor Connection
    tenant_cursor = tenant.cursor()

    # Collect Start Time
    print database_encoded + " start time: " + time.strftime("%Y-%m-%d %H:%M:%S")

    try:
        update_sequence()
        while rowsaffected > 0:
            print "Number of Rows Affected: " + str(rowsaffected) + " at " + time.strftime("%H:%M:%S")
            update_sequence()
        else:
            print "Number of Rows Affected: 0"

    except Exception as err:
        e = str(err)
        print database_encoded + ".result.orderedPanelOid was not populated! Process failed! " + e

    # Collect End Time
    print database_encoded + " completion time: " + time.strftime("%Y-%m-%d %H:%M:%S") + "\n"

I would appreciate any help or ideas, as I'm somewhat stuck on both issues. I'm a new DBA, but I have written a few Python scripts so far. MySQL version is 5.6.34, running under Python 2.7.


The query, formatted for readability:

UPDATE %s.result rm
INNER JOIN (
    SELECT rm1.id, rm1.OrderedPanelOid,
           rr1.ResultId, rr1.OrderedPanelId
    FROM %s.result rm1
    INNER JOIN %s.ordered_panel_result_rel rr1 ON rm1.id = rr1.ResultId
    WHERE rm1.OrderedPanelOid IS NULL
    ORDER BY rm1.id
    LIMIT 50000
) AS SOURCE ON rm.id = SOURCE.id
SET rm.OrderedPanelOid = SOURCE.orderedPanelId
WHERE rm.id = SOURCE.ResultId

The CREATE TABLE statements:

CREATE TABLE `result` (
  `id` decimal(30,0) NOT NULL,
  `v` bigint(20) DEFAULT NULL,
  `createStamp` timestamp NULL DEFAULT NULL,
  `modifiedStamp` timestamp NULL DEFAULT NULL,
  `createdBy` decimal(30,0) DEFAULT NULL,
  `modifiedBy` decimal(30,0) DEFAULT NULL,
  `active` tinyint(1) NOT NULL,
  `testData` tinyint(1) NOT NULL DEFAULT '0',
  `testOid` decimal(30,0) DEFAULT NULL,
  `orderedPanelOid` decimal(30,0) DEFAULT NULL,
  `name` varchar(255) DEFAULT NULL,
  `code` varchar(255) DEFAULT NULL,
  `result` longtext,
  `numericResult` double DEFAULT NULL,
  `comment` longtext,
  `description` longtext,
  `status` varchar(100) DEFAULT NULL,
  `units` varchar(255) DEFAULT NULL,
  `refRange` varchar(255) DEFAULT NULL,
  `refRangeLowNumeric` double DEFAULT NULL,
  `refRangeHighNumeric` double DEFAULT NULL,
  `patientOid` decimal(30,0) DEFAULT NULL,
  `labOrderOid` decimal(30,0) DEFAULT NULL,
  `placeOrderNumber` varchar(255) DEFAULT NULL,
  `criticality` varchar(100) DEFAULT NULL,
  `conclusionTypeValueId` decimal(30,0) DEFAULT NULL,
  `runDate` datetime DEFAULT NULL,
  `resultDate` datetime DEFAULT NULL,
  `enteredByOid` decimal(30,0) DEFAULT NULL,
  `signedByOid` decimal(30,0) DEFAULT NULL,
  `row_modified_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `resultLocalDate` datetime DEFAULT NULL,
  `runLocalDate` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_resultDate` (`resultDate`),
  KEY `idx_runDate` (`runDate`),
  KEY `idx_name` (`name`),
  KEY `idx_code` (`code`),
  KEY `idx_patientOid` (`patientOid`),
  KEY `idx_labOrderOid` (`labOrderOid`),
  KEY `idx_placeOrderNumber` (`placeOrderNumber`),
  KEY `idx_orderedPanelOid` (`orderedPanelOid`),
  KEY `idx_enteredByOid` (`enteredByOid`),
  KEY `idx_testOid` (`testOid`),
  KEY `idx_active` (`active`),
  KEY `idx_signedByOid` (`signedByOid`),
  KEY `idx_row_modified_at` (`row_modified_at`),
  KEY `idx_resultLocalDate` (`resultLocalDate`),
  KEY `idx_runLocalDate` (`runLocalDate`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

CREATE TABLE `ordered_panel_result_rel` (
  `id` decimal(30,0) NOT NULL,
  `orderedPanelId` decimal(30,0) NOT NULL,
  `resultId` decimal(30,0) NOT NULL,
  `row_modified_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `id_UNIQUE` (`id`),
  UNIQUE KEY `idx_uqopResultIds` (`orderedPanelId`,`resultId`),
  KEY `SET` (`orderedPanelId`,`resultId`),
  KEY `OPANELRESRELID` (`orderedPanelId`),
  KEY `idx_orderedPanelId` (`orderedPanelId`),
  KEY `idx_resultId` (`resultId`),
  KEY `idx_row_modified_at` (`row_modified_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Best Answer

My guess: you are searching for 50K rows that need fixing, but after you fix one batch, the next batch sits farther along in the table? That scan only gets slower and slower.

The fix is to walk the table by the PRIMARY KEY instead. Grab a chunk of 50K rows by id and fix however many of those 50K need fixing (possibly fewer than 50K). That turns "slower and slower" into a constant, fast pace per chunk.

I discuss the details of this kind of chunking here: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks
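The PK walk described above can be sketched as follows. The table and column names come from the question; the chunk size, SQL strings, and the chunk_ranges helper are illustrative, not a tested implementation:

```python
# Sketch of walking result by PRIMARY KEY in fixed id ranges.
# Each batch touches one contiguous id range, so its cost stays constant
# instead of growing as already-fixed rows pile up at the front of the table.

CHUNK = 10000  # smaller batches are kinder to replication

# Find the id that ends the current chunk (LIMIT offset,1 skips CHUNK rows).
FIND_CHUNK_END = "SELECT id FROM result WHERE id >= %s ORDER BY id LIMIT %s, 1"

# Update only rows inside the current id range that still need fixing.
UPDATE_RANGE = (
    "UPDATE result rm "
    "INNER JOIN ordered_panel_result_rel rr ON rm.id = rr.resultId "
    "SET rm.orderedPanelOid = rr.orderedPanelId "
    "WHERE rm.id >= %s AND rm.id < %s AND rm.orderedPanelOid IS NULL"
)

def chunk_ranges(ids, chunk_size):
    """Yield (first_id, last_id) for each consecutive chunk of sorted ids."""
    ids = list(ids)
    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i + chunk_size]
        yield (chunk[0], chunk[-1])

# With odd ids 1..9 and chunks of 2: (1, 3), (5, 7), (9, 9)
ranges = list(chunk_ranges([1, 3, 5, 7, 9], 2))
```

The driving loop would fetch the current chunk's end id with FIND_CHUNK_END, run UPDATE_RANGE over [start, end), commit, advance start to end, and stop when FIND_CHUNK_END returns no row, i.e. the last (possibly short) chunk has been processed.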

Other issues...

  • 50K in a single chunk may be too much, especially with replication in play.
  • Why DECIMAL(30,0)? It is big (14 bytes) and probably far bigger than needed. See INT and friends.
  • Do you have multiple "identical" DATABASEs? Unless you are hosting some kind of service, that is usually poor schema design.
  • Single-column indexes on flags (e.g. active) are rarely useful.
  • A PRIMARY KEY is, by definition (in MySQL), UNIQUE; don't index it a second time.
  • If ordered_panel_result_rel is a many-to-many relationship table, see the tips at http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
  • Both tables have a PRIMARY KEY(id) that is not AUTO_INCREMENT; are you setting the values manually?

Regarding "mysql - Python script updating MySQL reports too many affected rows", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47204988/
