
python - Poor bulk insert performance with Python 3 and SQLite


I have several text files containing URLs. I am trying to create a SQLite database to store these URLs in a table. The URL table has two columns: a primary key (INTEGER) and the URL (TEXT).

I try to insert 100,000 entries in one insert command and loop until the URL list is exhausted. Basically, I read the contents of all the text files into a list, then take smaller lists of 100,000 entries at a time and insert them into the table.
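
A simplified sketch of this approach (identifiers and file handling here are illustrative, not the exact code from the repository linked further below):

    import sqlite3

    # Read all URLs into one list, then insert them in chunks of
    # 100,000 rows per executemany() call.
    CHUNK_SIZE = 100_000

    def load_urls(paths):
        # Read every non-empty line of every text file into one big list.
        urls = []
        for path in paths:
            with open(path, encoding="utf-8") as fh:
                urls.extend(line.strip() for line in fh if line.strip())
        return urls

    def insert_in_chunks(db_path, urls):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS blacklist ("
                     "id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, "
                     "url TEXT NOT NULL UNIQUE)")
        for start in range(0, len(urls), CHUNK_SIZE):
            chunk = urls[start:start + CHUNK_SIZE]
            conn.executemany("INSERT INTO blacklist (url) VALUES (?)",
                             ((u,) for u in chunk))
            conn.commit()
        conn.close()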

The total number of URLs across the text files is 4,591,415, and the total size of the text files is about 97.5 MB.

Problems:

  1. When I use a file-based database, the insert takes about 7-7.5 minutes. That does not feel like a very fast insert, given that I have a solid-state drive with fast read/write speeds. On top of that, I have about 10 GB of free RAM as shown in Task Manager. The processor is an i5-6300U at 2.4 GHz.

  2. The text files total roughly 97.5 MB, but after I insert the URLs into SQLite, the database is about 350 MB, almost 3.5 times the size of the original data. Since the database contains no other tables, indexes, etc., this size looks a bit odd.

For problem 1, I experimented with PRAGMA settings and arrived at the best combination based on test runs with different values (a sketch of how these PRAGMAs are set from Python follows the table below).

| Configuration (batch size and PRAGMAs) | Time (h:mm:ss) |
| --- | --- |
| 50,000 - with journal = delete and no transaction | 0:12:09.888404 |
| 50,000 - with journal = delete and with transaction | 0:22:43.613580 |
| 50,000 - with journal = memory and transaction | 0:09:01.140017 |
| 50,000 - with journal = memory | 0:07:38.820148 |
| 50,000 - with journal = memory and synchronous=0 | 0:07:43.587135 |
| 50,000 - with journal = memory and synchronous=1 and page_size=65535 | 0:07:19.778217 |
| 50,000 - with journal = memory and synchronous=0 and page_size=65535 | 0:07:28.186541 |
| 50,000 - with journal = delete and synchronous=1 and page_size=65535 | 0:07:06.539198 |
| 50,000 - with journal = delete and synchronous=0 and page_size=65535 | 0:07:19.810333 |
| 50,000 - with journal = wal and synchronous=0 and page_size=65535 | 0:08:22.856690 |
| 50,000 - with journal = wal and synchronous=1 and page_size=65535 | 0:08:22.326936 |
| 50,000 - with journal = delete and synchronous=1 and page_size=4096 | 0:07:35.365883 |
| 50,000 - with journal = memory and synchronous=1 and page_size=4096 | 0:07:15.183948 |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 | 0:07:13.402985 |
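
For reference, a combination like the ones above can be applied from Python roughly as follows (a sketch, not the exact test harness; note that SQLite only accepts power-of-two page sizes between 512 and 65536, and page_size only takes effect on a new database or after VACUUM):

    import sqlite3

    # Sketch: apply one of the tested PRAGMA combinations before bulk inserting.
    conn = sqlite3.connect("urls.db")
    conn.execute("PRAGMA journal_mode = MEMORY")  # also tested: DELETE, WAL
    conn.execute("PRAGMA synchronous = 1")        # 0 = OFF, 1 = NORMAL, 2 = FULL
    conn.execute("PRAGMA page_size = 65536")      # power of two; effective only on a fresh/vacuumed database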

I searched online and came across this link https://adamyork.com/2017/07/02/fast-database-inserts-with-python-3-6-and-sqlite/ where the system is much slower than mine yet still performs very well. Two things stand out in that link:

  1. The table in the link has more columns than mine.
  2. The database file did not grow to 3.5 times the data size.

I have shared the Python code and files here: https://github.com/ksinghgithub/python_sqlite

Can someone guide me on optimizing this code? Thank you.

Environment:

  1. Windows 10 Professional on an i5-6300U, 20 GB RAM, and a 512 GB SSD.
  2. Python 3.7.0

Edit 1: New performance table based on the feedback received about the UNIQUE constraint and my experiments with different cache_size values (a sketch of the settings follows the table).

self.db.execute('CREATE TABLE blacklist (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL UNIQUE)')

| Configuration (batch size and PRAGMAs) | Action | Time (h:mm:ss) | Notes |
| --- | --- | --- | --- |
| 50,000 - with journal = delete and synchronous=1 and page_size=65535, cache_size = 8192 | REMOVE UNIQUE FROM URL | 0:00:18.011823 | Size reduced to 196 MB from 350 MB |
| 50,000 - with journal = delete and synchronous=1 and page_size=65535, cache_size = default | REMOVE UNIQUE FROM URL | 0:00:25.692283 | Size reduced to 196 MB from 350 MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 | | 0:07:13.402985 | |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535, cache_size = 4096 | | 0:04:47.624909 | |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535, cache_size = 8192 | | 0:03:32.473927 | |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535, cache_size = 8192 | REMOVE UNIQUE FROM URL | 0:00:17.927050 | Size reduced to 196 MB from 350 MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535, cache_size = default | REMOVE UNIQUE FROM URL | 0:00:21.804679 | Size reduced to 196 MB from 350 MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535, cache_size = default | REMOVE UNIQUE FROM URL & ID | 0:00:14.062386 | Size reduced to 134 MB from 350 MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535, cache_size = default | REMOVE UNIQUE FROM URL & DELETE ID | 0:00:11.961004 | Size reduced to 134 MB from 350 MB |
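
For context, the cache_size setting and the schema variants compared above look roughly like this (an illustrative sketch; the actual test code is in the GitHub repository linked earlier):

    import sqlite3

    conn = sqlite3.connect("urls.db")
    conn.execute("PRAGMA cache_size = 8192")  # pages kept in the page cache (also tested: default, 4096)

    # Original schema: the UNIQUE constraint creates an implicit index on url,
    # which is what the "REMOVE UNIQUE FROM URL" rows drop.
    conn.execute("CREATE TABLE IF NOT EXISTS blacklist ("
                 "id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, "
                 "url TEXT NOT NULL UNIQUE)")

    # Variants tested above, without UNIQUE (and, in the last rows, without the explicit id column):
    # CREATE TABLE blacklist (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL)
    # CREATE TABLE blacklist (url TEXT NOT NULL)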

Best Answer

SQLite uses autocommit mode by default. That lets you omit BEGIN TRANSACTION, but here we want all the inserts to run in a single transaction, and the only way to do that is to start one with BEGIN TRANSACTION so that every statement that follows runs inside it.

The executemany method is just an execute loop carried out outside of Python, with the SQLite prepare-statement function called only once.
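
For example, something along these lines runs a whole chunk inside one explicit transaction (a sketch; the connection and table names are illustrative):

    import sqlite3

    # isolation_level=None puts the sqlite3 module in autocommit mode, so the
    # BEGIN/COMMIT below are the only transaction boundaries.
    conn = sqlite3.connect("urls.db", isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS blacklist ("
                 "id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, "
                 "url TEXT NOT NULL UNIQUE)")

    rows = [("http://example.com/1",), ("http://example.com/2",)]

    conn.execute("BEGIN")
    conn.executemany("INSERT INTO blacklist (url) VALUES (?)", rows)
    conn.execute("COMMIT")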

The following is a very bad way to remove the last N items from a list:

    # pop the last N URLs one at a time: N Python-level pop/append calls
    templist = []
    i = 0
    while i < self.bulk_insert_entries and len(urls) > 0:
        templist.append(urls.pop())
        i += 1

It is better to do this:

    # take the last N URLs with one slice and delete them in one operation
    templist = urls[-self.bulk_insert_entries:]
    del urls[-self.bulk_insert_entries:]
    i = len(templist)

Slicing and deleting a slice even work on an empty list.

Both probably have the same complexity, but 100K calls to append and pop cost far more than letting Python do the work outside the interpreter.
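
A rough way to see the difference is to time both variants on a comparable list (a micro-benchmark sketch, not the timings from the question; the list copy is included in both measurements, so it affects them equally):

    import timeit

    N = 100_000
    BASE = list(range(1_000_000))

    def pop_loop():
        # N individual pop/append calls
        urls = BASE.copy()
        templist = []
        i = 0
        while i < N and len(urls) > 0:
            templist.append(urls.pop())
            i += 1
        return templist

    def slice_del():
        # one slice plus one slice deletion
        urls = BASE.copy()
        templist = urls[-N:]
        del urls[-N:]
        return templist

    print("append/pop loop:", timeit.timeit(pop_loop, number=10))
    print("slice + del:    ", timeit.timeit(slice_del, number=10))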

On "python - Poor bulk insert performance with Python 3 and SQLite", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52451119/
