
python - Trouble with a join statement in SQLite

Reposted. Author: IT王子. Updated: 2023-10-29 06:32:14

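I have two data files that I'm working with. One contains a list of words along with some additional information about those words; the other contains word pairs (where the words are listed by their word IDs from the first table) and their frequencies.

Lexicon file (sample output)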

('wID', 'w1', 'w1cs', 'L1', 'c1')
('-----', '-----', '-----', '-----', '-----')
(1, ',', ',', ',', 'y')
(2, '.', '.', '.', 'y')
(3, 'the', 'the', 'the', 'at')
(4, 'and', 'and', 'and', 'cc')
(5, 'of', 'of', 'of', 'io')

Bigram file (sample output)

('freq', 'w1', 'w2')
(4, 22097, 161)
(1, 98664, 1320)
(1, 426515, 1345)
(1, 483675, 747)
(19, 63, 15496)
(2, 3011, 7944)
(1, 27985, 27778)

I created two tables using SQLite and loaded the data from the files above.

import sqlite3

conn = sqlite3.connect('bigrams.db')
conn.text_factory = str
c = conn.cursor()
c.execute('pragma foreign_keys=ON')

Lexicon table

c.execute('''CREATE TABLE lex
(wID INT PRIMARY KEY, w1 TEXT, w1cs TEXT, L1 TEXT, c1 TEXT)''')

#I removed this index as per CL.'s suggestion
#c.execute('''DROP INDEX IF EXISTS lex_index''')
#c.execute('''CREATE INDEX lex_index ON lex (wID, w1, c1)''')

#and added this one
c.execute('''CREATE INDEX lex_w1_index ON lex (w1)''')

Inserting data into the lexicon table

#I replaced this code
# with open('/Users/.../lexicon.txt', "rb") as lex_file:
#     for line in lex_file:
#         currentRow = line.split('\t')
#         try:
#             data = [currentRow[0], currentRow[1], currentRow[2], currentRow[3], str(currentRow[4].strip('\r\n'))]
#             c.executemany ('insert or replace into lex values (?, ?, ?, ?, ?)', (data,))
#         except IndexError:
#             pass


#with the one that Julian wrote

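blocksize = 100000

with open('/Users/.../lexicon.txt', "rb") as lex_file:
    data = []
    line_counter = 0
    for line in lex_file:
        data.append(line.strip().split('\t'))
        line_counter += 1
        if line_counter % blocksize == 0:
            try:
                c.executemany('insert or replace into lex values (?, ?, ?, ?, ?)', data)
                conn.commit()
            # a row with the wrong number of fields raises sqlite3.ProgrammingError, not IndexError
            except sqlite3.Error:
                block_start = line_counter - blocksize + 1
                print('Lex error lines {}-{}'.format(block_start, line_counter))
            finally:
                data = []
    # insert the rows left over after the last full block
    if data:
        c.executemany('insert or replace into lex values (?, ?, ?, ?, ?)', data)
        conn.commit()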

Bigram table

#I replaced this code to create table x2 
#c.execute('''CREATE TABLE x2
# (freq INT, w1 INT, w2 INT, FOREIGN KEY(w1) REFERENCES lex(wID), FOREIGN KEY(w2) REFERENCES lex(wID))''')

#with the code that Julian suggested
c.execute('''CREATE TABLE x2
(freq INT, w1 INT, w2 INT,
FOREIGN KEY(w1) REFERENCES lex(wID),
FOREIGN KEY(w2) REFERENCES lex(wID),
PRIMARY KEY(w1, w2) )''')

Inserting data into the bigram table

#Replaced this code
#with open('/Users/.../x2.txt', "rb") as x2_file:
#    for line in x2_file:
#        currentRow = line.split('\t')
#        try:
#            data = [str(currentRow[0].replace('\x00','').replace('\xff\xfe','')), str(currentRow[1].replace('\x00','')), str(currentRow[2].replace('\x00','').strip('\r\n'))]
#            c.executemany('insert or replace into x2 values (?, ?, ?)', (data,))
#        except IndexError:
#            pass

#with this one suggested by Julian
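with open('/Users/.../x2.txt', "rb") as x2_file:
    data = []
    line_counter = 0
    for line in x2_file:
        # drop the NUL bytes and the byte-order mark before splitting on tabs
        data.append(line.strip().replace('\x00','').replace('\xff\xfe','').split('\t'))
        line_counter += 1
        if line_counter % blocksize == 0:
            try:
                c.executemany('insert or replace into x2 values (?, ?, ?)', data)
                conn.commit()
            # a row with the wrong number of fields raises sqlite3.ProgrammingError, not IndexError
            except sqlite3.Error:
                block_start = line_counter - blocksize + 1
                print('x2 error lines {}-{}'.format(block_start, line_counter))
            finally:
                data = []
    # insert the rows left over after the last full block
    if data:
        c.executemany('insert or replace into x2 values (?, ?, ?)', data)
        conn.commit()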

conn.close()

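I want to be able to check whether a given word pair exists in the data, for example "like new".

When I specify only the first word, the program works fine.

cur.execute('''SELECT lex1.w1, lex2.w1 from x2
               INNER JOIN lex as lex1 ON lex1.wID=x2.w1
               INNER JOIN lex as lex2 ON lex2.wID=x2.w2
               WHERE lex1.w1 = 'like' ''')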

But when I want to search for a pair of words, the code is painfully slow.

cur.execute('''SELECT lex1.w1, lex2.w1 from x2
               INNER JOIN lex as lex1 ON lex1.wID=x2.w1
               INNER JOIN lex as lex2 ON lex2.wID=x2.w2
               WHERE lex1.w1 = 'like' AND lex2.w1 = 'new' ''')

I don't know what I'm doing wrong. Any help would be much appreciated.

Best answer

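EXPLAIN QUERY PLAN shows that the database first scans the x2 table and then, for each x2 row, looks up the corresponding lex rows and checks whether the words match. The lex lookups are done with a temporary index, but performing this lookup twice for every row in x2 still makes the whole query slow.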
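To check this on your own data, the plan can be read from Python. The following is a minimal sketch, assuming Python 3 and a tiny in-memory database with made-up rows standing in for bigrams.db; the x2 table here deliberately has no index. The wording of the plan differs between SQLite versions, so the sketch only prints the last column of each plan row, which holds the description.

```python
import sqlite3

# In-memory stand-in for bigrams.db; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lex (wID INT PRIMARY KEY, w1 TEXT, w1cs TEXT, L1 TEXT, c1 TEXT);
    CREATE TABLE x2 (freq INT, w1 INT, w2 INT);
    INSERT INTO lex VALUES (1, 'like', 'like', 'like', 'ii');
    INSERT INTO lex VALUES (2, 'new', 'new', 'new', 'jj');
    INSERT INTO x2 VALUES (7, 1, 2);
""")

query = """
    SELECT lex1.w1, lex2.w1 FROM x2
    INNER JOIN lex AS lex1 ON lex1.wID = x2.w1
    INNER JOIN lex AS lex2 ON lex2.wID = x2.w2
    WHERE lex1.w1 = ? AND lex2.w1 = ?
"""

# Prefixing a statement with EXPLAIN QUERY PLAN returns the plan instead of data.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("like", "new")).fetchall()

# The human-readable description is the last column of every plan row.
plan_details = [row[-1] for row in plan]
for detail in plan_details:
    print(detail)

# The query itself still returns the matching pair.
pairs = conn.execute(query, ("like", "new")).fetchall()
```

A line that starts with SCAN means the whole table is read, while SEARCH means an index or key lookup is used.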

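If the database could first look up the IDs of the two words and then search x2 for rows with these two IDs, the query would be fast. This requires some new indexes. (The lex_index index is useful only for lookups that start with the wID column, and such lookups would probably use the primary key index anyway.)

You need to create an index that allows searching on w1: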

CREATE INDEX lex_w1_index ON lex(w1);

To find any x2 rows that contain these two word IDs, you need an index that has these two columns in the leftmost positions:

CREATE INDEX x2_w1_w2_index ON x2(w1, w2);

Alternatively, make these two columns the primary key (see Julian's answer).
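The "leftmost position" requirement can be seen by comparing query plans. This is a small sketch, assuming Python 3 and an empty in-memory x2 table without a primary key, so that x2_w1_w2_index is the only index available; the helper plan_for is a hypothetical name introduced here. Exact plan wording varies between SQLite versions, so the sketch only checks whether the index name appears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x2 (freq INT, w1 INT, w2 INT)")
conn.execute("CREATE INDEX x2_w1_w2_index ON x2 (w1, w2)")


def plan_for(sql):
    """Return the EXPLAIN QUERY PLAN descriptions of sql joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Both columns, or only the leading column w1: the index can be searched.
plan_both = plan_for("SELECT freq FROM x2 WHERE w1 = 3 AND w2 = 5")
plan_first = plan_for("SELECT freq FROM x2 WHERE w1 = 3")

# Only the second column w2: the index does not start with it,
# so the table is scanned instead.
plan_second = plan_for("SELECT freq FROM x2 WHERE w2 = 5")

for label, plan in [("w1 and w2", plan_both), ("w1 only", plan_first), ("w2 only", plan_second)]:
    print(label, "->", plan)
```

An index on (w1, w2) therefore covers the pair lookup and lookups by first word, but a search by second word alone would need its own index.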


To force the database to do the word ID lookups first, you could move them into subqueries:

SELECT freq
FROM x2
WHERE w1 = (SELECT wID FROM lex WHERE w1 = 'like')
AND w2 = (SELECT wID FROM lex WHERE w1 = 'new')

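However, this is not necessary; with the new indexes, the optimizer should be able to find the best query plan automatically. (But you can still use this query if you think it is more readable.)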
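From Python, the subquery form can be wrapped in a small helper that takes the two words as parameters. This is only a sketch, assuming Python 3; the function name bigram_freq and the sample rows are made up for illustration, and the schema follows the tables and indexes described above.

```python
import sqlite3


def bigram_freq(conn, first, second):
    """Return the stored frequency of the pair (first, second), or None if absent."""
    row = conn.execute(
        """
        SELECT freq
        FROM x2
        WHERE w1 = (SELECT wID FROM lex WHERE w1 = ?)
          AND w2 = (SELECT wID FROM lex WHERE w1 = ?)
        """,
        (first, second),
    ).fetchone()
    return row[0] if row else None


# Tiny in-memory database with invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lex (wID INT PRIMARY KEY, w1 TEXT, w1cs TEXT, L1 TEXT, c1 TEXT);
    CREATE INDEX lex_w1_index ON lex (w1);
    CREATE TABLE x2 (freq INT, w1 INT, w2 INT,
                     FOREIGN KEY(w1) REFERENCES lex(wID),
                     FOREIGN KEY(w2) REFERENCES lex(wID),
                     PRIMARY KEY(w1, w2));
    INSERT INTO lex VALUES (1, 'like', 'like', 'like', 'ii');
    INSERT INTO lex VALUES (2, 'new', 'new', 'new', 'jj');
    INSERT INTO x2 VALUES (7, 1, 2);
""")

like_new = bigram_freq(conn, "like", "new")   # pair is stored
new_like = bigram_freq(conn, "new", "like")   # reversed pair is not stored
missing = bigram_freq(conn, "like", "xyzzy")  # second word is not in lex
print(like_new, new_like, missing)
```

Passing the words as `?` parameters avoids quoting problems. One caveat: a scalar subquery uses only the first row it finds, so if the same word appears in lex under several wIDs, the comparison would have to use IN instead of =.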

Regarding python - Trouble with a join statement in SQLite, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24841445/
